BPA-OTU is a web-based portal into Operational Taxonomic Unit (OTU) data, developed to access data from the Australian Microbiome.
BPA-OTU uses `ckanapi` to fetch data from CKAN (e.g. sample site images and sample metagenome data). For this reason the docker containers (at least `runserver` and `celeryworker`) need to run with a valid `CKAN_API_KEY` environment variable (see `./.env_local` and `./docker-compose.yml`).

BPA-OTU also depends on `bpa-ingest` (maintained externally). The version of `bpa-ingest` used is maintained in the `runtime-requirements.txt` file.
When updating the AM metadata schema, the `bpa-ingest` repository requires changes. These changes will be associated with a git tag by the `bpa-ingest` team for the new version. The entry in `runtime-requirements.txt` must be updated to use the version at this new tag.
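For example, the pinned entry in `runtime-requirements.txt` might take a form like the following (the repository URL and tag here are placeholders; keep the form of the existing entry and use the tag published by the `bpa-ingest` team):

```
git+https://github.com/<org>/bpa-ingest.git@<new-tag>
```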
Note: this dependency was previously handled as a git submodule.

The project root (`./`) is mounted into the container as a volume, which means that Django will monitor all of its `*.py` files and restart when they are updated outside of the container.

The newer `docker compose` (the Docker CLI plugin) does not seem to work with the `docker-compose-build.yml` file, but the older executable (`docker-compose`) does work.

Generate `./.env_local`. This should contain `KEY=value` lines; see `./.env` for the available keys. It must have a valid `CKAN_API_KEY` so that site images and sample metagenome data can be fetched during development. You can use your personal `CKAN_API_KEY` in the development environment; this key can be found on the profile page after logging in to the bioplatforms.com data portal.
Note that `.env_local` is used to supply environment variables to the backend running in a docker container. Don't confuse this with the various `.env.*` files that can be used by React to supply environment variables to the frontend. In particular, the only purpose of `./.env` is to document the available keys for manual generation of `./.env_local`.
Ensure that the other keys have a value set so the page will work (dummy values are fine). In particular, `CKAN_DEVEL_USER_EMAIL` and `BPAOTU_AUTH_SECRET_KEY` need values, and possibly others.
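A minimal `./.env_local` might look like this (all values are placeholders; your real `CKAN_API_KEY` comes from your profile page on the bioplatforms.com data portal, and other keys may be needed — see `./.env`):

```
CKAN_API_KEY=<your-personal-ckan-api-key>
CKAN_DEVEL_USER_EMAIL=dev@example.com
BPAOTU_AUTH_SECRET_KEY=dummy-secret-key
```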
Build the docker images:

```shell
docker-compose -f docker-compose-build.yml build base dev
```

Start all of the containers:

```shell
docker-compose up
```

There are 4 containers: `runserver`, `db`, `cache`, `celeryworker`.

If the local machine already has a PostgreSQL server instance it will need to be stopped, since the ports will conflict (`sudo service postgresql stop`).
This will start the docker containers attached to the current terminal process. If you want the containers to keep running after closing the terminal, start them with the `-d` argument:

```shell
docker-compose up -d
```

and then manage the containers with the usual docker commands (`docker-compose ps`, `docker-compose stop`, `docker-compose start`).
Once the BE is operational it's possible to do a data ingest. This is described in detail in the Input data description section. For quick reference:

`/path/to/bpaotu` is the app root (i.e. where `docker-compose.yml` is).

Extract the ingest archive to `/path/to/bpaotu/data/dev`:

```shell
tar -zxvf </path/to/dataarchive.tar.gz> -C /path/to/bpaotu/data/dev
```

Update the sample contextual database for the import:

```shell
cp /path/to/bpaotu/data/dev/$ingest_dir/db/AM_db_* /path/to/bpaotu/data/dev/amd-metadata/amd-samplecontextual/
```

Run the `otu_ingest` management task on the app container:

```shell
docker-compose exec runserver bash
/app/docker-entrypoint.sh django-admin otu_ingest $ingest_dir $yyyy-mm-dd --use-sql-context --no-force-fetch
```

where `$ingest_dir` is the directory of the extracted ingest archive (note: tab completion will work here) and `$yyyy-mm-dd` is the date of the ingest (i.e. today's date).
These steps are performed in a separate terminal, i.e. not in the container, and from the `frontend/` directory.

Install node. The required version is listed in `frontend/package.json` under the `"engines"` property. Using `nvm` (Node Version Manager), install it with `nvm install x.y.z`. There is also a file in the `frontend/` directory called `.nvmrc` that specifies the version of node to be used for this project, in the event that the local system has multiple versions of node.

Install yarn:

```shell
npm install -g yarn
```

Install the node modules for the web app by running `yarn install`.

Start the React frontend:

```shell
yarn start
```
BPA-OTU loads input data to generate a PostgreSQL schema named `otu`. The importer functionality completely erases all previously loaded data.

Three categories of file are ingested:

- contextual metadata (`.xlsx` for an Excel file [default] or `.db` for a SQLite DB)
- taxonomy files (`.taxonomy`)
- abundance files (`.txt`)

Note that `/data/dev` is a mount point in a Docker container. See `./docker-compose.yml`.
By default the contextual metadata will be downloaded during the ingest operation, or it can be provided as either a SQLite database or an Excel spreadsheet:

```
./data/dev/amd-metadata/amd-samplecontextual/*.db   # SQLite database
./data/dev/amd-metadata/amd-samplecontextual/*.xlsx # Excel spreadsheet
```

See "Additional arguments" below for more context on these.
Abundance and taxonomy files must be placed under a base directory for the particular ingest, `$dir`, which is under the mount point for the Docker container, structured as follows:

```
./data/dev/$dir/$amplicon_code/*.txt.gz
./data/dev/$dir/$amplicon_code/*.$classifier_db.$classifier_method.taxonomy.gz
```

`$classifier_db` and `$classifier_method` describe the database and method used to generate a given taxonomy. They can be arbitrary strings.
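As an illustration of this naming convention, here is a small Python sketch (a hypothetical helper, not part of bpaotu) that extracts `$classifier_db` and `$classifier_method` from a taxonomy file name:

```python
import re

# Matches the tail of *.$classifier_db.$classifier_method.taxonomy.gz
# (a sketch of the naming convention only; not part of bpaotu)
TAXONOMY_RE = re.compile(r"\.([^.]+)\.([^.]+)\.taxonomy\.gz$")

def classifier_parts(filename):
    """Return (classifier_db, classifier_method) from a taxonomy file name."""
    m = TAXONOMY_RE.search(filename)
    if m is None:
        raise ValueError("not a taxonomy file: %s" % filename)
    return m.group(1), m.group(2)

# e.g. classifier_parts("16S_foo.silva132.SKlearn.taxonomy.gz")
# -> ("silva132", "SKlearn")
```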
The ingest is then run as a Django management command. To run this you will need to shell into the `runserver` container:

```shell
cd ~/bpaotu # or wherever docker-compose.yml lives
# either this
docker-compose exec runserver bash
# or this
docker exec -it bpaotu_runserver_1 bash

## Either ingest using a local sqlite db file for contextual metadata...
/app/docker-entrypoint.sh django-admin otu_ingest $dir $yyyy_mm_dd --use-sql-context --no-force-fetch
## or download contextual metadata and use that:
/app/docker-entrypoint.sh django-admin otu_ingest $dir $yyyy_mm_dd
```
If `docker-compose exec runserver bash` does not work, then find the id of the container with `docker container ls` (the system will need to be running for this to work, i.e. with `docker-compose up`) and then run `docker exec -it 2361ab2339af bash` (the container id will be different for the reader).
`$dir` is the base directory for the abundance and taxonomy files.

`$yyyy_mm_dd` is the ingest date, e.g. 2022-01-01.
Example usage:

Get the data file, unarchive and copy the data to `./data/dev`, and ingest the data using a particular date:

```shell
cd ./data/dev
tar -xvzf </path/to/dataarchive.tar.gz>
cd ~/bpaotu # or wherever docker-compose.yml lives
docker-compose exec runserver bash
/app/docker-entrypoint.sh django-admin otu_ingest AM_data_db_submit_202303211107/ 2023-11-29 --use-sql-context --no-force-fetch
```
Additional arguments:
NOTE: the order is important if supplying both of these arguments
This file describes sample-specific metadata. The current schema of the contextual metadata can be found here.
A gzip-compressed tab-delimited file with extension `.taxonomy.gz`.

The first row of this file must contain a header. The required header fields are:

```
#OTU ID\tkingdom\tphylum\tclass\torder\tfamily\tgenus\tspecies\tamplicon\ttraits
```

or

```
#OTU ID\tkingdom\tsupergroup\tdivision\tclass\torder\tfamily\tgenus\tspecies\tamplicon\ttraits
```
Each column value is an arbitrary character string, with the following restrictions:

NB: taxonomic ranks must be forward-filled with the last known field assignment if empty (e.g. `d__bacteria`, `d__bacteria_unclassified`, `d__bacteria_unclassified`, `d__bacteria_unclassified`, `d__bacteria_unclassified`, `d__bacteria_unclassified`, `d__bacteria_unclassified`)
Example:
hou098@terrible-hf:~/bpaotu$ zcat data/dev/202203050842/16S/16S_PWSW_seqs_listSET_OTU_taxon_20220304_withAMPLICON_FAPROTAXv124.silva132.SKlearn.taxonomy.gz | head -4
#OTU ID confidence kingdom phylum class order family genus species amplicon traits
GATTGGCTCACGGACGCAAAACCACCAAAAAACACGTGACGTTACTGGTTGTCCGTCCTTTTGGTTTTTTTGCCCTTCTATGGTAATGCTATGAGTGCTTTTTGCAAAATGCTGCTCTGGGATTCGCTCCCGAACGCAACGCGCTACCTATTACTACTATCATAATTACATCACGCAAATTCAGGAGCTCATCAATGGTGAGCCAGCCAAGTTCATTCAAGATAGGTGAAATATGATCAAATTTCTTAGTATTAGTCAAAATACGGGCAGCAAAATTTTGTATAAGTTGTAGTTTATGAACATTATCCTTTGAAGTCCCAGACCATACAGTAGAACAGTAAAATAATTTACTAAAAACTAGTGAATTCAAAATGGTGTTCAATACCTCTCTAGAAAATAGGTGACGGACTCTATTTACTTGACATAAAGTAGATAAAAGGGAAGAACTAAGTGATGTAACGTAGTCATTAAAGTTAAAGTTCGAGTCTAGCAGAAGCCACGGGTTTTAACTCTTGACCAAGAAAAGGCACAGTGACATCTGGGAGCTGAGATAGGAGCTGTCTTACTCCGAA 0.4340600531226606 d__Unassigned d__Unassigned_unclassified d__Unassigned_unclassified d__Unassigned_unclassified d__Unassigned_unclassified d__Unassigned_unclassified d__Unassigned_unclassified 27f519r_bacteria
AACGAACGCCGGCGGCGTGCTTAACACATGCAAGTCGAACGCGAAAGCCTGGGCAACTGGGCGAGTAGAGTGGCGAACGGGTGAGTAATACGTGAGTAACCTGCCCTTGAGTGGGGAATAACTCCTCGAAAGGGGAGCTAATACCGCATAAGACCACGACCCCGATGGGAGTTGCGGTCAAAGGTGGCCTCATGCACCAGAGCGTTTGGGCACAGATTCTGCGTGCCGGAAAAGAATCTGTACCCCAGCGCTTTGTCAGTGAAGCTATCGCTTGAGGAGGGGCTCGCGGCCCATCAGCTAGTTGGTAGGGTAATGGCCTACCAAGGCGACGACGGGTAGCTGGTCTGAGAGGACGACCAGCCACACGGGAATTGAGAGACGGTCCCGACTCCTACGGGAGGCAGCAGTGGGGAATCTTGGGCAATGGGGGAAACCCTGACCCAGCGACGCCGCGTGGGGGATGAAGGCCTTCGGGTTGTAAACCCCTGTTCGGTGGGACGAACATCTTCCCATGAACAGTGGGAAGATTTGACGGTACCACCAGAGTAAGCCCCGGCTAACTCCGTGC 0.9999802845765206 d__Bacteria d__Bacteria_unclassified d__Bacteria_unclassified d__Bacteria_unclassified d__Bacteria_unclassified d__Bacteria_unclassified d__Bacteria_unclassified 27f519r_bacteria
GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGTAACAGGCTTTCACTGTTTACTGCTCTTCTTTCGATATGGAGCAAAGGTTTTCCAAACCTTATTCCTAACGGAGGAGTATCATCTCGTACTTTGACCTAGTCAAGATACGAAATGTAGAGAAGTGAAGAGTGAAAGTGCTGACGAGTGGCGGACGGCTGAGTAACGCGTGGGAACGTGCCCCAAAGTGAGGGATAAGCACCGGAAACGGTGTCTAATACCGCATATGATCTTCGGATTAAAGCAGAAATGCGCTTTGGGAGCGGCCCGCGTTGGATTAGGTAGTTGGTGAGGTAAAGGCTCACCAAGCCGACGATCCATAGCTGGTCTGAGAGGATGACCAGCCAGACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTCCACAATGGGCGAAAGCCTGATGGAGCAACGCCGCGTGCAGGATGAAGGCCTTAGGGTCGTAAACTGCTTTTATTAGTGAGGAATATGACGGTAACTAATGAATAAGGGTCGGCTAACTACGTGC 0.8979041295444753 d__Bacteria p__Patescibacteria c__Saccharimonadia o__Saccharimonadales f__Saccharimonadales g__Saccharimonadales g__Saccharimonadales_unclassified 27f519r_bacteria
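The forward-fill rule above can be sketched in Python as follows (a minimal illustration of the rule only, not the bpaotu implementation):

```python
def forward_fill_ranks(ranks):
    """Forward-fill empty taxonomy ranks with the last known assignment,
    suffixed with "_unclassified", per the taxonomy file format's NB.
    A minimal sketch; not the bpaotu implementation."""
    filled = []
    last = None
    for rank in ranks:
        if rank:
            last = rank
            filled.append(rank)
        elif last is None:
            filled.append(rank)  # nothing known yet; leave as-is
        elif last.endswith("_unclassified"):
            filled.append(last)  # suffix already present, repeat verbatim
        else:
            filled.append(last + "_unclassified")
    return filled
```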
A gzip-compressed tab-delimited file with the extension `.txt.gz`.

The first row is a header, with the following format:

```
#OTU ID\tSample_only\tAbundance\tAbundance_20K
```

Each column has the following format:

- `#OTU ID`: text string, corresponding to the strings in the taxonomy file
- `Sample_only`: the identifier for the sample ID for which this row specifies abundance
- `Abundance` (floating point): the abundance of the OTU in the sample
- `Abundance_20K` (integer): the abundance of the OTU in the sample after randomly sub-sampling 20,000 reads

Missing values for `Abundance` or `Abundance_20K` are indicated by empty strings. `Abundance` can be the last field on the line if `Abundance_20K` is missing.
Example:

```
#OTU ID Sample_only Abundance Abundance_20K
AAAAGAAGTAAGTAGTCTAACCGCAAGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGG 21646 17
AAAAGAAGTAAGTAGTCTAACCGTTTACGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGG 21653 14
AAAAGAAGTAGATAGCTTAACCTTCGGGAGGGCGTTTACCACTTTGTGATTCATGACTGGGG 21644 70 2
```
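A Python sketch of parsing one abundance row per the format described above (a hypothetical helper for illustration, not the bpaotu parser):

```python
def parse_abundance_line(line):
    """Parse one tab-delimited abundance row into a dict.
    Abundance and Abundance_20K may be missing (an empty string,
    or simply absent at the end of the line).
    A sketch of the documented format, not the bpaotu parser."""
    fields = line.rstrip("\n").split("\t")
    otu_id, sample_only = fields[0], fields[1]
    abundance = float(fields[2]) if len(fields) > 2 and fields[2] else None
    abundance_20k = int(fields[3]) if len(fields) > 3 and fields[3] else None
    return {
        "otu_id": otu_id,
        "sample_only": sample_only,
        "abundance": abundance,
        "abundance_20k": abundance_20k,
    }
```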
To generate an SVG diagram of the database schema, install the `postgresql-autodoc` and `graphviz` packages (Ubuntu), and then:

```shell
PGPASSWORD=$db_password postgresql_autodoc -d webapp -h localhost -u webapp -s otu
dot -Tsvg webapp.dot > webapp.svg
```
Start a bash terminal on the db container and log into psql with the webapp role:

```shell
psql -U webapp
```

Then set the search path to the `otu` schema at the psql prompt:

```sql
SET search_path TO otu;
```
There is a script to test the output of the OTU and Contextual Download feature. It counts and displays the number of unique OTU hashes in the OTU.fasta file, the number of unique Sample IDs in the contextual.csv file, and, for each domain .csv file, the number of unique OTU hashes and unique Sample IDs. The results can then be inspected to ensure they are as expected for the given search.

To run, download a search, extract the results to a directory, `cd` to that directory and run the script:

```shell
. /path/to/bpaotu/test/verify-otu-contextual-export.sh
```
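The same counts can be sketched in Python (the column name passed for the contextual file is an assumption here; use the actual header from the contextual.csv export):

```python
import csv

def unique_fasta_ids(fasta_path):
    """Count distinct sequence identifiers (lines starting '>') in a FASTA file."""
    with open(fasta_path) as f:
        return len({line[1:].strip() for line in f if line.startswith(">")})

def unique_column_values(csv_path, column):
    """Count distinct values in a named column of a CSV file."""
    with open(csv_path, newline="") as f:
        return len({row[column] for row in csv.DictReader(f)})
```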
Bioplatforms Australia - Australian Microbiome Search Facility
Copyright © 2017, Bioplatforms Australia.
BPA OTU is released under the GNU Affero GPL. See source for a licence copy.