BioplatformsAustralia / bpaotu

OTU database access for the Australian Microbiome
GNU Affero General Public License v3.0
5 stars 1 forks source link

Bioplatforms Australia - Operational taxonomic unit (OTU) query system

BPA-OTU is a web-based portal into Operational Taxonomic Unit (OTU) data, developed to access data from the Australian Microbiome.

System overview

Development environment setup

Backend (Django)

Ingest

Once the BE is operational it's possible to do a data ingest. This is described in detail in the Input data description section. For quick reference:

/path/to/bpaotu is the app root (i.e. where docker-compose.yml is)

Frontend (React)

These steps are performed in a separate terminal, i.e. not in the container, and from the frontend/ directory.

Input data description

BPA-OTU loads input data to generate a PostgreSQL schema named otu. The importer functionality completely erases all previously loaded data.

Three categories of file are ingested:

Note that /data/dev is a mount point in a Docker container. See ./docker-compose.yml

By default the contextual metadata will be downloaded during the ingest operation, or it can be provided as either a sqlite database or an Excel spreadsheet

./data/dev/amd-metadata/amd-samplecontextual/*.db # sqlite database
./data/dev/amd-metadata/amd-samplecontextual/*.xlsx # Excel spreadsheet

See "Additional arguments" below for more context on these.

Abundance and taxonomy files must be placed under a base directory for the particular ingest $dir, which is under the mount point for the Docker container, structured as follows:

./data/dev/$dir/$amplicon_code/*.txt.gz
./data/dev/$dir/$amplicon_code/*.$classifier_db.$classifier_method.taxonomy.gz

$classifier_db and $classifier_method describe the database and method used to generate a given taxonomy. They can be arbitrary strings.

The ingest is then run as a Django management command. To run this you will need to shell into the runserver container

cd ~/bpaotu # or wherever docker-compose.yml lives
# either this
docker-compose exec runserver bash
# or this
docker exec -it bpaotu_runserver_1 bash

## Either ingest using local sqlite db file for contextual metadata...
root@05abc9e1ecb2:~# /app/docker-entrypoint.sh django-admin otu_ingest $dir $yyyy_mm_dd --use-sql-context --no-force-fetch

## or download contextual metadata and use that:
root@420c1d1e9fe4:~# /app/docker-entrypoint.sh django-admin otu_ingest $dir $yyyy_mm_dd

If docker-compose exec runserver bash does not work, then find the id of the container with docker container ls (the system will need to be running for this to work, i.e. with docker-compose up) and then run docker exec -it 2361ab2339af bash (name will be different for the reader)

$dir is the base directory for the abundance and taxonomy files.

$yyyy_mm_dd is the ingest date .e.g. 2022-01-01

Example usage:

Get data file, unarchive and copy data to ./data/dev, and ingest data using a particular date:

cd ./data/dev
tar -xvzf </path/to/dataarchive.tar.gz> ./

cd ~/bpaotu # or wherever docker-compose.yml lives
docker-compose exec runserver bash
/app/docker-entrypoint.sh django-admin otu_ingest AM_data_db_submit_202303211107/ 2023-11-29 --use-sql-context --no-force-fetch

Additional arguments:

NOTE: the order is important if supplying both of these arguments

Contextual Metadata

This file describes sample specific metadata. The current schema of the contextual metadata can be found here

Taxonomy files

A gzip-compressed tab-delimited file with extension .taxonomy.gz

The first row of this file must contain a header. The required header fields are:

#OTU ID\tkingdom\tphylum\tclass\torder\tfamily\tgenus\tspecies\tamplicon\ttraits

or

#OTU ID\tkingdom\tsupergroup\tdivision\tclass\torder\tfamily\tgenus\tspecies\tamplicon\ttraits

Each column value is an arbitrary character string, with the following restrictions:

NB: Taxonomic ranks must be forward filled with last known field assignment if empty (e.g. dbacteria, dbacteria_unclassified, dbacteria_unclassified, dbacteria_unclassified, dbacteria_unclassified, dbacteriaunclassified, d\_bacteria_unclassified)

Example:

hou098@terrible-hf:~/bpaotu$ zcat  data/dev/202203050842/16S/16S_PWSW_seqs_listSET_OTU_taxon_20220304_withAMPLICON_FAPROTAXv124.silva132.SKlearn.taxonomy.gz  | head -4
#OTU ID confidence  kingdom phylum  class order family  genus species amplicon  traits
GATTGGCTCACGGACGCAAAACCACCAAAAAACACGTGACGTTACTGGTTGTCCGTCCTTTTGGTTTTTTTGCCCTTCTATGGTAATGCTATGAGTGCTTTTTGCAAAATGCTGCTCTGGGATTCGCTCCCGAACGCAACGCGCTACCTATTACTACTATCATAATTACATCACGCAAATTCAGGAGCTCATCAATGGTGAGCCAGCCAAGTTCATTCAAGATAGGTGAAATATGATCAAATTTCTTAGTATTAGTCAAAATACGGGCAGCAAAATTTTGTATAAGTTGTAGTTTATGAACATTATCCTTTGAAGTCCCAGACCATACAGTAGAACAGTAAAATAATTTACTAAAAACTAGTGAATTCAAAATGGTGTTCAATACCTCTCTAGAAAATAGGTGACGGACTCTATTTACTTGACATAAAGTAGATAAAAGGGAAGAACTAAGTGATGTAACGTAGTCATTAAAGTTAAAGTTCGAGTCTAGCAGAAGCCACGGGTTTTAACTCTTGACCAAGAAAAGGCACAGTGACATCTGGGAGCTGAGATAGGAGCTGTCTTACTCCGAA  0.4340600531226606  d__Unassigned d__Unassigned_unclassified  d__Unassigned_unclassified  d__Unassigned_unclassified  d__Unassigned_unclassified  d__Unassigned_unclassified  d__Unassigned_unclassified  27f519r_bacteria
AACGAACGCCGGCGGCGTGCTTAACACATGCAAGTCGAACGCGAAAGCCTGGGCAACTGGGCGAGTAGAGTGGCGAACGGGTGAGTAATACGTGAGTAACCTGCCCTTGAGTGGGGAATAACTCCTCGAAAGGGGAGCTAATACCGCATAAGACCACGACCCCGATGGGAGTTGCGGTCAAAGGTGGCCTCATGCACCAGAGCGTTTGGGCACAGATTCTGCGTGCCGGAAAAGAATCTGTACCCCAGCGCTTTGTCAGTGAAGCTATCGCTTGAGGAGGGGCTCGCGGCCCATCAGCTAGTTGGTAGGGTAATGGCCTACCAAGGCGACGACGGGTAGCTGGTCTGAGAGGACGACCAGCCACACGGGAATTGAGAGACGGTCCCGACTCCTACGGGAGGCAGCAGTGGGGAATCTTGGGCAATGGGGGAAACCCTGACCCAGCGACGCCGCGTGGGGGATGAAGGCCTTCGGGTTGTAAACCCCTGTTCGGTGGGACGAACATCTTCCCATGAACAGTGGGAAGATTTGACGGTACCACCAGAGTAAGCCCCGGCTAACTCCGTGC  0.9999802845765206  d__Bacteria d__Bacteria_unclassified  d__Bacteria_unclassified  d__Bacteria_unclassified  d__Bacteria_unclassified  d__Bacteria_unclassified  d__Bacteria_unclassified  27f519r_bacteria
GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGTAACAGGCTTTCACTGTTTACTGCTCTTCTTTCGATATGGAGCAAAGGTTTTCCAAACCTTATTCCTAACGGAGGAGTATCATCTCGTACTTTGACCTAGTCAAGATACGAAATGTAGAGAAGTGAAGAGTGAAAGTGCTGACGAGTGGCGGACGGCTGAGTAACGCGTGGGAACGTGCCCCAAAGTGAGGGATAAGCACCGGAAACGGTGTCTAATACCGCATATGATCTTCGGATTAAAGCAGAAATGCGCTTTGGGAGCGGCCCGCGTTGGATTAGGTAGTTGGTGAGGTAAAGGCTCACCAAGCCGACGATCCATAGCTGGTCTGAGAGGATGACCAGCCAGACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTCCACAATGGGCGAAAGCCTGATGGAGCAACGCCGCGTGCAGGATGAAGGCCTTAGGGTCGTAAACTGCTTTTATTAGTGAGGAATATGACGGTAACTAATGAATAAGGGTCGGCTAACTACGTGC 0.8979041295444753  d__Bacteria p__Patescibacteria  c__Saccharimonadia  o__Saccharimonadales  f__Saccharimonadales  g__Saccharimonadales  g__Saccharimonadales_unclassified 27f519r_bacteria

Abundance files

A gzip-compressed tab-delimited file with the extension .txt.gz

The first row is a header, with the following format:

#OTU ID\tSample_only\tAbundance\tAbundance_20K

Each column has the following format:

Missing values for Abundance or Abundance_20K are indicated by empty strings. Abundance can be the last field on the line if Abundance_20K is missing.

Example:

#OTU ID Sample_only Abundance Abundance_20K
AAAAGAAGTAAGTAGTCTAACCGCAAGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGG  21646 17
AAAAGAAGTAAGTAGTCTAACCGTTTACGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGG  21653 14
AAAAGAAGTAGATAGCTTAACCTTCGGGAGGGCGTTTACCACTTTGTGATTCATGACTGGGG  21644 70  2

Database visualisation

To generate an SVG diagram of the database schema, install the postgresql-autodoc and graphviz packages (Ubuntu), and then

PGPASSWORD=$db_password postgresql_autodoc -d webapp -h localhost -u webapp -s otu
dot  -Tsvg webapp.dot  > webapp.svg

Database login

Start a bash terminal on the db container and run log into psql with the webapp role:

psql -U webapp

Then set the search path to the "otu" schema at the psql prompt

SET search_path TO otu;

Testing

There is a script to test the output of the OTU and Contextual Download feature. This counts and displays the number of unique OTU hashes in the OTU.fasta file, the number of unique Sample IDs in the contextual.csv file, and for each domain .csv file, counts and displays the number of unique OTU hashes and unique Sample IDs. The results can then be inspected to ensure they are as expected for the given search.

To run, download a search, extract the results to a directory, cd to that directory and run the script:

. /path/to/bpaotu/test/verify-otu-contextual-export.sh

Deployments

Bioplatforms Australia - Australian Microbiome Search Facility

Licence

Copyright © 2017, Bioplatforms Australia.

BPA OTU is released under the GNU Affero GPL. See source for a licence copy.

Contributing