BioplatformsAustralia / bpaotu

OTU database access for the Australian Microbiome
GNU Affero General Public License v3.0

Feature request: New metagenome portal #187

Closed Smithmania closed 1 year ago

Smithmania commented 2 years ago

Docs related to new metagenome features

hou098 commented 2 years ago

Feature summary.

Users need to be able to access metagenome information for sites.

Proposal:

Similar interface to the existing bpaotu web app, but instead of the sites list below the search buttons linking to a download page (e.g. https://data.bioplatforms.com//organization/australian-microbiome?q=sample_id:102.100.100/138359 ), the site ID links/buttons need to open a metagenome panel, much like https://data.microbiomedata.org/ does. See the example table below. The panel could either expand the site label when clicked or open a modal dialog.

Sites with metagenome information will be present in the taxonomy files with OTU strings starting with mxa_, so this interface doesn't need to see all sites, just those with a corresponding OTU code beginning with mxa_.
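
As a rough illustration of the mxa_ filter described above, here is a minimal Python sketch. It assumes the taxonomy data can be read as (sample_id, otu_code) pairs; that shape, the function name and the example OTU codes are hypothetical and are not taken from the bpaotu code base.

```python
from typing import Iterable, Set, Tuple

# Prefix named in the proposal; everything else here is illustrative.
METAGENOME_OTU_PREFIX = "mxa_"


def metagenome_sample_ids(rows: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the sample ids that have at least one OTU code starting with mxa_.

    `rows` is a hypothetical stream of (sample_id, otu_code) pairs; the real
    bpaotu tables and column names may differ.
    """
    return {
        sample_id
        for sample_id, otu_code in rows
        if otu_code.startswith(METAGENOME_OTU_PREFIX)
    }


if __name__ == "__main__":
    example_rows = [
        ("10714", "mxa_0001"),  # made-up OTU codes for illustration
        ("10714", "GATTACA"),
        ("10716", "GATTACA"),
        ("12424", "mxa_0042"),
    ]
    print(sorted(metagenome_sample_ids(example_rows)))
```

Only samples 10714 and 12424 would be offered in metagenome mode in this example, because 10716 has no mxa_ code.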

Could either have some kind of toggle switch on the UI to switch between regular and metagenome mode, or maybe just use a URL query string to select metagenome mode, and then link to that from some landing page.
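
The query-string option could look roughly like the following sketch. The parameter name `mode` and the value `metagenome` are assumptions made for illustration, and the real check would most likely live in the TypeScript frontend rather than in Python.

```python
from urllib.parse import parse_qs, urlsplit


def is_metagenome_mode(url: str) -> bool:
    """True when the URL carries the (hypothetical) mode=metagenome flag."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values, so test for membership.
    return "metagenome" in query.get("mode", [])
```

A landing page would then only need to link to something like `/?mode=metagenome`, and the regular interface stays the default when the parameter is absent.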

Download URLs could probably be formulated with a simple pathname convention incorporating the site id e.g. https://download.example.com/something/$site_id/$something_else/something.xxx
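
A minimal sketch of such a pathname convention, assuming a base URL followed by the site id, a per-file-type subdirectory and the file name. The host is the placeholder from the example above, and the `qc_reads` subdirectory name used in the note below is hypothetical.

```python
from urllib.parse import quote

# Placeholder base URL taken from the example in the proposal.
BASE_URL = "https://download.example.com/something"


def metagenome_download_url(site_id: str, subdir: str, filename: str) -> str:
    """Build a download URL from the site id and a fixed path layout.

    Each component is percent-encoded on its own, so a site id that contains
    a slash (e.g. 102.100.100/138359) cannot add an extra path level.
    """
    parts = (site_id, subdir, filename)
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)
```

For example, `metagenome_download_url("21644", "qc_reads", "21644_combined_R1p.fastq.gz")` gives `https://download.example.com/something/21644/qc_reads/21644_combined_R1p.fastq.gz`.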


NMDC example download panel

| Data Object Type | Data Object Description | File Size | Downloads |
| --- | --- | --- | --- |
| Workflow Activity: Read QC Activity for nmdc:mga0khk038 | | | |
| Filtered Sequencing Reads | Reads QC result fastq (clean data) | 9.1 GiB | 0 |
| Workflow Activity: Assembly Activity for nmdc:mga0khk038 | | | |
| Assembly Coverage BAM | Sorted bam file of reads mapping back to the final assembly | 10.8 GiB | 0 |
| Assembly Scaffolds | Final assembly scaffolds fasta | 1.5 GiB | 0 |
| Assembly Contigs | Final assembly contigs fasta | 1.5 GiB | 0 |
| Workflow Activity: Annotation Activity for nmdc:mga0khk038 | | | |
| Annotation KEGG Orthology | Tab delimited file for KO annotation | 42.0 MiB | 0 |
| Structural Annotation GFF | GFF3 format file with structural annotations | 204.0 MiB | 0 |
| Annotation Enzyme Commission | Tab delimited file for EC annotation | 27.3 MiB | 0 |
| Functional Annotation GFF | GFF3 format file with functional annotations | 371.6 MiB | 0 |
| Annotation Amino Acid FASTA | FASTA amino acid file for annotated proteins | 425.2 MiB | 0 |
| Workflow Activity: MAGs Analysis Activity for nmdc:mga0khk038 | | | |
| CheckM Statistics | CheckM statistics report | 765 B | 0 |

hou098 commented 2 years ago

Data is provided as files named by sample ID:

hou098@terrible-hf:/mnt/data/work/amd/Metagenome_QC_reads$ pwd
/mnt/data/work/amd/Metagenome_QC_reads
hou098@terrible-hf:/mnt/data/work/amd/Metagenome_QC_reads$ ls|head -25
10714_HFLF3BCXX-1_merged.fastq.gz
10714_HFLF3BCXX-1_R1p.fastq.gz
10714_HFLF3BCXX-1_R1R2u.fastq.gz
10714_HFLF3BCXX-1_R2p.fastq.gz
10714.md5
10716_HFLF3BCXX-1_merged.fastq.gz
10716_HFLF3BCXX-1_R1p.fastq.gz
10716_HFLF3BCXX-1_R1R2u.fastq.gz
10716_HFLF3BCXX-1_R2p.fastq.gz
10716.md5
10718_HFLF3BCXX-2_merged.fastq.gz
10718_HFLF3BCXX-2_R1p.fastq.gz
10718_HFLF3BCXX-2_R1R2u.fastq.gz
10718_HFLF3BCXX-2_R2p.fastq.gz
10718.md5
10720_HFLF3BCXX-2_merged.fastq.gz
10720_HFLF3BCXX-2_R1p.fastq.gz
10720_HFLF3BCXX-2_R1R2u.fastq.gz
10720_HFLF3BCXX-2_R2p.fastq.gz
10720.md5
12424_combined_merged.fastq.gz
12424_combined_R1p.fastq.gz
12424_combined_R1R2u.fastq.gz
12424_combined_R2p.fastq.gz
12424.md5
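
The naming pattern visible in the listing above (sample id, run label, read type) can be parsed with a short Python sketch like the one below. The regular expressions are inferred only from the file names shown here, and the function names are hypothetical; other naming variants may exist.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# <sample id>_<run label>_<read type>.fastq.gz, e.g. 10714_HFLF3BCXX-1_R1p.fastq.gz
QC_READS_RE = re.compile(
    r"^(?P<sample_id>\d+)_(?P<run>.+)_(?P<read_type>merged|R1p|R2p|R1R2u)\.fastq\.gz$"
)
# <sample id>.md5, e.g. 10714.md5
CHECKSUM_RE = re.compile(r"^(?P<sample_id>\d+)\.md5$")


def parse_qc_filename(name: str) -> Optional[Tuple[str, str]]:
    """Return (sample_id, kind) for a recognised file name, else None."""
    match = QC_READS_RE.match(name)
    if match:
        return match.group("sample_id"), match.group("read_type")
    match = CHECKSUM_RE.match(name)
    if match:
        return match.group("sample_id"), "md5"
    return None


def group_by_sample(names: List[str]) -> Dict[str, Dict[str, str]]:
    """Map sample id -> {kind: file name}, skipping unrecognised names."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in names:
        parsed = parse_qc_filename(name)
        if parsed:
            sample_id, kind = parsed
            grouped[sample_id][kind] = name
    return dict(grouped)
```

Grouping by sample id this way gives one entry per sample with up to five files (merged, R1p, R2p, R1R2u and md5), which is the shape a per-sample download panel would need.
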
hou098 commented 2 years ago

File descriptions and paths to example data for the bpa-otu metagenome enhancements from @Smithmania

File naming convention needs a bit more thought. We probably want everything to be of the form $sampleid-*

| Data object type | Data object description | Data object methodology | Data object example file |
| --- | --- | --- | --- |
| Filtered sequencing reads - sampleID_*_R1p.fastq.gz | Quality filtered R1 paired reads | BBTools QC protocol | /datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644_combined_R1p.fastq.gz |
| Filtered sequencing reads - sampleID_*_R2p.fastq.gz | Quality filtered R2 paired reads | BBTools QC protocol | /datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644_combined_R2p.fastq.gz |
| Filtered sequencing reads - sampleID_*_merged.fastq.gz | Quality filtered merged reads | BBTools QC protocol | /datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644_combined_merged.fastq.gz |
| Filtered sequencing reads - sampleID_*_R1R2u.fastq.gz | Quality filtered unpaired reads | BBTools QC protocol | /datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644_combined_R1R2u.fastq.gz |
| checksum - sampleID.md5 | md5 sum of above files | | /datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644.md5 |
| Workflow activity: Assembly activity | | | |
| Assembly - 01.sampleID.fasta | Fasta file containing the contigs from the assembly | SqueezeMeta full workflow - input R1, R2 | /datasets/work/oa-env-gen/work/Smith/Hadza/results/01.Hadza.fasta |
| Assembly statistics - 01.sampleID.lon | Length of the contigs | SqueezeMeta full workflow - input R1, R2 | /datasets/work/oa-env-gen/work/Smith/Hadza/results/intermediate/01.Hadza.lon |
| Assembly statistics - 01.sampleID.stats | Assembly statistics (N50, N90, number of reads, etc) | SqueezeMeta full workflow - input R1, R2 | /datasets/work/oa-env-gen/work/Smith/Hadza/results/intermediate/01.Hadza.stats |
| BINNING | There are a number of options and outputs here. Bins are calculated using MetaBAT2 and MaxBin and combined using DAS Tool; for an example see the directories at /datasets/work/oa-env-gen/work/Smith/Hadza/results/. Perhaps we can supply all binned fasta files, or supply only the fasta files associated with the DAS Tool merged bins (only one example is included here) and include the final summary table (19.sampleID.bintable) as below. | | |
| Assembly - maxbin.002.fasta.contigs.fa | Fasta file containing binned metagenomic reads | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/DAS/Hadza_DASTool_bins/maxbin.002.fasta.contigs.fa |
| Annotation - 19.sampleID.bintable | Compilation of all data for bins | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/19.Hadza.bintable |
| Workflow activity: Annotation activity | | | |
| Annotation - sampleID_sqm_reads.out.allreads | Taxonomic and functional assignments for each read | SqueezeMeta reads - input: R1p,merged,R1R2u | /datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.allreads |
| Annotation - sampleID_sqm_reads.out.allreads.funcog | Abundance of all COG functions | SqueezeMeta reads - input: R1p,merged,R1R2u | /datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.allreads.funcog |
| Annotation - sampleID_sqm_reads.out.allreads.funkegg | Abundance of all KEGG functions | SqueezeMeta reads - input: R1p,merged,R1R2u | /datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.allreads.funkegg |
| Annotation - sampleID_sqm_reads.out.allreads.mcount | Abundance of all taxa | SqueezeMeta reads - input: R1p,merged,R1R2u | /datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.allreads.mcount |
| Annotation - sampleID_sqm_reads.out.allreads.mappingstat | Summary of total reads and hits to nr | SqueezeMeta reads - input: R1p,merged,R1R2u | /datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.mappingstat |
| checksum for SQM reads - sampleID.md5 | md5sum of SQM_reads files | | |
| Annotation - 02.sampleID.16S.txt | Assignment (RDP classifier) for the 16S rRNAs sequences found | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/02.Hadza.16S.txt |
| Annotation - 02.sampleID.rnas | Fasta file containing all RNAs found | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/02.Hadza.rnas |
| Annotation - 02.sampleID.trnas | Text file containing contig and position of tRNAs found | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/02.Hadza.trnas |
| Annotation - 02.sampleID.trnas.fasta | Fasta file containing the contigs resulting from the assembly, masking the positions where a tRNA was found | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/02.Hadza.trnas.fasta |
| Annotation - 02.sampleID.maskedrna.fasta | Fasta file containing the contigs resulting from the assembly, masking the positions where a RNA was found | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/intermediate/02.Hadza.maskedrna.fasta |
| Annotation - 03.sampleID.faa | Amino acid sequences for predicted ORFs | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/03.Hadza.faa |
| Annotation - 03.sampleID.fna | Nucleotide sequences for predicted ORFs | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/03.Hadza.fna |
| Annotation - 03.sampleID.gff | Features and position in contigs for each of the predicted genes | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/03.Hadza.gff |
| Annotation - 06.sampleID.fun3.tax.noidfilter.wranks | Taxonomic assignments not considering identity filters for each ORF, including taxonomic ranks | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/06.Hadza.fun3.tax.noidfilter.wranks |
| Annotation - 06.sampleID.fun3.tax.wranks | Taxonomic assignments for each ORF, including taxonomic ranks | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/06.Hadza.fun3.tax.wranks |
| Annotation - 07.sampleID.fun3.cog | COG functional assignment for each ORF | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/07.Hadza.fun3.cog |
| Annotation - 07.sampleID.fun3.kegg | KEGG functional assignment for each ORF | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/07.Hadza.fun3.kegg |
| Annotation - 07.sampleID.fun3.pfam | PFAM functional assignment for each ORF | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/07.Hadza.fun3.pfam |
| Annotation statistics - 10.sampleID.mappingstat | Mapping percentage of reads to samples | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/10.Hadza.mappingstat |
| Annotation statistics - 10.sampleID.mapcount | Several measures regarding mapping of reads to ORFs | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/intermediate/10.Hadza.mapcount |
| Annotation statistics - 10.sampleID.contigcov | Several measures regarding mapping of reads to ORFs | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/intermediate/10.Hadza.contigcov |
| Annotation - 11.sampleID.mcount | Abundance table of taxa | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/11.Hadza.mcount |
| Annotation - 12.sampleID.cog.funcover | Measurements of the abundance and distribution of each COG | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/12.Hadza.cog.funcover |
| Annotation - 12.sampleID.kegg.funcover | Measurements of the abundance and distribution of each KEGG | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/12.Hadza.kegg.funcover |
| Annotation - 13.sampleID.orftable | Several measures regarding ORF characteristics | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/13.Hadza.orftable |
| Annotation - 20.sampleID.contigtable | Compilation of data for contigs | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/20.Hadza.contigtable |
| Annotation - 21.sampleID.kegg.pathways | Prediction of KEGG pathways in bins | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/21.Hadza.kegg.pathways |
| Annotation - 21.sampleID.metacyc.pathways | Prediction of Metacyc pathways in bins | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/21.Hadza.metacyc.pathways |
| Annotation statistics - 22.sampleID.stats | Several statistics regarding ORFs, contigs and bins | SqueezeMeta full workflow - input R1, R2 Barrnap | /datasets/work/oa-env-gen/work/Smith/Hadza/results/22.Hadza.stats |
| checksum for SQM full workflow - sampleID.md5 | md5sum of SQM full workflow | | No file generated yet |

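
Since each sample ships with a sampleID.md5 file, a download or ingest step could verify the data files against it. The sketch below assumes the .md5 files use the usual `md5sum` output format (hex digest, whitespace, file name); that format is not confirmed in this thread, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_md5_manifest(manifest: Path) -> Dict[str, str]:
    """Parse an md5sum-style manifest into {file name: expected digest}."""
    expected: Dict[str, str] = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' on the file name.
        expected[name.lstrip("*").strip()] = checksum.lower()
    return expected


def verify_sample(directory: Path, sample_id: str) -> Dict[str, bool]:
    """Return {file name: True if present and matching} for one sample."""
    manifest = read_md5_manifest(directory / f"{sample_id}.md5")
    return {
        name: (directory / name).is_file()
        and file_md5(directory / name) == checksum
        for name, checksum in manifest.items()
    }
```

Reading in chunks matters here because the filtered read files can be several gigabytes each.
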
hou098 commented 2 years ago

What about the download buttons? Do …

… make any sense in metagenome mode? Maybe replace "Download OTU and Contextual Data" with "Download metagenome files" which pops up a modal dialog containing checkboxes for the various metagenome files, then download a zip file of selected files for selected sites?
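
One way the "zip of selected files for selected sites" idea could work on the server side is sketched below, purely as an illustration. It assumes the files for all samples sit in one directory and are named with the sample id as a prefix; the function name and the suffix-based selection are hypothetical. It also builds the archive in memory, which would not be practical for multi-gigabyte read files, so a real implementation would need to stream the archive instead.

```python
import io
import zipfile
from pathlib import Path
from typing import Iterable


def zip_selected_files(
    root: Path, sample_ids: Iterable[str], suffixes: Iterable[str]
) -> bytes:
    """Bundle the files matching the ticked suffixes for the chosen samples.

    Files are stored uncompressed (ZIP_STORED) because .fastq.gz data is
    already compressed.
    """
    wanted = tuple(suffixes)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for sample_id in sample_ids:
            for path in sorted(root.glob(f"{sample_id}_*")):
                if path.name.endswith(wanted):
                    # One folder per sample inside the archive.
                    archive.write(path, arcname=f"{sample_id}/{path.name}")
    return buffer.getvalue()
```

Each checkbox in the dialog would map to one suffix (for example `_R1p.fastq.gz`), and the selected sites would supply the sample ids.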

What about map display?

hou098 commented 2 years ago

After talking to @abissett 21 March 2022:

hou098 commented 2 years ago

Under construction in https://github.com/BioplatformsAustralia/bpaotu/tree/metagenome-feature-WIP (all frontend work so far, with stubs in a few places).

hou098 commented 2 years ago

Discussed with @mtearle on 12 April 2022.

hou098 commented 2 years ago

Depends on https://github.com/BioplatformsAustralia/bpaotu/issues/198 to allow filtering by map location. Done: https://github.com/BioplatformsAustralia/bpaotu/commit/b3e6cf5560477c486e75cc89615f250a6e507064

abissett commented 2 years ago

Example of secondary data from the BPA data portal (Threatened Species Initiative):

https://data.bioplatforms.com/dataset/bpa-tsi-genome-assembly-359774

hou098 commented 2 years ago

Initial metagenome data for one sample for testing: https://data.bioplatforms.com/dataset/bpa-amdb-metagenomics-analysed-21645

hou098 commented 1 year ago

Per-sample metagenome downloads now working in https://github.com/BioplatformsAustralia/bpaotu/commit/3815a36ab660751775bd0f8d682d4b9cb016c94b

Bulk downloads (i.e. multiple samples, multiple metagenome files) are still a work in progress. See https://github.com/BioplatformsAustralia/bpaotu/blob/3815a36ab660751775bd0f8d682d4b9cb016c94b/frontend/src/pages/search_page/components/metagenome_modal.tsx#L65

hou098 commented 1 year ago

Bulk downloads implemented in a1c82f41758127ae5526e741bef870ef1825712e

hou098 commented 1 year ago

Ready to test as of e564bc4e99756517d05e80cd6839f3c2d7db8b6f ( tag: 1.35.1-metagenomedemo4 )

hou098 commented 1 year ago

Implemented in https://github.com/BioplatformsAustralia/bpaotu/tree/1.36.0