Feature request: New metagenome portal

Smithmania commented 2 years ago

Docs related to new metagenome features

hou098 commented 2 years ago

Feature summary.

Users need to be able to access metagenome information for sites.

Proposal:

Similar interface to the existing bpaotu web app. Currently the sites list below the search buttons links to a download page (e.g. https://data.bioplatforms.com//organization/australian-microbiome?q=sample_id:102.100.100/138359 ). Instead, the site id links/buttons need to open a metagenome panel, much like https://data.microbiomedata.org/ does (see the example table below). The panel could either expand the site label when clicked or open a modal dialog.

Sites with metagenome information will be present in the taxonomy files with OTU strings starting with mxa_, so this interface doesn't need to see all sites, just those with a corresponding OTU code starting with mxa_.
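
As a rough illustration of that filter, here is a minimal Python sketch. It assumes the taxonomy data can be read as (sample id, OTU code) pairs; the function name, the row shape and the OTU codes in the example are hypothetical, and only the mxa_ prefix comes from the proposal.

```python
from typing import Iterable, Set, Tuple

METAGENOME_PREFIX = "mxa_"  # prefix named in the proposal


def metagenome_sample_ids(rows: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the sample ids that have at least one OTU code starting with mxa_.

    Each row is a hypothetical (sample_id, otu_code) pair.
    """
    return {
        sample_id
        for sample_id, otu_code in rows
        if otu_code.startswith(METAGENOME_PREFIX)
    }


# Made-up rows, for illustration only.
rows = [
    ("102.100.100/138359", "mxa_0001"),
    ("102.100.100/138359", "ACGTACGT"),
    ("102.100.100/10714", "TTGATTGA"),
]
print(metagenome_sample_ids(rows))
```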

Could either have some kind of toggle switch on the UI to switch between regular and metagenome mode, or maybe just use a URL query string to select metagenome mode, and then link to that from some landing page.
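
The query-string option could look something like the sketch below. The parameter name `mode` and the value `metagenome` are assumptions made for illustration, not an agreed convention, and the example URLs are placeholders.

```python
from urllib.parse import parse_qs, urlsplit


def is_metagenome_mode(url: str) -> bool:
    """True when the (hypothetical) 'mode' query parameter is 'metagenome'."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; take the last value if repeated.
    return params.get("mode", [""])[-1] == "metagenome"


print(is_metagenome_mode("https://example.org/bpaotu/?mode=metagenome"))
print(is_metagenome_mode("https://example.org/bpaotu/"))
```

A landing page would then only need a plain link carrying that query string.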

Download URLs could probably be formulated with a simple pathname convention incorporating the site id e.g. https://download.example.com/something/$site_id/$something_else/something.xxx
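
A minimal sketch of such a convention, assuming a two-level layout of `<base>/<site_id>/<filename>`. The base URL is the placeholder from the example above; the layout and the function name are hypothetical and the real path scheme is still to be decided.

```python
from urllib.parse import quote

BASE_URL = "https://download.example.com/something"  # placeholder, not a real host


def download_url(site_id: str, filename: str) -> str:
    """Build a download URL from a site id and a file name (assumed layout)."""
    # safe="" also escapes "/", which matters for ids like 102.100.100/138359.
    return f"{BASE_URL}/{quote(site_id, safe='')}/{quote(filename, safe='')}"


print(download_url("21644", "21644_combined_R1p.fastq.gz"))
```

Escaping the site id keeps ids that contain a slash from silently adding an extra path segment.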

NMDC example download panel

	Data Object Type	Data Object Description	File Size	Downloads
Workflow Activity: Read QC Activity for nmdc:mga0khk038
	Filtered Sequencing Reads	Reads QC result fastq (clean data)	9.1 GiB	0
Workflow Activity: Assembly Activity for nmdc:mga0khk038
	Assembly Coverage BAM	Sorted bam file of reads mapping back to the final assembly	10.8 GiB	0
	Assembly Scaffolds	Final assembly scaffolds fasta	1.5 GiB	0
	Assembly Contigs	Final assembly contigs fasta	1.5 GiB	0
Workflow Activity: Annotation Activity for nmdc:mga0khk038
	Annotation KEGG Orthology	Tab delimited file for KO annotation	42.0 MiB	0
	Structural Annotation GFF	GFF3 format file with structural annotations	204.0 MiB	0
	Annotation Enzyme Commission	Tab delimited file for EC annotation	27.3 MiB	0
	Functional Annotation GFF	GFF3 format file with functional annotations	371.6 MiB	0
	Annotation Amino Acid FASTA	FASTA amino acid file for annotated proteins	425.2 MiB	0
Workflow Activity: MAGs Analysis Activity for nmdc:mga0khk038
	CheckM Statistics	CheckM statistics report	765 B	0

hou098 commented 2 years ago

Data is provided in files named by sample id:

hou098@terrible-hf:/mnt/data/work/amd/Metagenome_QC_reads$ pwd
/mnt/data/work/amd/Metagenome_QC_reads
hou098@terrible-hf:/mnt/data/work/amd/Metagenome_QC_reads$ ls|head -25
10714_HFLF3BCXX-1_merged.fastq.gz
10714_HFLF3BCXX-1_R1p.fastq.gz
10714_HFLF3BCXX-1_R1R2u.fastq.gz
10714_HFLF3BCXX-1_R2p.fastq.gz
10714.md5
10716_HFLF3BCXX-1_merged.fastq.gz
10716_HFLF3BCXX-1_R1p.fastq.gz
10716_HFLF3BCXX-1_R1R2u.fastq.gz
10716_HFLF3BCXX-1_R2p.fastq.gz
10716.md5
10718_HFLF3BCXX-2_merged.fastq.gz
10718_HFLF3BCXX-2_R1p.fastq.gz
10718_HFLF3BCXX-2_R1R2u.fastq.gz
10718_HFLF3BCXX-2_R2p.fastq.gz
10718.md5
10720_HFLF3BCXX-2_merged.fastq.gz
10720_HFLF3BCXX-2_R1p.fastq.gz
10720_HFLF3BCXX-2_R1R2u.fastq.gz
10720_HFLF3BCXX-2_R2p.fastq.gz
10720.md5
12424_combined_merged.fastq.gz
12424_combined_R1p.fastq.gz
12424_combined_R1R2u.fastq.gz
12424_combined_R2p.fastq.gz
12424.md5
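
For what it's worth, names like these can be grouped by sample id with a small amount of parsing. The sketch below is only an illustration based on the listing above: the regular expressions are inferred from these 25 names and may not cover every file in the directory, and the function name is made up.

```python
import re
from collections import defaultdict

# Patterns inferred from the listing above (an assumption, not a spec).
READS_RE = re.compile(
    r"^(?P<sample>\d+)_(?P<run>.+)_(?P<kind>merged|R1p|R2p|R1R2u)\.fastq\.gz$"
)
MD5_RE = re.compile(r"^(?P<sample>\d+)\.md5$")


def group_by_sample(filenames):
    """Map sample id -> {file kind: file name}; unrecognised names are skipped."""
    groups = defaultdict(dict)
    for name in filenames:
        match = READS_RE.match(name)
        if match:
            groups[match["sample"]][match["kind"]] = name
            continue
        match = MD5_RE.match(name)
        if match:
            groups[match["sample"]]["md5"] = name
    return dict(groups)


listing = [
    "10714_HFLF3BCXX-1_merged.fastq.gz",
    "10714_HFLF3BCXX-1_R1p.fastq.gz",
    "10714.md5",
    "12424_combined_R1R2u.fastq.gz",
]
print(group_by_sample(listing))
```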

hou098 commented 2 years ago

File descriptions and paths to example data for the bpa-otu metagenome enhancements from @Smithmania

File naming convention needs a bit more thought. We probably want everything to be of the form $sampleid-*

Data object type	Data object description	Data object methodology	Data object example file
Filtered sequencing reads - sampleID_*_R1p.fastq.gz	Quality filtered R1 paired reads	BBtools QC protocol	/datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644_combined_R1p.fastq.gz
Filtered sequencing reads - sampleID_*_R2p.fastq.gz	Quality filtered R2 paired reads	BBtools QC protocol	/datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644_combined_R2p.fastq.gz
Filtered sequencing reads - sampleID_*_merged.fastq.gz	Quality filtered merged reads	BBtools QC protocol	/datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644_combined_merged.fastq.gz
Filtered sequencing reads - sampleID_*_R1R2u.fastq.gz	Quality filtered unpaired reads	BBtools QC protocol	/datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644_combined_R1R2u.fastq.gz
checksum - sampleID.md5	md5 sum of above files		/datasets/work/oa-amd/work/amd/Metagenome_QC_reads/21644.md5

Workflow activity: Assembly activity
Assembly - 01.sampleID.fasta	Fasta file containing the contigs from the assembly	Squeezemets full workflow - input R1, R2	/datasets/work/oa-env-gen/work/Smith/Hadza/results/01.Hadza.fasta
Assembly statistics - 01.sampleID.lon	Length of the contigs	Squeezemets full workflow - input R1, R2	/datasets/work/oa-env-gen/work/Smith/Hadza/results/intermediate/01.Hadza.lon
Assembly statistics - 01.sampleID.stats	Assembly statistics (N50, N90, number of reads, etc)	Squeezemets full workflow - input R1, R2	/datasets/work/oa-env-gen/work/Smith/Hadza/results/intermediate/01.Hadza.stats
*BINNING* There are a number of options and outputs here. Bins are calculated using metabat2 and maxbin and combined using DAStool; for an example see the directories at /datasets/work/oa-env-gen/work/Smith/Hadza/results/ . Perhaps we can supply all binned fasta files, or supply only the fasta files associated with the DAStool merged bins (only one example is included here), and include the final summary table (19.sampleID.bintable) as below.
Assembly - maxbin.002.fasta.contigs.fa	Fasta file containing binned metagenomic reads	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/DAS/Hadza_DASTool_bins/maxbin.002.fasta.contigs.fa
Annotation - 19.sampleID.bintable	Compilation of all data for bins	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/19.Hadza.bintable

Workflow activity: Annotation activity
Annotation - sampleID_sqm_reads.out.allreads	Taxonomic and functional assignments for each read	Squeezemeta reads - input: R1p,merged,R1R2u	/datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.allreads
Annotation - sampleID_sqm_reads.out.allreads.funcog	Abundance of all COG functions	Squeezemeta reads - input: R1p,merged,R1R2u	/datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.allreads.funcog
Annotation - sampleID_sqm_reads.out.allreads.funkegg	Abundance of all KEGG functions	Squeezemeta reads - input: R1p,merged,R1R2u	/datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.allreads.funkegg
Annotation - sampleID_sqm_reads.out.allreads.mcount	Abundance of all taxa	Squeezemeta reads - input: R1p,merged,R1R2u	/datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.allreads.mcount
Annotation - sampleID_sqm_reads.out.allreads.mappingstat	Summary of total reads and hits to nr	Squeezemeta reads - input: R1p,merged,R1R2u	/datasets/work/oa-amd/work/amd-work/SQM_READS/21644/21644_sqm_reads.out.mappingstat
checksum for SQM reads - sampleID.md5	md5sum of SQM_reads files
Annotation - 02.sampleID.16S.txt	Assignment (RDP classifier) for the 16S rRNAs sequences found	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/02.Hadza.16S.txt
Annotation - 02.sampleID.rnas	Fasta file containing all RNAs found	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/02.Hadza.rnas
Annotation - 02.sampleID.trnas	Text file containing contig and position of tRNAs found	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/02.Hadza.trnas
Annotation - 02.sampleID.trnas.fasta	Fasta file containing the contigs resulting from the assembly, masking the positions where a tRNA was found	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/02.Hadza.trnas.fasta
Annotation - 02.sampleID.maskedrna.fasta	Fasta file containing the contigs resulting from the assembly, masking the positions where a RNA was found	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/intermediate/02.Hadza.maskedrna.fasta
Annotation - 03.sampleID.faa	Amino acid sequences for predicted ORFs	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/03.Hadza.faa
Annotation - 03.sampleID.fna	Nucleotide sequences for predicted ORFs	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/03.Hadza.fna
Annotation - 03.sampleID.gff	Features and position in contigs for each of the predicted genes	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/03.Hadza.gff
Annotation - 06.sampleID.fun3.tax.noidfilter.wranks	taxonomic assignments not considering identity filters for each ORF, including taxonomic ranks	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/06.Hadza.fun3.tax.noidfilter.wranks
Annotation - 06.sampleID.fun3.tax.wranks	taxonomic assignments for each ORF, including taxonomic ranks	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/06.Hadza.fun3.tax.wranks
Annotation - 07.sampleID.fun3.cog	COG functional assignment for each ORF	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/07.Hadza.fun3.cog
Annotation - 07.sampleID.fun3.kegg	KEGG functional assignment for each ORF	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/07.Hadza.fun3.kegg
Annotation - 07.sampleID.fun3.pfam	PFAM functional assignment for each ORF	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/07.Hadza.fun3.pfam
Annotation statistics - 10.sampleID.mappingstat	Mapping percentage of reads to samples	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/10.Hadza.mappingstat
Annotation statistics - 10.sampleID.mapcount	Several measures regarding mapping of reads to ORFs	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/intermediate/10.Hadza.mapcount
Annotation statistics - 10.sampleID.contigcov	Several measures regarding mapping of reads to ORFs	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/intermediate/10.Hadza.contigcov
Annotation - 11.sampleID.mcount	Abundance table of taxa	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/11.Hadza.mcount
Annotation - 12.sampleID.cog.funcover	measurements of the abundance and distribution of each COG	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/12.Hadza.cog.funcover
Annotation - 12.sampleID.kegg.funcover	measurements of the abundance and distribution of each KEGG	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/12.Hadza.kegg.funcover
Annotation - 13.sampleID.orftable	Several measures regarding ORF characteristics	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/13.Hadza.orftable
Annotation - 20.sampleID.contigtable	Compilation of data for contigs	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/20.Hadza.contigtable
Annotation - 21.sampleID.kegg.pathways	prediction of KEGG pathways in bins	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/21.Hadza.kegg.pathways
Annotation - 21.sampleID.metacyc.pathways	prediction of Metacyc pathways in bins	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/21.Hadza.metacyc.pathways
Annotation statistics - 22.sampleID.stats	Several statistics regarding ORFs, contigs and bins	Squeezemets full workflow - input R1, R2 Barrnap	/datasets/work/oa-env-gen/work/Smith/Hadza/results/22.Hadza.stats
checksum for SQM full workflow - sampleID.md5	md5sum of SQM full workflow		No file generated yet
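
The sampleID.md5 files listed above could be checked with something like the following sketch. It assumes each .md5 file uses the usual `md5sum` output format (hex digest, whitespace, file name) and sits in the same directory as the files it lists; neither assumption has been confirmed for these datasets, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_file(md5_path: Path) -> dict:
    """Map each listed file name to True (digest matches) or False.

    Files that are listed but not present are reported as False.
    """
    results = {}
    for line in md5_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum prefixes binary-mode names with '*'
        target = md5_path.parent / name
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results
```

Calling `verify_md5_file(Path("21644.md5"))` from the reads directory would then report which of that sample's files are intact.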

hou098 commented 2 years ago

What about the download buttons? Do …

Download OTU and Contextual Data (CSV)
Download Contextual Data only (CSV)
Download BIOM format (Phinch compatible)

… make any sense in metagenome mode? Maybe replace "Download OTU and Contextual Data" with "Download metagenome files" which pops up a modal dialog containing checkboxes for the various metagenome files, then download a zip file of selected files for selected sites?

What about map display?

hou098 commented 2 years ago

After talking to @abissett 21 March 2022:

- Metagenome download modal box also needs a link to download contextual data
- Map search shows no sites when amplicon is metaxa_from_metagenomes, probably because there is no 20k abundance data. Fix this to just show sites.
- The top level /map url has a link to it from https://data.bioplatforms.com/organization/about/australian-microbiome , so either retain that functionality or remove link.
- For metagenome data, replace
  - Download OTU and Contextual Data (CSV)
  - Download Contextual Data only (CSV)
  - Download BIOM format (Phinch compatible)

  with "Download metagenome data (CSV)". Pop up a modal dialog to select required files for selected sites using checkboxes and provide a "download" button.
- What if some sites don't have metagenome data? The current plan is just to use a simple pattern-based naming scheme for the download URLs. If some sites don't have metagenome files, that will result in 404 for those sites.
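
One way the "selected files for selected sites" download might handle sites without metagenome data is to skip missing files and report them, rather than failing the whole request. The sketch below assumes the files are readable from a local directory and are named `<sample_id>_<suffix>`; both are assumptions for illustration, since the naming convention and storage location are not settled, and the function name is made up.

```python
import io
import zipfile
from pathlib import Path


def zip_selected(root: Path, sample_ids, suffixes):
    """Return (zip_bytes, missing) for files named <sample_id>_<suffix> under root."""
    buffer = io.BytesIO()
    missing = []
    # ZIP_STORED: the .gz inputs are already compressed, so skip recompression.
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for sample_id in sample_ids:
            for suffix in suffixes:
                name = f"{sample_id}_{suffix}"
                path = root / name
                if path.is_file():
                    archive.write(path, arcname=f"{sample_id}/{name}")
                else:
                    missing.append(name)  # the equivalent of a 404 for this file
    return buffer.getvalue(), missing
```

The `missing` list could be shown in the modal dialog so users know which sites had no files.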

hou098 commented 2 years ago

Under construction in https://github.com/BioplatformsAustralia/bpaotu/tree/metagenome-feature-WIP . All frontend stuff so far, with stubs in a few places.

hou098 commented 2 years ago

Discussed with @mtearle on 12 April 2022.

- CKAN data is stored in Amazon S3 buckets, and that's where any extra metagenome files will go.
- There's a Python ckanapi. See https://usersupport.bioplatforms.com/programmatic_access.html
- Can bounce a user's browser through bpa for authentication and then on to a time-limited S3 URL for direct-to-browser downloads of individual files.
- Need to talk to @Smithmania to finalise naming conventions etc. See 'Data Transfer Document.pdf' from @mtearle and https://github.com/BioplatformsAustralia/bpaotu/issues/187#issuecomment-1062430623
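
To make the "time-limited URL" idea concrete, here is a simplified stand-in using only the Python standard library. It is not how S3 presigned URLs are computed (in practice the AWS SDK would generate those); the secret, parameter names and functions below are all hypothetical and only show the general shape: sign the path plus an expiry time, and reject the link once the expiry has passed or the signature does not match.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-secret"  # hypothetical shared secret, for illustration only


def _signature(path: str, expires: int) -> str:
    """HMAC-SHA256 over the path and expiry timestamp."""
    return hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_seconds: int, now=None) -> str:
    """Append an expiry timestamp and signature to a path."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(path, expires)})
    return f"{path}?{query}"


def is_valid(path: str, expires: int, signature: str, now=None) -> bool:
    """Accept only unexpired links whose signature matches the path and expiry."""
    current = time.time() if now is None else now
    return current <= expires and hmac.compare_digest(
        _signature(path, expires), signature
    )
```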

hou098 commented 2 years ago

~~Depends on https://github.com/BioplatformsAustralia/bpaotu/issues/198 to allow filtering by map location.~~ Done: https://github.com/BioplatformsAustralia/bpaotu/commit/b3e6cf5560477c486e75cc89615f250a6e507064