RobertsLab / resources

https://robertslab.github.io/resources/

Describe taxonomic diversity in the metagenomic assembly #547

Closed (sr320 closed 5 years ago)

sr320 commented 5 years ago

fasta: http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/megahit_out/final.contigs.fa

Then use the coverage file http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/coverage.txt or the SAM file http://gannet.fish.washington.edu/Atumefaciens/20190102_metagenomics_geo_megahit/aln.sam.gz to look at abundance.
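One way to turn a per-contig coverage table into abundances is sketched below. This is only a sketch: the layout of `coverage.txt` is not described in this thread, so the column names (`contig`, `length`, `mean_depth`) and the toy rows are assumptions for illustration, not the real file. It treats mean depth × contig length as an approximation of the bases mapped to each contig and reports each contig's share of the total.

```python
import csv
import io


def relative_abundance(coverage_tsv, name_col="contig",
                       length_col="length", depth_col="mean_depth"):
    """Return {contig: share of mapped bases} from a tab-separated table.

    The column names are assumptions; adjust them to match the header
    of the actual coverage file.
    """
    reader = csv.DictReader(coverage_tsv, delimiter="\t")
    mapped_bases = {}
    for row in reader:
        # mean depth * length approximates the read bases mapped to the contig
        mapped_bases[row[name_col]] = float(row[depth_col]) * int(row[length_col])
    total = sum(mapped_bases.values())
    if total == 0:
        return {name: 0.0 for name in mapped_bases}
    return {name: bases / total for name, bases in mapped_bases.items()}


# Made-up example table (NOT taken from the real coverage.txt).
example = io.StringIO(
    "contig\tlength\tmean_depth\n"
    "contig_A\t1000\t10.0\n"
    "contig_B\t3000\t10.0\n"
)
abund = relative_abundance(example)
for name, frac in sorted(abund.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{frac:.3f}")
```

For the real file, pass an open file handle instead of the `io.StringIO` example. Summing these shares over the contigs assigned to each taxon would give a rough per-taxon abundance.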

sr320 commented 5 years ago

@kubu4 I believe you also did some gene prediction - could you write up a short results section on this analysis? - will need it for metaproteomic paper

Please add methods and results @ https://docs.google.com/document/d/1amaNX86VUDcXi0UGzgmHYt8QVe6fSuVcT1oYohlFCDM/edit?ts=5b918dab

kubu4 commented 5 years ago

OK, I've done a "quick" analysis of this and have a pretty nice figure that displays the taxonomic diversity of the metagenomics data (using a Krona plot). However, I've only done this using BLASTp data (figured it would be faster). Is it more appropriate to classify things at the nucleotide level?

sr320 commented 5 years ago

Go ahead and get what you have done into the paper, and start a nucleotide-level search. (Replying to kubu4's comment of Mar 25, 2019, 7:25 AM -0700.)

kubu4 commented 5 years ago

Alrighty, I've added the info about the large metagenome assembly that I've done so far (includes nucleotide and protein-level taxonomic Krona plots).

Info has been added to Materials & Methods and the Results sections.

kubu4 commented 5 years ago

Just stumbled across some new software and visualizations for metagenomics:

http://merenlab.org/2016/06/22/anvio-tutorial-v2/

Will explore a bit more and try to use it. Looks insanely good/thorough, with great tutorials!

kubu4 commented 5 years ago

Update. Running Anvi'o, but it'll take a while. Saw this when looking at SLURM output today:

[Image: 20190404_003, screenshot of the SLURM output]

In response to the "memory skull" and the 478GB of RAM notation in the blue area at the bottom (not to mention that progress on the contigs was only ~3-5 contigs/second), I opted to put the Maker job on hold and launch this on the 500GB srlab node to see if the increased memory helps it progress faster. If not, I'll continue the Maker run and switch Anvi'o back to coenv. However, the progress I was seeing suggests that the Anvi'o analysis would take many weeks (or longer). :open_mouth:
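As a rough check on the "many weeks" estimate, here is a minimal sketch of the arithmetic. The contig count is hypothetical, since the real number of contigs in the assembly is not stated in this thread. Only the ~3-5 contigs/second rate comes from the SLURM output.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(n_contigs, contigs_per_second):
    """Days needed to process n_contigs at a constant processing rate."""
    if contigs_per_second <= 0:
        raise ValueError("contigs_per_second must be positive")
    return n_contigs / contigs_per_second / SECONDS_PER_DAY


# Hypothetical contig count, for illustration only; substitute the real
# number of sequences in final.contigs.fa.
n_contigs = 5_000_000

for rate in (3, 5):  # observed range: ~3-5 contigs/second
    print(f"{rate} contigs/s -> {estimated_days(n_contigs, rate):.1f} days")
```

This assumes the rate stays constant for the whole run, which may not hold if the job is memory-bound.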

sr320 commented 5 years ago

Let’s find a quicker option; see the review paper I posted. (Replying to kubu4's comment of Apr 4, 2019, 3:32 PM -0700.)

kubu4 commented 5 years ago

Will do. I know Anvi'o incorporates some of those programs into its pipeline (e.g. CONCOCT for sample binning).

sr320 commented 5 years ago

For reference below

LikelyBin says "reasonable time" :)

| Category | Name | Features |
|---|---|---|
| Sequencing technologies | Illumina | High throughput; low errors; short reads |
| | Ion Torrent | High throughput; low errors; short reads |
| | Pacific Biosciences | Medium throughput; high raw error rate; long reads |
| | Oxford Nanopore | Medium throughput; high raw error rate; long reads |
| Metagenomic assembly | MetaVelvet | Linux/Unix command-line tool; requires large amounts of RAM; may take several days to run |
| | MetaVelvet-SL | Extension to MetaVelvet with similar characteristics; improved detection of chimeras |
| | IDBA-UD | Linux/Unix command-line tool; requires large amounts of RAM; may take several days to run |
| | Ray Meta | Linux/Unix command-line tool; designed for high-performance computing (HPC) and uses multiple cores; uses MPI; capable of dealing with very large datasets |
| | Megahit | Linux/Unix command-line tool; lower memory and processor requirements, though only for certain options |
| | Pell et al. | Linux/Unix command-line code implemented as part of the khmer Python codebase (https://github.com/dib-lab/khmer) |
| | MetAMOS | Linux/Unix command-line tool; depends on many other software tools; may require large amounts of RAM depending on the assembler used |
| Binning | LikelyBin | Linux/Unix command-line; designed to run on simple commodity/desktop PCs in a reasonable time |
| | PHYSCIMM | Linux/Unix command-line; requires 50Gb RAM and 24 hours to build models |
| | MetaWatt | GUI-based; designed to run on desktop hardware |
| | CONCOCT | Linux/Unix command-line; depends on other software; initially used Ray Meta for assembly |
| | LSA | Linux/Unix command-line; uses 10s of Gb of RAM |
| Gene prediction | MetaGeneAnnotator | Available as Linux/Unix command-line or through web interface (web interface limited to 10Mb) |
| | Orphelia | Available as Linux/Unix command-line or through web interface (web interface limited to 30Mb) |
| | Glimmer-MG | Available as Linux/Unix command-line; depends on other software; model building requires download of all current bacterial genomes |
| | FragGeneScan | Available as Linux/Unix command-line; designed to run on commodity/desktop hardware in minutes/hours |
| | Prokka | Available as Linux/Unix command-line; depends on other software; uses parallel processing |
| Domain DBs | InterPro | A consortium of 14 protein/domain/family databases |
| | InterProScan | Available as Linux/Unix command-line, web interface, or API |
| Pathway databases | Reactome | Online resource for reactions/pathways; data available to download; accessible via web interface or via APIs |
| | KEGG | Online resource for reactions/pathways; data available to download for a fee; accessible via web interface or via APIs |
| | MetaCyc | Online resource for reactions/pathways; data available to download; accessible via web interface or via APIs |
| | WikiPathways | Online resource for reactions/pathways; data available to download; accessible via web interface or via APIs |
| Targeted gene discovery | Xander | Available as Linux/Unix command-line; depends on other software; requires user to build gene-specific models |
| Data sharing and online portals | Meta4 | Accessible via a web interface once the system has been set up; system setup requires knowledge of Linux, Apache and Perl |
| | MG-RAST | Online system with graphical user interface |
| | EBI Metagenomics | Online system with graphical user interface; requires data to be in EBI ENA |
| | IMG/M | Online system with graphical user interface |