ctmrbio / stag-mwc

StaG Metagenomic Workflow Collaboration
https://stag-mwc.readthedocs.org
MIT License

Metagenome assembly #21

Closed boulund closed 3 years ago

boulund commented 6 years ago

We need to figure out how we want to implement metagenome assembly. The discussion needs to be divided into several parts:

  1. What assembler(s) we want to use
  2. How to implement the assembly step in Snakemake so that any assembler can be used to produce output files for the steps downstream of the assembly
  3. What steps to add downstream of assembly

1. What assemblers to use

Several options are available to us. According to CAMI, MEGAHIT performed fairly well. It appears SPAdes was not included in the evaluation, but it seems to be a popular assembler nonetheless. I suggest we start by adding these two, unless someone has other preferences.

2. How to implement it

Perhaps it is easiest if we let each assembler output into its own folder, but move the assembled contigs into a folder specifically for assembled contigs, for example (the actual outputs from the two assemblers in the example look different in real life):

output_dir/spades
├── sample1
│   ├── contigs.fa
│   ├── file1.txt
│   └── file2.fq
└── sample2
    ├── contigs.fa
    ├── file1.txt
    └── file2.fq
output_dir/megahit
├── sample1
│   ├── contigs.fa
│   ├── file1.txt
│   └── file2.fq
└── sample2
    ├── contigs.fa
    ├── file1.txt
    └── file2.fq
output_dir/assembled_contigs
├── sample1.contigs.megahit.fa
├── sample1.contigs.spades.fa
├── sample2.contigs.megahit.fa
└── sample2.contigs.spades.fa

A structure like this should make it fairly straightforward for us to design downstream steps that just process the contig files found in output_dir/assembled_contigs/, making it flexible enough so that users can pick whatever assembler they want, and also make comparisons between assemblers if that is interesting. It has the potential drawback of having to run all downstream steps that depend on assembly for the output from all assemblers used, but I guess most users will just pick one assembler and stick to that.
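To make the collection step concrete, here is a minimal Python sketch of how the per-assembler contigs could be gathered into output_dir/assembled_contigs/. It is only an illustration of the layout above, not an actual StaG-mwc rule: the function name collect_contigs and the assumption that every assembler writes a file called contigs.fa are hypothetical (as noted above, the real assembler outputs look different).

```python
import shutil
from pathlib import Path


def collect_contigs(output_dir, assemblers, contigs_name="contigs.fa"):
    """Copy <assembler>/<sample>/<contigs_name> into assembled_contigs/.

    Hypothetical helper: assumes every assembler writes its contigs to a
    file with the same name, which is a simplification of real outputs.
    """
    output_dir = Path(output_dir)
    collected_dir = output_dir / "assembled_contigs"
    collected_dir.mkdir(parents=True, exist_ok=True)

    collected = []
    for assembler in assemblers:
        assembler_dir = output_dir / assembler
        if not assembler_dir.is_dir():
            continue  # this assembler was not run
        for sample_dir in sorted(p for p in assembler_dir.iterdir() if p.is_dir()):
            contigs = sample_dir / contigs_name
            if not contigs.is_file():
                continue  # no assembly finished for this sample
            # Naming scheme from the example: <sample>.contigs.<assembler>.fa
            target = collected_dir / f"{sample_dir.name}.contigs.{assembler}.fa"
            shutil.copyfile(contigs, target)
            collected.append(target)
    return collected
```

Downstream steps would then only need to look at the files in assembled_contigs/, one per sample and assembler, which is also where the drawback mentioned above shows up: two assemblers means twice as many downstream jobs.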

3. What steps to add downstream of assembly

Typical steps we could put here are binning steps, to bin the assembled contigs into "metagenome-assembled genomes" (MAGs)/"metagenomic species" (MGS), which can then be further analyzed. Two popular alternatives here could be MaxBin or CONCOCT, with a follow-up check of the bins using CheckM. Potentially, the CheckM-related genome binner GroopM could be used, but only if samples are taken from "at least 3 timepoints" (according to their documentation). That might make it difficult to implement in StaG-mwc as a general tool, because the assumptions that GroopM makes probably fail for most of our study designs.

I know there was some talk about co-abundance clustering to produce co-abundance gene groups (CAGs). Exactly how were you thinking about that, @lis4matilda?

lis4matilda commented 6 years ago

I think we should hold off on that implementation and try to finish a database-based workflow first. After that we can implement an assembly approach to create a database to use in this workflow.


boulund commented 6 years ago

Of course! No need to add it right now; I just wanted to start the discussion so we can talk it through properly before we start implementing anything.

boulund commented 6 years ago

I think a good starting point for producing a useful assembly-based workflow could be to take inspiration from the steps described in the assembly-based workflow in anvi'o. In fact, I think anvi'o is probably the premier visualization tool for assembly-based metagenomics today, so it would make sense.

boulund commented 5 years ago

We could also consider not implementing any assembly at all, and just using one of the other workflows that are available, e.g. https://metagenome-atlas.readthedocs.io/en/latest/

boulund commented 5 years ago

There are some interesting bits in metaWRAP as well: https://github.com/bxlab/metaWRAP

boulund commented 5 years ago

Started implementing assembly and binning using metaWRAP. It's in the develop branch.

boulund commented 4 years ago

Found this online bioinformatics "course" material that contains some potentially interesting bits about assembly and binning, might be useful:

https://linsalrob.github.io/ComputationalGenomicsManual/

boulund commented 3 years ago

I think the assembly bit is now covered by the functionality offered by metaWRAP. The binning and downstream steps are still not properly tested and assessed, but it is definitely possible to get metagenome assemblies out of StaG at this time.