franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License

documentation on running metaGEM using user-generated contig assemblies #56

Closed by zoey-rw 1 year ago

zoey-rw commented 3 years ago

I am trying to run metaGEM using a dataset that has already been quality filtered and assembled into contigs. I'm trying to format the data the way metaGEM wants it, but I can't get it right (maybe because I'm not "touch"ing the files in the right order for Snakemake?).

Is there any documentation on how users should input files when starting at the crossMap/binning step of the pipeline?

Thank you!

franciscozorrilla commented 3 years ago

Hi Zoey,

Thanks for raising this issue! I now realize that the documentation is lacking for this usage of metaGEM, so I will create a new page in the Wiki to address it.

In short, metaGEM creates a number of folders where it stores, and expects to find, sample-specific subdirectories for input/output files. The most important folder to configure is the dataset folder, which is used to extract the sample IDs used for wildcard expansion in the Snakefile:

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/Snakefile#L14

There is more information about this here. metaGEM is optimized for running an entire analysis from raw reads, so if you don't have raw data you can simply create empty sample-specific subfolders within the dataset folder. Alternatively, you could modify the line quoted above, replacing the dataset folder with qfiltered.
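For example, a minimal sketch of that setup from the metaGEM working directory (the folder name dataset and the sample IDs below are placeholders, so adjust them to your own configuration and data):

```bash
# Minimal sketch, assuming the dataset folder configured for metaGEM is called
# "dataset" and lives in the working directory; sample IDs are placeholders.
for sample in SRR12557734 SRR12557735; do
    # The subfolder names are what get picked up as sample IDs for wildcard expansion
    mkdir -p dataset/"$sample"
done
```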

Ok, now that we have properly configured wildcards to expand sample IDs, let's look at the crossMapSeries rule:

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/Snakefile#L390-L397

As you can see, the rule takes the entire qfiltered folder as its second input, since it cycles through this folder to map each set of reads against an assembly. This folder should contain sample-specific subdirectories with paired-end read files ending in .fastq.gz, e.g. SRR12557734_R1.fastq.gz and SRR12557734_R2.fastq.gz.
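As a rough sketch, placing already-quality-filtered reads where metaGEM expects them might look like this (paths and the sample ID are placeholders):

```bash
# Minimal sketch: copy your pre-filtered paired-end reads into a
# sample-specific subfolder of qfiltered/, following the _R1/_R2 naming.
mkdir -p qfiltered/SRR12557734
cp /path/to/my_reads_1.fastq.gz qfiltered/SRR12557734/SRR12557734_R1.fastq.gz
cp /path/to/my_reads_2.fastq.gz qfiltered/SRR12557734/SRR12557734_R2.fastq.gz
```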

Additionally, the output of the megahit assembly rule is taken as input via the shorthand rules.megahit.output. We can see what this file is called and where it lives by looking at the megahit rule itself:

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/Snakefile#L273-L278

As you can see, the contigs should be named contigs.fasta.gz and stored in sample-specific subdirectories within the assemblies folder. I should also note that, within the assembly rule, I use sed to replace all spaces with hyphens in the contig headers.

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/Snakefile#L313-L323
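If you are supplying your own assemblies, something along these lines should reproduce that layout and header cleanup (paths and the sample ID are placeholders, and the exact sed invocation in the Snakefile may differ slightly from this sketch):

```bash
# Minimal sketch: place a user-generated assembly where the megahit rule would
# have written it, replacing spaces with hyphens in the FASTA headers.
mkdir -p assemblies/SRR12557734
zcat /path/to/my_assembly.fa.gz \
    | sed '/^>/ s/ /-/g' \
    | gzip > assemblies/SRR12557734/contigs.fasta.gz
```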

In summary, if you have the dataset, assemblies, and qfiltered folders configured as described above, then you should be in good shape for cross-mapping and downstream analysis. Hope this helps, and let me know if you run into any issues!

Best wishes, Francisco

zoey-rw commented 2 years ago

A late follow-up on this: I also have co-assemblies that I would like to use as input for cross-mapping. I noticed that you tested this approach earlier in metaGEM's development; do you happen to know which code snippets I could pull from? Thanks!

franciscozorrilla commented 2 years ago

Hi Zoey, apologies for the late response. Unfortunately I abandoned this approach very early on in the development of metaGEM, so I do not have any code to share. If you are still looking, this pipeline may have some co-assembly code for you to pull from.

https://github.com/Finn-Lab/MAG_Snakemake_wf

Best, Francisco