Closed zoey-rw closed 1 year ago
Hi Zoey,
Thanks for raising this issue! Indeed, I now realize that documentation is lacking for this usage of metaGEM; I will create a new page in the Wiki to address this.
In short, metaGEM creates a number of folders where it stores, and expects to find, sample-specific subdirectories for input/output files. The most important folder to configure is the dataset folder, which is used to extract the sample IDs that are used for wildcard expansion in the Snakefile:
There is more information about this here. metaGEM is optimized for users to run an entire analysis from raw reads, so if you don't have raw data you can simply create empty sample-specific subfolders within the dataset folder. Alternatively, you could modify the line quoted above, replacing the dataset folder with qfiltered.
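As a rough illustration of the first option, the sketch below creates one empty dataset subfolder per sample found in qfiltered. It is not taken from the metaGEM code base: the function names `sample_ids` and `mirror_empty_samples` are hypothetical, and it assumes that each sample ID is simply the name of a subfolder.

```python
from pathlib import Path


def sample_ids(folder):
    """Return the sorted names of the immediate subdirectories of `folder`."""
    return sorted(p.name for p in Path(folder).iterdir() if p.is_dir())


def mirror_empty_samples(source="qfiltered", target="dataset"):
    """Create an empty target/<sample_id>/ folder for every source/<sample_id>/.

    Assumption: sample IDs are the subfolder names. Existing folders are
    left untouched, so the function is safe to run more than once.
    """
    ids = sample_ids(source)
    for sample in ids:
        (Path(target) / sample).mkdir(parents=True, exist_ok=True)
    return ids
```

Calling `mirror_empty_samples()` from the metaGEM working directory would then leave the dataset folder with the same sample IDs as qfiltered, without copying any read files.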
Ok, now that we have properly configured wildcards to expand sample IDs, let's look at the crossMapSeries rule:
As you can see, the rule takes in the entire qfiltered folder as the second input, as it will cycle through this folder to map each set of reads to an assembly. This folder should have sample-specific subdirectories which contain paired-end read files ending with fastq.gz, e.g. SRR12557734_R1.fastq.gz, SRR12557734_R2.fastq.gz.
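A quick way to check this layout before launching the rule is sketched below. The helper `missing_read_files` is hypothetical, and it assumes that each read file is prefixed with the name of its subfolder, as in the SRR12557734 example; adjust the expected names if your files follow a different pattern.

```python
from pathlib import Path


def missing_read_files(qfiltered="qfiltered"):
    """Map each sample ID to the paired-end read files that are absent.

    Assumption: reads live at qfiltered/<id>/<id>_R1.fastq.gz and
    qfiltered/<id>/<id>_R2.fastq.gz.
    """
    problems = {}
    for sample_dir in sorted(Path(qfiltered).iterdir()):
        if not sample_dir.is_dir():
            continue  # ignore stray files at the top level
        sample = sample_dir.name
        expected = [f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"]
        missing = [name for name in expected if not (sample_dir / name).is_file()]
        if missing:
            problems[sample] = missing
    return problems
```

An empty dictionary means every sample subfolder has both read files in place.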
Additionally, the output of the megahit assembly rule is taken in as an input via the shorthand rules.megahit.output. We can see what this file is called and where it lives by looking at the megahit rule itself:
As you can see, the contigs should be named contigs.fasta.gz and stored in the assemblies folder within sample-specific subdirectories. I should also note that, within the assembly rule, I use sed to replace all spaces with hyphens in the contig headers.
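For assemblies produced outside of metaGEM, something along the following lines could place them where the rule expects them. This is a Python stand-in for the sed step rather than the command used in the Snakefile, and the function names `hyphenate_header` and `stage_assembly` are hypothetical.

```python
import gzip
from pathlib import Path


def hyphenate_header(line):
    """Replace spaces with hyphens in FASTA header lines; leave other lines as-is."""
    if line.startswith(">"):
        return line.rstrip("\n").replace(" ", "-") + "\n"
    return line


def stage_assembly(fasta_path, sample, assemblies="assemblies"):
    """Write assemblies/<sample>/contigs.fasta.gz from an existing FASTA file.

    The input may be plain text or gzip-compressed (detected by its .gz suffix).
    """
    out_dir = Path(assemblies) / sample
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "contigs.fasta.gz"
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    with opener(fasta_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            dst.write(hyphenate_header(line))
    return out_path
```

For example, `stage_assembly("my_contigs.fa", "SRR12557734")` would write assemblies/SRR12557734/contigs.fasta.gz, where my_contigs.fa is a placeholder for your own file.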
In summary, if you have the dataset, assemblies, and qfiltered folders configured as described here, then you should be in good shape for cross-mapping and downstream analysis. Hope this helps, and let me know if you have any issues with this!
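To confirm that the three folders agree with one another, a small report like the one below may help. It is only a sketch: `folder_report` is a hypothetical helper, and it assumes that the sample IDs are the subfolder names under dataset and that all three folders sit under the same root.

```python
from pathlib import Path


def folder_report(root="."):
    """List, per sample ID in dataset/, which of the other folders are incomplete."""
    root = Path(root)
    ids = sorted(p.name for p in (root / "dataset").iterdir() if p.is_dir())
    report = {}
    for sample in ids:
        gaps = []
        # the assembly must be named contigs.fasta.gz inside assemblies/<id>/
        if not (root / "assemblies" / sample / "contigs.fasta.gz").is_file():
            gaps.append("assemblies")
        # at least one fastq.gz read file must exist inside qfiltered/<id>/
        reads_dir = root / "qfiltered" / sample
        if not (reads_dir.is_dir() and any(reads_dir.glob("*.fastq.gz"))):
            gaps.append("qfiltered")
        if gaps:
            report[sample] = gaps
    return report
```

An empty result means every sample listed in dataset has both an assembly and at least one read file.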
Best wishes, Francisco
A late follow-up on this: I also have co-assemblies that I would like to use as input for cross-mapping. I noticed that you tested this approach earlier in metaGEM development. Do you happen to know which code snippets I could pull from? Thanks!
Hi Zoey, apologies for the late response. Unfortunately I abandoned this approach very early on in the development of metaGEM so I do not have any code to share. If you are still looking, perhaps this pipeline may have some coassembly code for you to pull from.
https://github.com/Finn-Lab/MAG_Snakemake_wf
Best, Francisco
I am trying to run metaGEM using a dataset that has already been quality filtered and assembled into contigs. I'm trying to format the data the way metaGEM wants it, but I can't get it right (maybe because I'm not "touch"ing the files in the right order for Snakemake?).
Is there any documentation on how users should input files when starting at the crossMap/binning step of the pipeline?
Thank you!