arumugamlab / MIntO

Pipeline for Reproducible and Scalable Integration of Metagenomic and Metatranscriptomic Data
MIT License

MIntO with only Meta-transcriptomics data #8

Closed sarpiens closed 11 months ago

sarpiens commented 1 year ago

Hello,

I am interested in using your pipeline for a metatranscriptomics analysis that I'm preparing. However, I do not have any metagenomic data. Could I still use your pipeline to process only metatranscriptomic data? If so, could you please give me some pointers on which mode I should use, and on any considerations I should take into account in this particular case?

Thanks in advance! Best regards

arumugamlab commented 1 year ago

Hello,

Yes, this is possible in theory, but it comes with some constraints. The main limitation is this: to use metatranscriptomic (metaT) data by itself, MIntO (i) needs to know which "genome" each gene belongs to, and (ii) needs to know the 10 marker genes from that genome. It can then normalize the expression level of a given gene against this "housekeeping" background, which removes the confounding effect of the abundance of that species. You could provide this information via (i) reference genomes, or (ii) MAGs generated from your own metaT data. Both have pluses and minuses, and MIntO can help you with both.
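To make the normalization idea concrete, here is a minimal Python sketch. It is not MIntO's implementation: the function name, the in-memory dictionaries, and the use of the median as the per-genome summary are all assumptions made for illustration.

```python
from statistics import median


def normalize_by_markers(gene_counts, gene_to_genome, marker_genes):
    """Divide each gene's expression by its genome's marker-gene background.

    Hypothetical inputs (not MIntO file formats):
      gene_counts:    {gene_id: expression value}
      gene_to_genome: {gene_id: genome_id}
      marker_genes:   {genome_id: [marker gene_id, ...]}
    """
    # "Housekeeping" background per genome. The median is an assumption;
    # the pipeline may summarize marker genes differently.
    background = {}
    for genome, markers in marker_genes.items():
        values = [gene_counts.get(gene, 0.0) for gene in markers]
        background[genome] = median(values) if values else 0.0

    normalized = {}
    for gene, count in gene_counts.items():
        bg = background.get(gene_to_genome.get(gene), 0.0)
        # Genes whose genome has no detectable marker expression are skipped,
        # because there is no background to normalize against.
        if bg > 0:
            normalized[gene] = count / bg
    return normalized
```

With this kind of scaling, a gene with the same raw expression in two species ends up with a higher normalized value in the species whose marker genes are less expressed, which is the abundance effect the normalization is meant to remove.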

Some thoughts on the two approaches:

1. Reference genomes: These may not represent your samples very well. But if they do, then that's your best bet.
2. Generating MAGs from metaT data: This is also tricky, because contigs will not get very long as reads won't cover intergenic regions. When you only have short contigs (often maximum 100k bp), it is also not easy to get many high-quality MAGs. However, if you have numerous samples (say >50 or better >100), you increase the number of high-quality MAGs.

To give you the best advice, I would need to know characteristics of your samples, sequencing depth, number of samples and whether there are repeated samples. If you could share some more information, we can discuss further and chart a good plan.

Best wishes, Mani

sarpiens commented 1 year ago

Hello Mani,

Thanks for your quick response. We have 6 sheep rumen fluid samples with metatranscriptomics data (two for each condition) as paired-end FASTQ files; no technical replicates were generated. After applying quality control steps (filtering and trimming reads, host filtering and rRNA filtering), they have a sequencing depth of more than 30 million reads (raw sequencing depth around 40 million reads). In our case, what approach would you recommend?

Thanks in advance! Best regards

arumugamlab commented 1 year ago

Hello,

Without metaG data, to get reliable gene expression levels that you can compare across samples, you would need to run marker-gene normalization using genomes.

You can make MAGs with metaT data. But with 6 samples, you will likely generate very low-quality MAGs. metaT data will assemble into shorter contigs than metaG data, which will fragment the genomes, and with fewer samples the MAG quality will suffer. You can still try it, just to see what happens. After MAG generation, you may or may not have created high-quality MAGs. You can check the number of .fna files in metaT/8-1-binning/mags_generation_pipeline/unique_genomes/. If the directory does not exist, or if the snakemake job exited with an error, then MAG generation was unsuccessful. If you succeed, then after running gene_annotation.smk and gene_abundance.smk, you can check the output in metaT/9-mapping-profiles/MAG-genes/all.p95.filtered.profile.abund.all.maprate.txt to see what percentage of reads from each of your samples maps to the MAGs. If you get very low numbers, you should probably give up on MAGs.
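The directory check described above can be scripted. The following is a small Python sketch; the function name is hypothetical, and it only counts files, so it says nothing about the quality of the MAGs.

```python
from pathlib import Path


def count_mags(unique_genomes_dir):
    """Return the number of .fna files, or None if the directory is missing."""
    directory = Path(unique_genomes_dir)
    if not directory.is_dir():
        # A missing directory is treated as "MAG generation did not succeed".
        return None
    return sum(1 for path in directory.glob("*.fna") if path.is_file())


# Path taken from the comment above, relative to the working directory.
n_mags = count_mags("metaT/8-1-binning/mags_generation_pipeline/unique_genomes")
if n_mags is None:
    print("unique_genomes/ not found: MAG generation was probably unsuccessful")
else:
    print(f"{n_mags} MAG(s) found")
```

The mapping-rate file is left out on purpose, since its column layout is not described here; it is best inspected by hand.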

So, I would recommend that you try the reference genome mode. MIntO will annotate the genomes and generate gene and functional profiles that are comparable across samples. If you don't know which reference genomes would be suitable, you can check the output from taxonomic profiling with MetaPhlAn or mOTUs. That can give you an idea of which reference genomes to gather for your study. If you are lucky and the proportion of "unmapped" or "unknown" species is not too high, then your analysis will cover most species in the samples.
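As a rough illustration of shortlisting species from a taxonomic profile, here is a Python sketch. It assumes a MetaPhlAn-style tab-separated table in which '#' lines are headers, the first column is a '|'-separated clade path, and the third column is the relative abundance; check this against your own output files, and note that mOTUs output is laid out differently. The function name is hypothetical.

```python
def top_species(profile_lines, n=10):
    """Return the n most abundant species as (name, relative_abundance) pairs."""
    species = []
    for line in profile_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a row in the assumed layout
        # Keep only rows whose deepest rank is species (s__), which drops
        # higher ranks and strain-level (t__) rows.
        deepest = fields[0].split("|")[-1]
        if not deepest.startswith("s__"):
            continue
        # Assumption: relative abundance sits in the third column.
        species.append((deepest[len("s__"):], float(fields[2])))
    species.sort(key=lambda item: item[1], reverse=True)
    return species[:n]
```

The function takes any iterable of lines, so an open file handle for one sample's profile can be passed directly. The resulting names are only a starting point for collecting reference genomes.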

We have just released v2.0.0-beta.1. Most of the functionality I mentioned above exists only in this recent version. Things should run smoothly from installation until final analysis. We have also included a test script here. Please clone this version using:

git clone https://github.com/arumugamlab/MIntO.git
cd MIntO
git checkout tags/2.0.0-beta.1

and let me know if you still run into issues. If you have already run some of the steps and reached the 5-1-sortmerna step, I can walk you through how to continue the analysis with the new version, as the old assembly.yaml and mapping.yaml files will not work with it.

I am happy to provide more advice as you go along. Hope our pipeline can help you interpret the biology in your study.

Best wishes, Mani

arumugamlab commented 11 months ago

Closing as there was no feedback or follow up. Please reopen if it continues to be an issue.