bonsai-team / matam

Mapping-Assisted Targeted-Assembly for Metagenomics
GNU Affero General Public License v3.0

Run MATAM with large dataset #71

Closed by yxxue 5 years ago

yxxue commented 5 years ago

Hi, I'm trying to run MATAM on a large rRNA dataset (whole RNA sequencing, 5Gb PE), but it doesn't seem to work well: the first step takes a really long time to run, so I terminated the program. Can MATAM be used for large-scale datasets? If so, how should I run it properly? Thanks.

ppericard commented 5 years ago

Hi @yxxue ,

In a whole RNA sequencing dataset with no rRNA depletion, you should expect 50-80% of all reads to be rRNA sequences. This means that with a big dataset like yours, you will have millions of rRNA reads. In comparison, whole metagenomic datasets usually contain fewer than 100,000 rRNA reads, sometimes only a few thousand.
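
To give a sense of scale, here is a rough back-of-the-envelope estimate in Python. It is only a sketch: it assumes "5Gb" means 5 gigabases of sequence and a read length of 150 bp, neither of which is stated in this thread, so substitute the real values for your run.

```python
def estimate_rrna_reads(total_bases, read_length, rrna_fraction):
    """Estimate how many reads in a dataset are rRNA reads."""
    total_reads = total_bases // read_length
    return int(total_reads * rrna_fraction)


TOTAL_BASES = 5_000_000_000  # assumption: "5Gb" read as 5 gigabases
READ_LENGTH = 150            # assumption: read length is not given in the thread

# 50% and 80% are the bounds of the expected rRNA fraction quoted above.
for fraction in (0.5, 0.8):
    n_reads = estimate_rrna_reads(TOTAL_BASES, READ_LENGTH, fraction)
    print(f"{fraction:.0%} rRNA -> about {n_reads / 1e6:.1f} million rRNA reads")
```

Under those assumptions the dataset would hold roughly 17 to 27 million rRNA reads, far more than the fewer than 100,000 typical of a whole metagenomic dataset.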

The first step of MATAM consists of filtering the rRNA reads from your dataset and aligning them onto the clustered SILVA database. This is done with SortMeRNA, which is one of the most sensitive and fastest dedicated tools currently available for this task. However, aligning millions of reads can indeed take some time. In previous analyses we ran on whole RNA-seq datasets of similar size, this alignment step could take up to 2 weeks. If possible, increasing the number of CPUs given to MATAM will give you a direct speed-up of this step. A new version of SortMeRNA is also being developed, and we will integrate it into MATAM once it is more stable.

For whole RNA sequencing datasets, we also highly recommend using the MATAM option --coverage_threshold 500, which dynamically sub-samples reads from highly conserved regions and allows the following steps of MATAM to run much more quickly, with only a small loss of information. Basically, if you can wait until the filtering/alignment step finishes, the following steps should be much faster with this option.
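
As an illustration, the sketch below assembles such a command line in Python without running it. Only --coverage_threshold 500 comes from this thread. The script name matam_assembly.py and the -i, -d and --cpu flags are assumptions about the MATAM command line, and both paths are hypothetical placeholders, so check everything against the help output of your installed version.

```python
import shlex


def build_matam_command(reads, db_prefix, cpu=8, coverage_threshold=500):
    """Return an argv list for a MATAM run (script name and most flags are assumed)."""
    return [
        "matam_assembly.py",        # assumed entry-point script name
        "-i", reads,                # assumed flag: input reads file
        "-d", db_prefix,            # assumed flag: clustered reference database
        "--cpu", str(cpu),          # assumed flag: number of CPUs to use
        "--coverage_threshold", str(coverage_threshold),  # option named in this thread
    ]


# Hypothetical paths, for illustration only.
cmd = build_matam_command("reads.fastq", "path/to/clustered_silva_db", cpu=16)
print(shlex.join(cmd))
```

Once the flags are verified, the same list can be passed to subprocess.run(cmd, check=True) to launch the run.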

Another way to speed up the assembly is to sub-sample your initial dataset. Since the majority of your reads should be rRNA sequences, you could start with 1% or 10% of your total dataset. You might lose bacterial species with very low coverage, but MATAM should be able to assemble the sequences from the most abundant species even with a small proportion of your initial dataset.
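
One possible way to do this sub-sampling is sketched below in Python. It is not part of MATAM. It assumes uncompressed, four-line-per-record FASTQ files with the two mates stored in the same order in separate files, and it keeps or drops both mates together so the pairing is preserved. The function names and file names are hypothetical.

```python
import itertools
import random


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (header, sequence, separator, quality)."""
    while True:
        record = tuple(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return the number kept."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed makes the sub-sample reproducible
    kept = 0
    # Assumes both files hold the same number of records, in the same order.
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        if rng.random() < fraction:
            r1_out.writelines(rec1)
            r2_out.writelines(rec2)
            kept += 1
    return kept


# Example usage with hypothetical file names, keeping about 10% of the pairs:
# with open("reads_R1.fastq") as r1, open("reads_R2.fastq") as r2, \
#         open("sub_R1.fastq", "w") as o1, open("sub_R2.fastq", "w") as o2:
#     print(subsample_pairs(r1, r2, o1, o2, fraction=0.10))
```

Because each pair is kept independently, the output size is only approximately the requested fraction. The benefit is that the input is streamed, so memory use stays flat even for very large files.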

In any case, don't hesitate to come back to us. We are also very eager to improve MATAM's performance on very big datasets (in terms of rRNA reads) such as whole RNA-seq, and we appreciate feedback from users.

yxxue commented 5 years ago

Thanks for your comments :) They really help.