Hi Kun,
That really depends on your data, but as a general rule, the best solution would be to assemble each sample separately. This allows the highest precision, the lowest noise, and the lowest probability of misassemblies. However, this may not be possible, or simply not feasible, depending on the size, the number and the dynamic range of the libraries.
Personally, I prefer to assemble each sample separately, merging together the libraries of the different replicates. This should ensure a good sequencing depth; the dynamic range remains unchanged, but capping the coverage with Trinity allows better coverage of the gene body of lowly expressed transcripts.
However, when samples are just too low in coverage, or the distribution is too skewed toward a few highly expressed genes, I may extend the "bulking" logic to more than just replicates. This increases the noise in the signal, as more AS events may be present, and increases the complexity of the assembly procedure (and of the graph), but it may make it possible to recover more information for the "shadowed" transcripts. I try to keep the dynamic range within the datasets I'm assembling as low as possible... concatenating all of them is the last (desperate) solution.
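For illustration only, here is a minimal sketch of that per-sample, replicates-merged strategy. The sample layout, file names, resource settings and coverage-cap value are assumptions, not values taken from this thread:

```python
#!/usr/bin/env python3
"""Sketch: assemble each sample separately with Trinity, merging the
fastq files of its replicates. Sample layout, file names and the
coverage cap are placeholders for illustration only."""

import subprocess

# Hypothetical sample -> replicate paired-end fastq layout.
samples = {
    "sampleA": [("A_rep1_1.fq.gz", "A_rep1_2.fq.gz"),
                ("A_rep2_1.fq.gz", "A_rep2_2.fq.gz")],
    "sampleB": [("B_rep1_1.fq.gz", "B_rep1_2.fq.gz"),
                ("B_rep2_1.fq.gz", "B_rep2_2.fq.gz")],
}

for sample, replicates in samples.items():
    # Trinity accepts comma-separated read lists, so replicates can be
    # merged without concatenating the fastq files beforehand.
    left = ",".join(r1 for r1, _ in replicates)
    right = ",".join(r2 for _, r2 in replicates)
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--max_memory", "50G",
        "--CPU", "16",
        # Cap per-transcript read coverage (in silico normalization) so
        # highly expressed genes do not dominate the assembly graph;
        # the value 50 is illustrative.
        "--normalize_max_read_cov", "50",
        "--output", f"trinity_{sample}",
    ]
    subprocess.run(cmd, check=True)
```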
I hope this will help you,
Andrea
Hi Andrea,
Thank you for your reply, it helps me a lot. I am now struggling to merge the assemblies generated from different samples; do you have any suggestions about how to get a better meta-assembly result?
Best, Kun
Hi Kun,
It depends on the final goal of the merging. If the purpose is annotating a genome, this pipeline does a good job of filtering low-quality alignments and, at the same time, collapsing/clustering redundant mRNA models. Preprocessing + PASA are the key: comparing multiple assembly algorithms and the final models against the genome allows excluding low-quality assemblies, and binning into "putative" loci allows a finer identification of redundant assemblies even when they are truncated/incomplete.
The pipeline used to create a training set of gene models for ab initio tools is more aggressive in terms of precision and tends to "purge" more transcripts and models to keep only the high-quality ones, while the pipeline for mapping gives a broader (and more complete) spectrum of models. Which one works best depends on your aim.
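As a hedged illustration of the genome-guided merge described above (not this project's exact pipeline), the core step would be PASA's alignment assembly over the concatenated per-sample assemblies; the config file, input names and CPU count below are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: merge transcript assemblies from several samples against a
genome with PASA's alignment-assembly step. File names, the config
contents and the CPU count are assumptions, not the author's setup."""

import subprocess

genome = "genome.fasta"                          # assumed reference genome
transcripts = "all_samples.transcripts.fasta"    # concatenated per-sample assemblies

cmd = [
    "Launch_PASA_pipeline.pl",
    "-c", "alignAssembly.config",    # PASA alignment-assembly config (database name, etc.)
    "-C",                            # create a new PASA database
    "-R",                            # run the alignment and assembly step
    "-g", genome,
    "-t", transcripts,
    "--ALIGNERS", "blat,gmap",       # align with two aligners, keep validated alignments
    "--CPU", "8",
]
subprocess.run(cmd, check=True)
```

Alignments that fail PASA's validation are discarded, and compatible overlapping alignments are clustered into the same assembly, which is what collapses redundant or truncated transcripts mapping to the same locus.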
If your needs are different, because you need to keep track of which original samples were clustered together, or you cannot/do not want to make use of the genome (e.g. too distant genetic background, not available, etc.), you may look for a solution in genome-independent transcriptome reconstruction pipelines like the excellent evigene --> http://arthropods.eugenes.org/EvidentialGene/trassembly.html
I hope this is of some use. I'm aware it is kind of vague, but the right pipeline strongly depends on your aim and your needs, on which I don't have enough details to be more specific.
Best,
Andrea
I'll close the issue for now.
Hi,
If there are several or many RNA samples, should I merge the fastq files and then map and assemble, OR map and assemble each sample separately?
Best, Kun