bcgsc / NanoSim

Nanopore sequence read simulator

Simulated reads from transcripts with tpm equal 0 #116

Closed by agshumate 3 years ago

agshumate commented 3 years ago

Hi, I have many transcripts in my expression.tsv which are not expressed and thus have TPM 0. However, many reads from these are still being simulated in the 'unaligned' file. Is this the intended behavior? Thanks, Alaina

SaberHQ commented 3 years ago

Hey Alaina,

Here is a good explanation of unaligned reads from my colleague Ka Ming Nip:

The unaligned reads are meant for representing garbage reads from the basecallers. The unaligned and aligned reads used to be outputted within the same file. We decided to split them into separate files to give more flexibility for the users. Depending on what you do with the simulated reads, you can ignore the unaligned reads.

And here is some more information about unaligned reads from the Trans-NanoSim paper:

Unaligned reads may provide crucial information about the nature of ONT sequencing experiments, and thus we chose to model the length distribution of the unaligned reads as well. For this purpose, we extract sequences from reference transcripts based on their length distribution and apply an arbitrarily high error rate (default, 90%). However, because it is impossible to trace their source transcript molecule, unaligned reads are not included in the error rate analysis.
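To make the quoted description concrete, here is a minimal sketch of the idea: cut a window of a chosen length out of a reference transcript and corrupt it at a high per-base error rate. This is not NanoSim's code. The function name `make_unaligned_read` and its parameters are hypothetical, and using substitutions only is a simplifying assumption made for illustration.

```python
import random

BASES = "ACGT"


def make_unaligned_read(transcript, read_length, error_rate=0.9, rng=None):
    """Cut a random window from `transcript` and corrupt it base by base.

    Hypothetical helper for illustration only; substitutions are the only
    error type modelled here, which is an assumption of this sketch.
    """
    rng = rng or random.Random()
    # Never ask for more bases than the transcript has.
    read_length = min(read_length, len(transcript))
    start = rng.randrange(len(transcript) - read_length + 1)
    window = transcript[start:start + read_length]

    noisy = []
    for base in window:
        if rng.random() < error_rate:
            # Swap in a base that differs from the original one.
            noisy.append(rng.choice([b for b in BASES if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_unaligned_read("ACGT" * 50, read_length=60, rng=rng))
```

At a 90% error rate the result keeps the length of the source window but almost none of its sequence, which is consistent with such reads failing to align.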

To conclude: Trans-NanoSim uses expression levels only when simulating aligned reads, and those are the reads I would suggest using in your analysis. In other words, Trans-NanoSim does not consider expression levels when simulating unaligned reads, which is why transcripts with a TPM of 0 can still appear in the unaligned file. I also encourage you to look at the Methods section of the paper to learn in more detail how reads are generated.
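The difference can be illustrated with a small sketch, assuming a dictionary of TPM values keyed by transcript name. It is not how NanoSim is implemented. The function `sample_source_transcripts` is hypothetical, and picking source transcripts uniformly for unaligned reads is an assumption; the only point taken from this thread is that expression levels are ignored for them.

```python
import random


def sample_source_transcripts(tpm, n_reads, use_expression, rng=None):
    """Pick a source transcript name for each simulated read.

    Hypothetical helper: with use_expression=True the draw is weighted by
    TPM (aligned reads); with use_expression=False every transcript is
    equally likely (an assumed stand-in for unaligned reads).
    """
    rng = rng or random.Random()
    names = list(tpm)
    if use_expression:
        weights = [tpm[name] for name in names]
    else:
        weights = [1.0] * len(names)
    return rng.choices(names, weights=weights, k=n_reads)


if __name__ == "__main__":
    # Made-up expression values for illustration.
    tpm = {"tx_a": 120.0, "tx_b": 0.0, "tx_c": 30.0}
    aligned = sample_source_transcripts(tpm, 1000, True, random.Random(1))
    unaligned = sample_source_transcripts(tpm, 1000, False, random.Random(1))
    print("tx_b among aligned sources:", "tx_b" in aligned)
    print("tx_b among unaligned sources:", "tx_b" in unaligned)
```

A transcript with a weight of 0 is never drawn in the expression-weighted case, while the unweighted draw can select it freely, which mirrors the behavior described in the question.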

We will be very happy to help with any further questions. Thanks for using the NanoSim pipeline.

agshumate commented 3 years ago

That makes sense, thank you for your help!