bcgsc / NanoSim

Nanopore sequence read simulator
Other
217 stars 51 forks source link

Non-unique Seq_name IDs in aligned_error_profile #151

Closed lauradunphy closed 2 years ago

lauradunphy commented 2 years ago

Hi,

Overall, great tool!

In working with the aligned_error_profile files created by simulator.py [genome], I noticed that Seq_name is not necessarily unique to a given read because multiple reads can have the same starting position. For example, if two reads from modelX both start at position 1000, both sets of errors will have the same Seq_name ("modelX_1000") and there is not an easy way to link them back to their respective reads. This makes it hard to accurately summarize errors by individual read and therefore difficult to resolve the discrepancies in final simulated read lengths discussed in #128.

Would it be possible to include the full read name in the aligned_error_profile files and/or to add the read index from the header to Seq_name?

Apologies if I missed a previous update.

Many thanks, Laura

kmnip commented 2 years ago

Hi Laura,

Which version are using?

As far as I know, the output read names should be unique. For example, a read's name could end with _aligned_7, where 7 is a number unique to the read.

Ka Ming

lauradunphy commented 2 years ago

Hi Ka Ming,

Thanks for your reply. I'm using v3.0.0. The read headers are indexed as you describe within the aligned_reads.fasta file.

My issue is within the separate _error_profile file that accompanies the FASTA and provides locations of errors for each read. The version I'm running only provides the first piece of the read header (outputName_readPosition) in this file. This is a problem when multiple reads start at the same position. I've pasted an example below using the E coli R9 reads provided in your tutorial.

Please let me know if this file format has been updated in a newer version.

Many thanks, Laura

ecoli_R9_2D_aligned_error_profile

Seq_name Seq_pos error_type error_length ref_base seq_base ENA|U00096|U00096_1494333 8037 ins 3 --- CCA ENA|U00096|U00096_1494333 8015 mis 1 A T ENA|U00096|U00096_1494333 8014 mis 1 G A ENA|U00096|U00096_1494333 8005 del 2 AT -- ENA|U00096|U00096_1494333 7948 mis 3 TAA CTG ENA|U00096|U00096_1494333 7880 del 2 TT --

kmnip commented 2 years ago

Thanks, I looked over the code in versions 3.0.0 ~ 3.0.2 and the master branch. It seems like both genome and metagenome modes do not use the full read names in the error profiles. This isn't an issue in the transcriptome mode. I do not know whether this was intentional, but I agree with everything you said. :) I will discuss this with the lead developers and we will look into fixing it.

lauradunphy commented 2 years ago

Awesome! Appreciate it :)

Many thanks, Laura