Simulated Dataset in the ESPRESSO paper

Xinglab / espresso

Simulated Dataset in the ESPRESSO paper #23

Status: Closed (zzare-umd closed this issue 1 year ago)

zzare-umd commented 1 year ago

Hi,

I am using the simulated dataset from your paper, and I found that the FASTA files contain two kinds of read names. For example, there are "ENST00000392433_274_unaligned_4132802_F_0_473_0" and "ENST00000392433_274_aligned_4132802_F_0_473_0". In this example, does the "unaligned" tag in the read name mean that this read cannot be aligned to the transcript "ENST00000392433"?

Regards, Zahra

EricKutschera commented 1 year ago

The simulated reads in the paper were generated with NanoSim. From https://github.com/bcgsc/NanoSim#2-simulation-stage-1:

Each reads has "unaligned", "aligned", or "perfect" in the header determining their error rate. "unaligned" means that the reads have an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as training reads. "perfect" reads have no errors.

[...]

unaligned suggesting it should be unaligned to the reference
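
For anyone who wants to separate these read types programmatically, here is a minimal Python sketch that splits a read name from the simulated FASTA files into its fields. It assumes the layout `<reference>_<start>_<type>_<index>_<strand>_<head>_<middle>_<tail>`, which is inferred from the example read names in this thread and the NanoSim README rather than confirmed against the NanoSim source. The class name, function name, and field names are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple

# Read types described in the NanoSim README quoted above.
KNOWN_TYPES = {"unaligned", "aligned", "perfect"}


class SimulatedReadName(NamedTuple):
    """Fields of a simulated read name (layout is an assumption, see lead-in)."""
    reference: str   # e.g. transcript ID the read was simulated from
    start: int       # start position on the reference
    read_type: str   # "unaligned", "aligned", or "perfect"
    index: int       # sequence index assigned by the simulator
    strand: str      # "F" or "R"
    head_len: int    # assumed: length of the head region
    middle_len: int  # assumed: length extracted from the reference
    tail_len: int    # assumed: length of the tail region


def parse_read_name(name: str) -> SimulatedReadName:
    """Parse one read name, with or without the leading '>' of a FASTA header."""
    name = name.strip().lstrip(">")
    # Split from the right so underscores inside the reference name survive.
    parts = name.rsplit("_", 7)
    if len(parts) != 8:
        raise ValueError(f"unexpected read name layout: {name!r}")
    ref, start, read_type, index, strand, head, middle, tail = parts
    if read_type not in KNOWN_TYPES:
        raise ValueError(f"unknown read type {read_type!r} in {name!r}")
    return SimulatedReadName(
        ref, int(start), read_type, int(index), strand,
        int(head), int(middle), int(tail),
    )


if __name__ == "__main__":
    for example in (
        "ENST00000392433_274_unaligned_4132802_F_0_473_0",
        "ENST00000392433_274_aligned_4132802_F_0_473_0",
    ):
        parsed = parse_read_name(example)
        print(parsed.reference, parsed.read_type, parsed.middle_len)
```

Under this reading, both example reads were simulated from ENST00000392433. The type field only records the error profile applied to the read: per the README text above, "unaligned" reads carry an error rate over 90%, so an aligner is not expected to place them back on that transcript.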