alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015

Duplicated read IDs using simulate_experiment_countmat() #50

Open vllorens opened 6 years ago

vllorens commented 6 years ago

Hello,

I have started simulating an experiment with 2M reads and I realised that the resulting FASTA files contain 4000002 lines, that is, one more read than expected.

I then found out that there are two different reads with the same read ID:

read1000001/649967984 Bacsa_2669 3110994..3111788(-)(NC_015164) [Bacteroides salanitronis BL78, DSM 18170];mate1:353-452;mate2:329-428
read1000001/649967984 Bacsa_2669 3110994..3111788(-)(NC_015164) [Bacteroides salanitronis BL78, DSM 18170];mate1:393-492;mate2:566-665

The number of duplicated read IDs grows with the size of the simulation. For instance, simulating 4M reads generates 3 duplicated read IDs: read1000001, read2000001 and read3000001.
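
For anyone who wants to check their own output, here is a minimal sketch in Python (not part of polyester) that counts repeated read IDs among FASTA header lines. The function names are made up for illustration, and it assumes the read ID is everything between `>` and the first `/` in the header, as in the example above.

```python
from collections import Counter
from typing import Dict, Iterable


def read_id(header: str) -> str:
    """Return the read ID from a header like '>read12/transcript;mate1:...'."""
    # Assumption: the ID is the text between '>' and the first '/'.
    return header.lstrip(">").split("/", 1)[0]


def duplicated_read_ids(lines: Iterable[str]) -> Dict[str, int]:
    """Map each read ID that occurs more than once to its number of occurrences."""
    counts = Counter(read_id(line.strip()) for line in lines if line.startswith(">"))
    return {rid: n for rid, n in counts.items() if n > 1}


# Shortened, made-up records that mimic the headers shown above.
example = [
    ">read1000000/tx_a;mate1:1-100;mate2:150-249",
    "ACGT",
    ">read1000001/tx_b;mate1:353-452;mate2:329-428",
    "ACGT",
    ">read1000001/tx_b;mate1:393-492;mate2:566-665",
    "ACGT",
]
print(duplicated_read_ids(example))  # {'read1000001': 2}
```

For a real file, passing an open file handle as `lines` should work, since the function only iterates over lines.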

This can cause problems in downstream analysis, as some tools may raise an error when they encounter the same read ID twice. Also, for comparison purposes, I'd expect the number of reads produced to exactly match the number of reads in the provided count matrix. I can remove one read from each pair of duplicated IDs myself, but I'd rather have this fixed in the polyester output.
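
As a stopgap, the manual clean-up could look roughly like the sketch below. It is a Python illustration, not a polyester feature: the function name is hypothetical, the read ID is again assumed to be the text between `>` and the first `/`, and it keeps the first record for each ID and drops later ones.

```python
from typing import Iterable, Iterator, Set


def drop_duplicate_reads(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines, skipping any record whose read ID was already seen."""
    seen: Set[str] = set()
    keep = False  # whether the record currently being read should be emitted
    for line in lines:
        if line.startswith(">"):
            # Assumption: the read ID is the text between '>' and the first '/'.
            rid = line[1:].split("/", 1)[0]
            keep = rid not in seen
            seen.add(rid)
        if keep:
            yield line


# Made-up records: read2 appears twice, only its first record is kept.
records = [
    ">read1/tx_a;mate1:1-4", "AAAA",
    ">read2/tx_b;mate1:1-4", "CCCC",
    ">read2/tx_b;mate1:5-8", "GGGG",
    ">read3/tx_c;mate1:1-4", "TTTT",
]
cleaned = list(drop_duplicate_reads(records))
print(len(cleaned) // 2)  # 3 records remain
```

For paired-end output, the same filter would need to be applied to both mate files so that the pairs stay in sync.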

I have had a quick look at the sgseq() function, and the issue seems to be related to the offset value. I presume that simulate_experiment() also yields some duplicated read IDs, as it uses sgseq() too.
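
To illustrate how an offset problem could produce exactly this pattern, here is a toy model in Python. It is an assumption, not a reading of the sgseq() source: it supposes reads are numbered in fixed-size chunks and that the buggy variant takes an inclusive range of `chunk + 1` items, so the last read of each chunk is written again as the first read of the next. All names and sizes are hypothetical.

```python
from collections import Counter
from typing import List


def chunked_ids(n_reads: int, chunk: int, inclusive_end: bool) -> List[int]:
    """Number reads 1..n_reads chunk by chunk, as a chunked writer might."""
    ids: List[int] = []
    offset = 1
    while offset <= n_reads:
        # Buggy variant: offset..offset+chunk covers chunk + 1 reads.
        # Fixed variant: offset..offset+chunk-1 covers exactly chunk reads.
        last = offset + chunk if inclusive_end else offset + chunk - 1
        ids.extend(range(offset, min(last, n_reads) + 1))
        offset += chunk
    return ids


def duplicates(ids: List[int]) -> List[int]:
    """Return the sorted IDs that occur more than once."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


# Scaled-down stand-ins for 2M and 4M reads with 1M-read chunks.
print(duplicates(chunked_ids(20, 10, inclusive_end=True)))   # [11]
print(duplicates(chunked_ids(40, 10, inclusive_end=True)))   # [11, 21, 31]
print(duplicates(chunked_ids(40, 10, inclusive_end=False)))  # []
```

In this model the 20-read run produces 21 records with one repeated ID, and the 40-read run produces three repeated IDs, which mirrors the one and three duplicates observed for 2M and 4M reads. Whether polyester behaves this way would need to be checked in the code itself.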

I'll look into this in more detail in the coming days to see if I can solve it myself. In the meantime, thanks in advance for your comments on this!

JMF47 commented 6 years ago

Hi @vllorens, what serendipitous timing. I JUST found this out myself as well. Thank you for already looking into it. I will continue with the search too, and we can keep each other posted here. Many thanks!