I have started simulating an experiment with 2M reads and realised that the resulting FASTA files contain 4000002 lines (two lines per read, so 2000001 reads), that is, one more read than expected.
I then found out that there are two different reads with the same read ID:
The number of duplicated read IDs grows with the size of the simulation. For instance, simulating 4M reads generates 3 duplicated read IDs: read1000001, read2000001 and read3000001.
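For anyone who wants to check their own output, here is a minimal Python sketch (not part of polyester) that lists read IDs occurring more than once in a FASTA file. It assumes the read ID is the part of the header before the first `/`; the `sep` parameter and the `reads.fasta` path mentioned in the comment are hypothetical and should be adapted to the actual headers and file names.

```python
from collections import Counter
import io

def duplicated_read_ids(fasta_lines, sep="/"):
    """Return {read_id: count} for read IDs that occur more than once.

    Assumption: the read ID is the header text between '>' and the first `sep`.
    """
    ids = (
        line[1:].strip().split(sep, 1)[0]
        for line in fasta_lines
        if line.startswith(">")
    )
    counts = Counter(ids)
    return {read_id: n for read_id, n in counts.items() if n > 1}

# Tiny in-memory example; for a real file, pass open("reads.fasta") instead.
example = io.StringIO(">read1/tx1\nACGT\n>read2/tx1\nGGCC\n>read2/tx2\nTTAA\n")
print(duplicated_read_ids(example))  # → {'read2': 2}
```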
This can cause problems in downstream analysis, as some tools raise an error when they encounter the same read ID twice. Also, for comparison purposes, I'd expect the number of reads produced to exactly match the number of reads in the provided count matrix. I can remove the reads with duplicated IDs myself, but I'd rather have this fixed in the polyester output itself.
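As a stopgap, the removal workaround could look roughly like the Python sketch below (again, not part of polyester). It keeps the first record for each read ID and drops any later record that reuses it, assuming the ID is the header text before the first `/`. For paired-end output, the same records would have to be dropped from both mate files to keep them in sync.

```python
def drop_duplicate_ids(fasta_lines, sep="/"):
    """Yield FASTA lines, skipping every record whose read ID was already seen.

    Assumption: the read ID is the header text between '>' and the first `sep`.
    """
    seen = set()
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            read_id = line[1:].strip().split(sep, 1)[0]
            keep = read_id not in seen  # only the first record per ID is kept
            seen.add(read_id)
        if keep:
            yield line

records = [">read1/tx1", "ACGT", ">read2/tx1", "GGCC", ">read2/tx2", "TTAA"]
print(list(drop_duplicate_ids(records)))
# → ['>read1/tx1', 'ACGT', '>read2/tx1', 'GGCC']
```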
I have had a quick look at the sgseq() function, and the issue seems to be related to the offset value. I presume that simulate_experiment() also yields some duplicated read IDs, as it uses sgseq() too.
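The reported IDs sit just past multiples of 1000000, which is the pattern one would see if reads were numbered in chunks and each chunk ended on the ID that the next chunk starts with. The Python sketch below is purely illustrative and is not polyester's code: the function names, the chunked numbering and the `inclusive_end_bug` flag are all hypothetical, and small numbers (40 reads, chunks of 10) stand in for the real sizes.

```python
from collections import Counter

def assign_ids(n_reads, chunk_size, inclusive_end_bug):
    """Number reads chunk by chunk (hypothetical scheme, not polyester's code).

    With inclusive_end_bug=True each chunk runs one ID too far, so its last ID
    is reused as the first ID of the next chunk.
    """
    ids = []
    offset = 1
    while offset <= n_reads:
        last = offset + chunk_size if inclusive_end_bug else offset + chunk_size - 1
        last = min(last, n_reads)
        ids.extend(range(offset, last + 1))
        offset += chunk_size
    return ids

def duplicates(ids):
    """Return the sorted IDs that were assigned more than once."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)

buggy = assign_ids(40, 10, inclusive_end_bug=True)
fixed = assign_ids(40, 10, inclusive_end_bug=False)
print(len(buggy), duplicates(buggy))  # → 43 [11, 21, 31]
print(len(fixed), duplicates(fixed))  # → 40 []
```

Scaled up to 4M reads in chunks of 1M, the same off-by-one would duplicate IDs 1000001, 2000001 and 3000001, which matches the report, but the actual cause in sgseq() still needs to be confirmed.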
I'll look into this in more detail over the coming days to see if I can solve it myself. In the meantime, thanks in advance for your comments on this!
Hi @vllorens, what serendipitous timing. I JUST found this out myself as well. Thank you for already looking into it. I will continue with the search too, and we can keep each other posted here. Many thanks!