DaehwanKimLab / tophat

Spliced read mapper for RNA-Seq
http://ccb.jhu.edu/software/tophat
Boost Software License 1.0

"Searching for junctions via segment mapping" slow with 1 million reads #20

Open fcarelli opened 8 years ago

fcarelli commented 8 years ago

Hello,

I have an issue running TopHat2 on a set of strand-specific single-end reads against the mouse genome (mm10). The TopHat analysis (I tried both TopHat 2.0.13 and 2.1.0) gets stuck at the "Searching for junctions via segment mapping" step, and the computing cluster kills the jobs after three days. When using the same TopHat settings in other analyses (alignments to the human or macaque genomes) I had no problems, and "Searching for junctions via segment mapping" was fairly fast (between 1 and 3 hours, depending on the number of reads).

This is the tophat command I used (same for mouse, human and macaque):

tophat -p 12 -o $SAMPLE_NAME -a 8 -i 40 -I 1000000 --read-realign-edit-dist 0 --microexon-search --library-type fr-firststrand $BOWTIE_INDEX $FASTQ

I then checked what happens if I map only 1 million reads (tested on human and mouse). This time human was very fast (around 30 minutes on one core), while alignment to mouse took more than 12 hours to complete (again on one core), with "Searching for junctions via segment mapping" being the limiting step (11 hours). I also tried removing the --microexon-search option, but the mouse run (with 1M reads) still took around 12 hours.
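For anyone wanting to reproduce this kind of quick timing test, taking the first N reads of a FASTQ file is straightforward since each record spans exactly four lines. A minimal sketch (file names here are placeholders, not from the original report):

```python
def head_fastq(src_lines, n_reads):
    """Return the lines for the first n_reads FASTQ records.

    Each FASTQ record spans 4 lines: @id, sequence, '+', qualities.
    """
    return src_lines[:4 * n_reads]


def subsample_file(in_path, out_path, n_reads=1_000_000):
    """Write the first n_reads records of in_path to out_path,
    streaming line by line so large files are never fully loaded."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            if i >= 4 * n_reads:
                break
            fout.write(line)
```

The resulting subset file can then be passed to tophat in place of the full $FASTQ to compare per-species runtimes on identical input sizes.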

I built all the Bowtie2 indexes the same way (using Bowtie 2.2.4), and for mouse I built the index twice, from both the toplevel.dna assembly (containing mouse haplotypes) and the primary_assembly.dna (without haplotypes), but nothing changed. I don't think the issue is related to memory availability, since I also ran some tests on a large-memory machine and saw the same problem.

Do you have any idea what might be causing the program to get stuck at this step for only one species?

Thanks for your support

gpertea commented 8 years ago

I realize that this is a very late reply, but it would be good if this or similar issues were solved or prevented. I encountered a similar issue while working at the SeqAn upgrade and I found a subtle string conversion bug that caused segment_juncs to generate a very large number of fake junctions, so TopHat was spending a lot of time processing those, building a splice index for such a large junction db and searching against it etc.. I have no idea if that bug was related to the issue reported here, because I don't see why that wouldn't happen on the human reads as well.. Anyway, that bug was fixed in v2.1.1 (which was just released), so if you are still watching this topic I would be curious to know if that mouse data set would exhibit the same symptoms with the new version..