COMBINE-lab / SalmonTools

Useful tools for working with Salmon output
BSD 3-Clause "New" or "Revised" License

Rare Infinite Loop While Extracting Unmapped? #2

Open Miserlou opened 5 years ago

Miserlou commented 5 years ago

Perhaps you can shed some light on this: very occasionally, we see salmontools processes which seem to never terminate.

Here you can see some which have been operating for more than 4 hours and which are still consuming full CPU:

[Screenshot, 2018-09-23 1:37 pm: process list showing the long-running salmontools processes]

Here is the sample in question: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2432103

Do you have any idea what might be causing this?

Sorry that this isn't a more reproducible report!
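Until the cause is known, one way to keep a runaway process from holding a worker for hours is to run the extraction step under a wall-clock limit. The following is only a minimal sketch: `run_with_timeout` is a hypothetical helper name, and the command to run is passed in by the caller, so no particular salmontools arguments are assumed here.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` (a list of arguments) with a wall-clock limit.

    Returns (finished, returncode). If the limit is exceeded, the child
    is killed and (False, None) is returned.
    Hypothetical helper for illustration, not part of SalmonTools.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before
        # re-raising TimeoutExpired, so no process is left behind.
        return False, None
    return True, completed.returncode


# Example call with a stand-in command (replace with the real invocation):
# finished, rc = run_with_timeout([sys.executable, "-c", "pass"], 3600)
```

A job that hits the limit can then be logged together with its accession code, which would also give a list of samples that reproduce the problem.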

kurtwheeler commented 5 years ago

Here are the 10 accession codes with the longest-running jobs that completed successfully, along with the type of transcriptome index (short or long) used for each:

 accession_code |     index_type
----------------+---------------------
 SRR4423743     | TRANSCRIPTOME_SHORT
 SRR5342767     | TRANSCRIPTOME_SHORT
 SRR3666783     | TRANSCRIPTOME_SHORT
 SRR6494603     | TRANSCRIPTOME_SHORT
 SRR1524241     | TRANSCRIPTOME_LONG
 SRR4423749     | TRANSCRIPTOME_SHORT
 SRR6297667     | TRANSCRIPTOME_LONG
 SRR6877472     | TRANSCRIPTOME_LONG
 SRR4423750     | TRANSCRIPTOME_SHORT
 SRR6494612     | TRANSCRIPTOME_SHORT

These transcriptome indices can be downloaded here: https://s3.amazonaws.com/data-refinery-s3-transcriptome-index-circleci-prod/DANIO_RERIO_TRANSCRIPTOME_LONG.tar.gz

https://s3.amazonaws.com/data-refinery-s3-transcriptome-index-circleci-prod/DANIO_RERIO_TRANSCRIPTOME_SHORT.tar.gz

Miserlou commented 5 years ago

These samples are also derived from .sra files, extracted with fasterq-dump.

Could our issue have anything to do with the bug mentioned in this unmerged pull request?

cgreene commented 5 years ago

To complete @rob-p's request I am tagging @hiraksarkar

hiraksarkar commented 5 years ago

@cgreene thanks for tagging. Looking into the failure.

hiraksarkar commented 5 years ago

@Miserlou are you running the SalmonTools master branch?

Miserlou commented 5 years ago

Yes, we use git clone https://github.com/COMBINE-lab/SalmonTools.git and build that. Does your branch fix this error?

hiraksarkar commented 5 years ago

Hi @Miserlou, I forked a version and use that with some modifications, as I wanted the extracted files zipped, but the code is more or less the same. https://github.com/hiraksarkar/SalmonTools is the one I use. I generally use the FASTQ files from the EMBL sites. Can you give me a copy of your FASTQ files via Zenodo or some other storage? I would debug with that.

PS: If possible also the unmapped_names.txt
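Before sharing the files, a quick consistency check between the two inputs may help narrow things down. The sketch below rests on assumptions that should be verified against the real files: that each line of unmapped_names.txt starts with a read name (optionally followed by whitespace and a flag), that the FASTQ is uncompressed with four lines per record, and that the read name is the first whitespace-delimited token after `@`. The function names are hypothetical, and whether a name mismatch has anything to do with the hang is not established in this thread.

```python
def read_unmapped_names(path):
    """Collect read names from an unmapped-names file.

    Assumes the read name is the first whitespace-delimited token
    on each non-empty line (an assumption, not a documented format).
    """
    names = set()
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:
                names.add(fields[0])
    return names


def fastq_names(path):
    """Collect read names from an uncompressed 4-line-per-record FASTQ."""
    names = set()
    with open(path) as fh:
        for i, line in enumerate(fh):
            # Header lines are every 4th line and start with '@'.
            if i % 4 == 0 and line.startswith("@"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def missing_names(unmapped_path, fastq_path):
    """Names listed as unmapped that never occur in the FASTQ headers."""
    return read_unmapped_names(unmapped_path) - fastq_names(fastq_path)
```

Both sets are held in memory, so for large samples this is only practical on a machine with enough RAM or on a subsample of the reads. A non-empty result would suggest the two files disagree on read naming, which is worth mentioning when sending them.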

cgreene commented 5 years ago

@hiraksarkar : We would also prefer the files be zipped. The next step of our process is actually to zip them. So if you and we are the only people using this functionality of SalmonTools, maybe it would make sense to bring that functionality into the main repo also?

hiraksarkar commented 5 years ago

@cgreene, noted, I will create a pull request.

cgreene commented 5 years ago

Is it the same as #1?

hiraksarkar commented 5 years ago

Yup, I guess I created this before.

hiraksarkar commented 5 years ago

Tagging @rob-p as it seems I don't have write access to this repo.

Miserlou commented 5 years ago

Here are some example files which caused the problem: https://zenodo.org/record/1438469

Here's a screenshot of our Salmon pipeline, without and with Salmontools. Salmon is taking ~5-10 minutes, Salmontools is taking multiple hours:

[Screenshot: Salmon pipeline timings without and with Salmontools]