Open Miserlou opened 6 years ago
Here are the 10 accession codes with the longest-running jobs that completed successfully, along with the type (short or long) of transcriptome index used to run each:
accession_code | index_type
----------------+---------------------
SRR4423743 | TRANSCRIPTOME_SHORT
SRR5342767 | TRANSCRIPTOME_SHORT
SRR3666783 | TRANSCRIPTOME_SHORT
SRR6494603 | TRANSCRIPTOME_SHORT
SRR1524241 | TRANSCRIPTOME_LONG
SRR4423749 | TRANSCRIPTOME_SHORT
SRR6297667 | TRANSCRIPTOME_LONG
SRR6877472 | TRANSCRIPTOME_LONG
SRR4423750 | TRANSCRIPTOME_SHORT
SRR6494612 | TRANSCRIPTOME_SHORT
These transcriptome indices can be downloaded here: https://s3.amazonaws.com/data-refinery-s3-transcriptome-index-circleci-prod/DANIO_RERIO_TRANSCRIPTOME_LONG.tar.gz
These samples are also derived from .sra files, extracted with fasterq-dump.
Could our issue have anything to do with the bug mentioned in this unmerged pull request?
To complete @rob-p's request I am tagging @hiraksarkar
@cgreene thanks for tagging. Looking into the failure.
@Miserlou are you running the SalmonTools master branch?
Yes, we use git clone https://github.com/COMBINE-lab/SalmonTools.git and build that. Does your branch fix this error?
Hi @Miserlou, I forked a version and use that with some modifications, as I wanted the extracted files zipped, but the code is more or less the same. https://github.com/hiraksarkar/SalmonTools is the one I use. I generally use the fastq files from the EMBL sites. Can you give me a copy of your fastq files via Zenodo or some other storage? I will debug with those.
PS: If possible, please also share the unmapped_names.txt.
@hiraksarkar : We would also prefer the files be zipped. The next step of our process is actually to zip them. So if you and we are the only people using this functionality of SalmonTools, maybe it would make sense to bring that functionality into the main repo also?
@cgreene, noted, I will create a pull request.
Is it the same as #1?
Yup, I guess I created this before.
Tagging @rob-p as it seems I don't have write access to this repo.
Here are some example files which caused the problem: https://zenodo.org/record/1438469
Here's a screenshot of our Salmon pipeline, without and with SalmonTools. Salmon is taking ~5-10 minutes, while SalmonTools is taking multiple hours:
Perhaps you can shed some light on this: very occasionally, we see salmontools processes which seem to never terminate.
Here you can see some which have been operating for more than 4 hours and which are still consuming full CPU:
Here is the sample in question: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2432103
Do you have any idea what might be causing this?
Sorry that this isn't a more reproducible report!
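In the meantime, a possible stopgap for the never-terminating processes described above is to run each job under a hard time limit. This is only a minimal sketch, assuming Python's standard subprocess module; the helper name run_with_timeout and the four-hour limit are illustrative assumptions, not part of SalmonTools or of our pipeline, and the command to run is passed in by the caller rather than hard-coded here.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd (a list of arguments) under a wall-clock limit.

    Returns the process exit code, or None if the process exceeded
    timeout_seconds and was killed.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no runaway process is left consuming CPU.
        return None


if __name__ == "__main__":
    # Hypothetical usage: pass the real job command as arguments, e.g.
    #   python run_with_timeout.py <command> <arg> ...
    # The four-hour limit is an assumed value, chosen only for illustration.
    exit_code = run_with_timeout(sys.argv[1:], timeout_seconds=4 * 60 * 60)
    if exit_code is None:
        print("job exceeded the time limit and was killed", file=sys.stderr)
        sys.exit(1)
    sys.exit(exit_code)
```

This only bounds the damage from a hung job; it does not explain why the process fails to terminate, which is still the open question here.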