COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
776 stars 164 forks source link

Gibbs sampler hangs forever on one of my samples #111

Closed DarwinAwardWinner closed 7 years ago

DarwinAwardWinner commented 7 years ago

One of my samples seems to be giving Salmon's Gibbs sampler some trouble:

$ salmon quant --index /home/ryan/references/hg38/Salmon_index_hg38.analysisSet_knownGene --unmatedReads fastq_files/SRR2454059.fq.gz --threads 8 --libType ISF --seqBias --gcBias --useVBOpt --dumpEq --dumpEqWeights --geneMap /home/ryan/references/hg38/Salmon_index_hg38.analysisSet_knownGene/genemap.txt --output salmon_quant/hg38.analysisSet_knownGene/SRR2454059 --auxDir aux_info --numGibbsSamples 100
Version Info: This is the most recent **development version** of Salmon.
### salmon (mapping-based) v0.7.3
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { /home/ryan/references/hg38/Salmon_index_hg38.analysisSet_knownGene }
### [ unmatedReads ] => { fastq_files/SRR2454059.fq.gz }
### [ threads ] => { 8 }
### [ libType ] => { ISF }
### [ seqBias ] => { }
### [ gcBias ] => { }
### [ useVBOpt ] => { }
### [ dumpEq ] => { }
### [ dumpEqWeights ] => { }
### [ geneMap ] => { /home/ryan/references/hg38/Salmon_index_hg38.analysisSet_knownGene/genemap.txt }
### [ output ] => { salmon_quant/hg38.analysisSet_knownGene/SRR2454059 }
### [ auxDir ] => { aux_info }
### [ numGibbsSamples ] => { 100 }
Logs will be written to salmon_quant/hg38.analysisSet_knownGene/SRR2454059/logs
[2016-12-13 12:44:39.271] [jointLog] [info] parsing read library format
[2016-12-13 12:44:39.271] [jointLog] [info] There is 1 library.
[2016-12-13 12:44:39.836] [jointLog] [info] Loading Quasi index
[2016-12-13 12:44:39.836] [jointLog] [info] Loading 32-bit quasi index
[2016-12-13 12:44:39.836] [stderrLog] [info] Loading Suffix Array 
[2016-12-13 12:44:43.439] [stderrLog] [info] Loading Transcript Info 
[2016-12-13 12:44:44.355] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-12-13 12:44:44.613] [stderrLog] [info] There were 182608 set bits in the bit array
[2016-12-13 12:44:44.629] [stderrLog] [info] Computing transcript lengths
[2016-12-13 12:44:44.629] [stderrLog] [info] Waiting to finish loading hash
[2016-12-13 12:44:44.629] [stderrLog] [info] Done loading index
[2016-12-13 12:44:44.629] [jointLog] [info] done
[2016-12-13 12:44:44.629] [jointLog] [info] Index contained 182608 targets
[2016-12-13 12:44:49.583] [jointLog] [warning] Fragment GC bias correction is currently *experimental* in 
single-end libraries.  Please use this option with caution.

processed 19000000 fragments
hits: 65897660; hits per frag:  3.46963

[2016-12-13 12:46:51.776] [jointLog] [info] Computed 137534 rich equivalence classes for further processin
g
[2016-12-13 12:46:51.776] [jointLog] [info] Counted 16265961 total reads in the equivalence classes 

[2016-12-13 12:46:51.786] [jointLog] [info] Mapping rate = 83.509%

[2016-12-13 12:46:51.786] [jointLog] [info] finished quantifyLibrary()
[2016-12-13 12:46:51.786] [jointLog] [info] Starting optimizer
[2016-12-13 12:46:51.867] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2016-12-13 12:46:51.877] [jointLog] [info] iteration = 0 | max rel diff. = 299.948
[2016-12-13 12:46:51.959] [jointLog] [info] iteration 11, adjusting effective lengths to account for biase
s
[2016-12-13 12:52:42.161] [jointLog] [info] Computed expected counts (for bias correction)
[2016-12-13 12:52:42.162] [jointLog] [info] processed bias for 0.0% of the transcripts
[2016-12-13 12:53:40.240] [jointLog] [info] processed bias for 10.0% of the transcripts
[2016-12-13 12:54:40.032] [jointLog] [info] processed bias for 20.0% of the transcripts
[2016-12-13 12:55:40.277] [jointLog] [info] processed bias for 30.0% of the transcripts
[2016-12-13 12:56:41.225] [jointLog] [info] processed bias for 40.0% of the transcripts
[2016-12-13 12:57:41.888] [jointLog] [info] processed bias for 50.0% of the transcripts
[2016-12-13 12:58:37.148] [jointLog] [info] processed bias for 100.0% of the transcripts
[2016-12-13 12:58:37.809] [jointLog] [info] iteration = 100 | max rel diff. = 0.272558
[2016-12-13 12:58:38.544] [jointLog] [info] iteration = 200 | max rel diff. = 0.213603
[2016-12-13 12:58:39.275] [jointLog] [info] iteration = 300 | max rel diff. = 0.254034
[2016-12-13 12:58:40.017] [jointLog] [info] iteration = 400 | max rel diff. = 0.120455
[2016-12-13 12:58:40.754] [jointLog] [info] iteration = 500 | max rel diff. = 0.020828
[2016-12-13 12:58:41.487] [jointLog] [info] iteration = 600 | max rel diff. = 0.0153173
[2016-12-13 12:58:42.222] [jointLog] [info] iteration = 700 | max rel diff. = 0.0746574
[2016-12-13 12:58:42.359] [jointLog] [info] iteration = 720 | max rel diff. = 0.00908527
[2016-12-13 12:58:42.368] [jointLog] [info] Finished optimizer
[2016-12-13 12:58:42.368] [jointLog] [info] writing output 

[2016-12-13 12:58:42.943] [jointLog] [info] Starting Gibbs Sampler
  0% [>                                                    ] ETA > 1 week       

It remained stuck at 0% until I terminated it with Control+C around 2016-12-13 14:21, nearly 1.5 hours after the Gibbs sample started. Running it again on the same sample, I see the same behavior. I'm currently re-running with different options to see if that makes a difference.

DarwinAwardWinner commented 7 years ago

Oh, and I'm using the dev version of salmon from commit 1395a0ae394c924b6e206236b25efad52984b245 "Replace lock with try_lock in FragmentLengthDistribution".

DarwinAwardWinner commented 7 years ago

When I take away the bias options and run:

salmon quant \
> --index /home/ryan/references/hg38/Salmon_index_hg38.analysisSet_knownGene \
> --unmatedReads fastq_files/SRR2454059.fq.gz \
> --threads 8 \
> --libType ISF \
> --useVBOpt \
> --geneMap /home/ryan/references/hg38/Salmon_index_hg38.analysisSet_knownGene/genemap.txt \
> --output salmon_quant/hg38.analysisSet_knownGene/SRR2454059 \
> --auxDir aux_info \
> --numGibbsSamples 100

It finishes successfully, but during Gibbs sampling, I get endless lines of:

minEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight errorminEQClassWeight
rob-p commented 7 years ago

Hi @DarwinAwardWinner,

Great error report! Thanks for the details. The minEQClassWeight error is actually very interesting (i.e. the fact that it's hitting that code path during the Gibbs sampling). Would you be able to share the data set (or a part of the data set) that reproduces the issue? I'd be happy to take a look and see if I can figure out what's going on.

DarwinAwardWinner commented 7 years ago

Yes, the data is all public, so I have no problems sharing it, except finding a place to host it publically. I can't just link to the public source because there's some preprocessing steps. Both the FASTQ and the Salmon index are over 1.5 GB, so they're too big for my free Dropbox account, and my institute's network is tightly firewalled, so I can't just server if via a local server. Do you have any suggestions?

DarwinAwardWinner commented 7 years ago

I've just used https://transfer.pcloud.com/ to share the files with you. I think you should get an email with the download link. I've included the FASTQ, the index (entire folder in a tar.gz) and the compiled-from-source salmon executable that I used.

rob-p commented 7 years ago

Thanks! I got the e-mail. I'll take a look as soon as I have a chance.

DarwinAwardWinner commented 7 years ago

Another nugget of information: when I run the same sample using the Ensembl annotation, it works fine.

rob-p commented 7 years ago

So I got the data and am trying to repro the issue now (thanks!). Quick question. I noticed in your example command you have --libType ISF. However, you have single-end reads, so the appropriate library type would be SF (i.e., they can't be "inward" facing reads, b/c there is no mate for each read). When I run your command as is, but replace ISF as SF, my run completes successfully, and I don't get any errorminEQClassWeight output. Could you let me know if this makes any difference for you (also, sorry that, apparently, we're not outputting a useful error message when one passes in a paired-end library type with single-end data).

edit: Actually, it's even stranger. I noticed that in your command the library type comes after the reads to which it refers, but in this case, Salmon will not apply that library type to those reads (which explains why you're not getting a warning message). The restriction that the --libType flag comes before the reads it describes is buried in the docs, but I definitely need to make that clearer. Anyway, the point is that, in this case, Salmon should apply the "default" single-end library type (i.e., U) to your reads. So, presumably, that was what was happening when you saw the strange behavior during Gibbs sampling (and is also what was happening when my Gibbs sampling run completed successfully).

DarwinAwardWinner commented 7 years ago

Well, if they were paired end, they would be ISF. I assumed that Salmon would simply ignore the pairing information if you fed it single-end reads. (I think this is how some other tools work, maybe?) I'll retry with fixed library specifications and see if that fixes things.

Edit: I just noticed your edit. I'll reply again in a minute.

DarwinAwardWinner commented 7 years ago

So, does Salmon run into issues when it runs in unstranded mode on stranded data? Was that the problem?

rob-p commented 7 years ago

No, that shouldn't cause a problem. When I ran your command (including the ISF and placing the --libType after the set of reads), my run still completed successfully (and didn't produce any warnings during Gibbs sampling). Salmon's behavior when running in unstranded mode on stranded data is simply to map the reads in the orientation they match, and to report on the console (and in the log) that there was a mapping bias (i.e. that the data look stranded). Specifically, here is what I get when I run (a close approximation of) your command.

$salmon quant --index Salmon_index_hg38.analysisSet_knownGene --unmatedReads SRR2454059.fq.gz --libType ISF --useVBOpt --output test_quant --
numGibbsSamples 100 --threads 16
Version Info: This is the most recent **development version** of Salmon.
### salmon (mapping-based) v0.7.3
### [ program ] => salmon
### [ command ] => quant
### [ index ] => { Salmon_index_hg38.analysisSet_knownGene }
### [ unmatedReads ] => { SRR2454059.fq.gz }
### [ libType ] => { ISF }
### [ useVBOpt ] => { }
### [ output ] => { test_quant }
### [ numGibbsSamples ] => { 100 }
### [ threads ] => { 16 }
Logs will be written to test_quant/logs
[2016-12-13 22:38:54.413] [jointLog] [info] parsing read library format
[2016-12-13 22:38:54.413] [jointLog] [info] There is 1 library.
[2016-12-13 22:38:56.240] [stderrLog] [info] Loading Suffix Array
[2016-12-13 22:38:56.240] [jointLog] [info] Loading Quasi index
[2016-12-13 22:38:56.240] [jointLog] [info] Loading 32-bit quasi index
[2016-12-13 22:39:01.268] [stderrLog] [info] Loading Transcript Info
[2016-12-13 22:39:02.630] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-12-13 22:39:03.041] [stderrLog] [info] There were 182608 set bits in the bit array
[2016-12-13 22:39:03.159] [stderrLog] [info] Computing transcript lengths
[2016-12-13 22:39:03.160] [stderrLog] [info] Waiting to finish loading hash
[2016-12-13 22:39:07.653] [stderrLog] [info] Done loading index
[2016-12-13 22:39:07.653] [jointLog] [info] done
[2016-12-13 22:39:07.653] [jointLog] [info] Index contained 182608 targets

processed 19000000 fragments
hits: 65897209; hits per frag:  3.47349

[2016-12-13 22:40:22.572] [jointLog] [info] Computed 137534 rich equivalence classes for further processing
[2016-12-13 22:40:22.572] [jointLog] [info] Counted 16265961 total reads in the equivalence classes
[2016-12-13 22:40:22.618] [jointLog] [info] Mapping rate = 83.509%

[2016-12-13 22:40:22.618] [jointLog] [info] finished quantifyLibrary()
[2016-12-13 22:40:22.619] [jointLog] [info] Starting optimizer
[2016-12-13 22:40:22.904] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2016-12-13 22:40:22.911] [jointLog] [info] iteration = 0 | max rel diff. = 299.976
[2016-12-13 22:40:23.620] [jointLog] [info] iteration = 100 | max rel diff. = 0.121769
[2016-12-13 22:40:24.367] [jointLog] [info] iteration = 200 | max rel diff. = 0.103587
[2016-12-13 22:40:25.102] [jointLog] [info] iteration = 300 | max rel diff. = 0.144748
[2016-12-13 22:40:25.815] [jointLog] [info] iteration = 400 | max rel diff. = 0.231057
[2016-12-13 22:40:26.505] [jointLog] [info] iteration = 500 | max rel diff. = 0.0156154
[2016-12-13 22:40:27.020] [jointLog] [info] iteration = 570 | max rel diff. = 0.00955966
[2016-12-13 22:40:27.052] [jointLog] [info] Finished optimizer
[2016-12-13 22:40:27.052] [jointLog] [info] writing output

[2016-12-13 22:40:27.523] [jointLog] [info] Starting Gibbs Sampler 1 week
100% [=====================================================] in 44s
[2016-12-13 22:41:12.189] [jointLog] [info] Finished Gibbs Sampler
[2016-12-13 22:41:12.190] [jointLog] [warning] NOTE: Read Lib [SRR2454059.fq.gz] :

Detected a *potential* strand bias > 1% in an unstranded protocol check the file: test_quant/lib_format_counts.json for details

edit: One note is that I was using my build of the same commit number. I'm running the executable you compiled now (since I had to put the appropriate libraries in the LD_LIBRARY_PATH to get it to be happy).

DarwinAwardWinner commented 7 years ago

Ok, that's the command without the bias options. It only hangs on the Gibbs sampling when you use the bias parameters as well.

rob-p commented 7 years ago

Yup, but I'm just trying to work backward. That is, first figure out what's going on with the minEqWeight error, and then figure out if that's related to what was causing it to hang. The issue right now is just that I can't seem to easily repro either Gibbs error. I just ran using the binary you compiled with the following command, and got this output:

$LD_LIBRARY_PATH=~/SoftwareStaging/salmon/lib:$LD_LIBRARY_PATH ./salmon quant --index Salmon_index_hg38.analysisSet_knownGene --unmatedReads SRR2454059.fq.gz --libType ISF --us
eVBOpt --output test_quant --numGibbsSamples 100 --threads 16
Version Info: This is the most recent **development version** of Salmon.
### salmon (mapping-based) v0.7.3
### [ program ] => salmon
### [ command ] => quant
### [ index ] => { Salmon_index_hg38.analysisSet_knownGene }
### [ unmatedReads ] => { SRR2454059.fq.gz }
### [ libType ] => { ISF }
### [ useVBOpt ] => { }
### [ output ] => { test_quant }
### [ numGibbsSamples ] => { 100 }
### [ threads ] => { 16 }
Logs will be written to test_quant/logs
[2016-12-13 22:44:07.409] [jointLog] [info] parsing read library format
[2016-12-13 22:44:07.409] [jointLog] [info] There is 1 library.
[2016-12-13 22:44:09.318] [jointLog] [info] Loading Quasi index
[2016-12-13 22:44:09.318] [jointLog] [info] Loading 32-bit quasi index
[2016-12-13 22:44:09.318] [stderrLog] [info] Loading Suffix Array
[2016-12-13 22:44:15.002] [stderrLog] [info] Loading Transcript Info
[2016-12-13 22:44:16.278] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-12-13 22:44:16.625] [stderrLog] [info] There were 182608 set bits in the bit array
[2016-12-13 22:44:16.680] [stderrLog] [info] Computing transcript lengths
[2016-12-13 22:44:16.681] [stderrLog] [info] Waiting to finish loading hash
[2016-12-13 22:44:20.485] [stderrLog] [info] Done loading index
[2016-12-13 22:44:20.485] [jointLog] [info] done
[2016-12-13 22:44:20.485] [jointLog] [info] Index contained 182608 targets

processed 19000001 fragments
hits: 65897764; hits per frag:  3.48152

[2016-12-13 22:45:33.192] [jointLog] [info] Computed 137534 rich equivalence classes for further processing
[2016-12-13 22:45:33.192] [jointLog] [info] Counted 16265961 total reads in the equivalence classes
[2016-12-13 22:45:33.233] [jointLog] [info] Mapping rate = 83.509%

[2016-12-13 22:45:33.233] [jointLog] [info] finished quantifyLibrary()
[2016-12-13 22:45:33.234] [jointLog] [info] Starting optimizer
[2016-12-13 22:45:33.516] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2016-12-13 22:45:33.523] [jointLog] [info] iteration = 0 | max rel diff. = 299.95
[2016-12-13 22:45:34.217] [jointLog] [info] iteration = 100 | max rel diff. = 0.122252
[2016-12-13 22:45:34.912] [jointLog] [info] iteration = 200 | max rel diff. = 0.102915
[2016-12-13 22:45:35.612] [jointLog] [info] iteration = 300 | max rel diff. = 0.145792
[2016-12-13 22:45:36.357] [jointLog] [info] iteration = 400 | max rel diff. = 0.217489
[2016-12-13 22:45:37.055] [jointLog] [info] iteration = 500 | max rel diff. = 0.0159298
[2016-12-13 22:45:37.628] [jointLog] [info] iteration = 569 | max rel diff. = 0.00958049
[2016-12-13 22:45:37.653] [jointLog] [info] Finished optimizer
[2016-12-13 22:45:37.653] [jointLog] [info] writing output

[2016-12-13 22:45:38.213] [jointLog] [info] Starting Gibbs Sampler
100% [=====================================================] in 31s
[2016-12-13 22:46:10.451] [jointLog] [info] Finished Gibbs Sampler
[2016-12-13 22:46:10.451] [jointLog] [warning] NOTE: Read Lib [SRR2454059.fq.gz] :

Detected a *potential* strand bias > 1% in an unstranded protocol check the file: test_quant/lib_format_counts.json for details

i.e. I don't seem to get the complaints from the Gibbs sampler and all output files look to be created properly. I'm trying to figure out what could be different.

rob-p commented 7 years ago

We seem to get the same mapping rate etc., so I'm assuming something different is happening between the end of the normal optimization and the Gibbs sampling between your run and my run.

DarwinAwardWinner commented 7 years ago

I'm on Ubuntu 16.04 64-bit, in case that matters.

rob-p commented 7 years ago

I really hope it doesn't 😟 , but the testing I've been doing is on 64-bit 14.10. I have a 16.04 box I can try it on as well.

running on the 16.10 machine now to see if I can repro.

rob-p commented 7 years ago

Damn; I get the same behavior on 16.10 (actually, my box is 16.04, but I'm skeptical that this specific version issue is the cause of the behavior) as I get on 14.10 (using the executable you provided). It runs to completion and modulo our less-than-ideal handling of the --libType flag coming after the --unmatedReads flag, it completes without the Gibbs warning. When I run with --seqBias and --gcBias, I get the same behavior (it runs and finishes w/o hanging or outputting the error messages). I'm going to take a look at the relevant code path to see if anything stands out to me.

DarwinAwardWinner commented 7 years ago

I just got the hanging behavior again with a different sample and annotation (Ensembl transcripts instead of UCSC knownGene). I won't be able to send it to you until tomorrow though, since I'm at home working over ssh right now. But in the meantime, I'll run it again to make sure it's reproducible.

rob-p commented 7 years ago

Great; hopefully I'll be able to repro the issue with the other sample. No rush, as I'll be finishing putting together the final for my class tomorrow morning (and so will be testing the sample between writing exam questions ;P). I just hope this doesn't turn out to be an environment / machine-specific behavior (those are the worst bugs to track down and fix).

DarwinAwardWinner commented 7 years ago

Ok, the worst possible thing happened: I ran it a second time and it worked just fine. Does Salmon use any random number generation, and if so, is there an option to set the seed to get a deterministic run? Or is it potentially non-deterministic due to multi-threading?

One additional threading-related tidbit: I believe that when it does get stuck, a single CPU core is at 100% and the others are all unused.

rob-p commented 7 years ago

Damn. There are random numbers, but even setting the seed, the multi-threaded nature of almost all steps leads to non-deterministic behavior. The behavior you describe sounds like some sort of race condition that gets triggered depending on when threads get to different parts of the code. I couldn't get the hanging with the other dataset over multiple (~10) runs. So even if it's completely non-deterministic you seem to be getting it with higher frequency in your system. Is it always in the Gibbs phase? One question / thought, did salmon fetch and build the Intel TBB dependency, or are you using a system version?

DarwinAwardWinner commented 7 years ago

I think it probably fetched and built it? Here's what CmakeCache.txt says:

$ grep TBB CMakeCache.txt
//TBB debug library directory
TBB_DEBUG_LIBRARY_DIRS:PATH=
TBB_INCLUDE_DIR:PATH=/home/ryan/src/salmon/external/install/include
//TBB include directory
TBB_INCLUDE_DIRS:PATH=/home/ryan/src/salmon/external/install/include
TBB_LIBRARY:FILEPATH=/home/ryan/src/salmon/external/install/lib/libtbb.so
TBB_LIBRARY_DEBUG:FILEPATH=TBB_LIBRARY_DEBUG-NOTFOUND
//TBB library directory
TBB_LIBRARY_DIRS:PATH=/home/ryan/src/salmon/external/install/lib
TBB_MALLOC_LIBRARY:FILEPATH=/home/ryan/src/salmon/external/install/lib/libtbbmalloc.so
TBB_MALLOC_LIBRARY_DEBUG:FILEPATH=TBB_MALLOC_LIBRARY_DEBUG-NOTFOUND
TBB_RECONFIGURE:UNINITIALIZED=TRUE
TBB_WILL_RECONFIGURE:UNINITIALIZED=FALSE
//ADVANCED property for variable: TBB_DEBUG_LIBRARY_DIRS
TBB_DEBUG_LIBRARY_DIRS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: TBB_INCLUDE_DIR
TBB_INCLUDE_DIR-ADVANCED:INTERNAL=1
//ADVANCED property for variable: TBB_INCLUDE_DIRS
TBB_INCLUDE_DIRS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: TBB_LIBRARY
TBB_LIBRARY-ADVANCED:INTERNAL=1
//ADVANCED property for variable: TBB_LIBRARY_DEBUG
TBB_LIBRARY_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: TBB_LIBRARY_DIRS
TBB_LIBRARY_DIRS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: TBB_MALLOC_LIBRARY
TBB_MALLOC_LIBRARY-ADVANCED:INTERNAL=1
//ADVANCED property for variable: TBB_MALLOC_LIBRARY_DEBUG
TBB_MALLOC_LIBRARY_DEBUG-ADVANCED:INTERNAL=1

Also, here's the output of every hardware/OS reporting command I can think of:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2623 v3 @ 3.00GHz
stepping        : 2
microcode       : 0x36
cpu MHz         : 3300.000
cache size      : 10240 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqd
q dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4
_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_sh
adow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm x
saveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs            :
bogomips        : 5985.57
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
...
[And 7 more cores]
$ uname -a
Linux salomon24 4.4.0-51-generic #72-Ubuntu SMP Thu Nov 24 18:29:54 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -d
Description:    Ubuntu 16.04.1 LTS

lshw output: salomon24-lshw.txt

DarwinAwardWinner commented 7 years ago

Ok, the hang has happened for at least 3 samples, but I've only seen it happen a second time for the first sample (the one I reported originally). It's definitely only running a single core at 100% usage when the hang happens, so presumably it's either hanging in a non-multithreaded part of the code, or all the other threads are just waiting for input and taking up no CPU.

rob-p commented 7 years ago

Ok, nothing about your hardware, as reported, above, raises any ideas. I'll keep looking at the code that executes between the end of VBEM and Gibbs. However, one thing we might try to narrow down the offending region is running salmon under gdb (I can do this as well, but it's not useful on my end yet since I can't get the hanging behavior). If you run salmon under gdb and then issue a SIGINT (i.e., press ctrl+C) when the hanging behavior occurs, this will return you to gdb. At that point, you can issue the backtrace command (bt) at the gdb prompt, and it will tell us where in the code the thread was at the point the process was interrupted. Hopefully, this can narrow things down to an offending function.

DarwinAwardWinner commented 7 years ago

Ok, I'll run it several times under gdb and see if I can get it to crash.

DarwinAwardWinner commented 7 years ago

I haven't got it to reproduce the error yet, but I did see the error in one more run outside of gdb.

On the plus side, I did find this: http://stackoverflow.com/questions/6545763/how-can-i-rerun-a-program-with-gdb-until-a-segmentation-fault-occurs. So now I can have it keep trying forever.

rob-p commented 7 years ago

Man, I was afraid it would be hard to repro inside of GDB. Sorry about that! The solution you found does seem generally useful though. Another possibility is to try to run it outside GDB, but to use strace (see the first answer here). This would let the program run outside of GDB (where you can repro the issue), but still get access to the stack at the point of the hanging.

DarwinAwardWinner commented 7 years ago

Ok, new plan:

while true; do
    strace -s 99 -f -o salmon-strace.log \
           salmon quant \
           --index /home/ryan/references/hg38/Salmon_index_$REF \
           --libType SR --unmatedReads fastq_files/$SAMPLE.fq.gz\
           --threads 8 --seqBias --gcBias --useVBOpt --dumpEq --dumpEqWeights \
           --geneMap /home/ryan/references/hg38/Salmon_index_$REF/genemap.txt \
           --output salmon_temp/REF/$SAMPLE \
           --auxDir aux_info \
           --numGibbsSamples 100
done
DarwinAwardWinner commented 7 years ago

Hmm... That post also suggests attaching gdb to a process after it's running. Would that also be a viable option? Wait until it hangs and then attach gdb?

rob-p commented 7 years ago

I've not tried that before, but I presume it would work (probably in a similar way to strace). Either seem like good options at this point!

DarwinAwardWinner commented 7 years ago

Ok, I have an strace file from a hung run. I couldn't attach with gdb because the default system security settings prohibit it. I'll change the settings and try to get a gdb backtrace, but in the meantime, here's the strace log. Note that it was hung for about 2 hours before I was able to collect the log.

https://www.dropbox.com/s/zn7qzo55wtcrbyg/salmon-strace.log.gz?dl=0

rob-p commented 7 years ago

30-some MB to 7.5GB ... that's some damn good compression. While strace won't let me see the offending function, the log definitely indicates that there's just a bunch of threads waiting. This must mean that there's the potential for a deadlock condition somewhere. Is it the case that this happened once the Gibbs sampler had started? That's the code I've been looking in, but all of the threading there is handled by Intel's TBB library (apart from a call to actually dump each recorded Gibbs sample, which uses a std::lock_guard, which, hopefully, isn't broken).

DarwinAwardWinner commented 7 years ago

Oh wow, I didn't realize it was 7.5 GB. I work on a ZFS filesystem with compression, so it was only 350MB on my disk. Anyway, this is what the last output on screen looks like when it hangs:

[2016-12-13 12:58:42.943] [jointLog] [info] Starting Gibbs Sampler
  0% [>                                                    ] ETA > 1 week       

So I don't know if the Gibbs sampler has actually started, but it does claim to have started.

rob-p commented 7 years ago

Yup, the execution is definitely inside the Gibbs sampler at that point, since that's the code that sets up the progress bar etc. So, I'll focus my attention there until (if/when) we can get a specific offending function name. Thanks so much for all your help tracking this down so far; I really appreciate it!

DarwinAwardWinner commented 7 years ago

Success???

$ gdb -ex "attach $(pgrep salmon | head -1)" -ex bt -ex detach -ex quit
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 29332
[New LWP 29334]
[New LWP 29335]
[New LWP 29336]
[New LWP 21224]
[New LWP 21225]
[New LWP 21226]
[New LWP 21227]
[New LWP 21228]
[New LWP 21229]
[New LWP 21230]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fcb8cf73789 in __ieee754_log_avx (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/e_log.c:193
193 ../sysdeps/ieee754/dbl-64/e_log.c: No such file or directory.
#0  0x00007fcb8cf73789 in __ieee754_log_avx (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/e_log.c:193
#1  0x0000000000637ccc in double std::gamma_distribution<double>::operator()<pcg_detail::engine<unsigned int, unsigned long, pcg_detail::xsh_rr_mixin<unsigned int, unsigned long>, true, pcg_detail::unique_stream<unsigned long>, pcg_detail::default_multiplier<unsigned long> > >(pcg_detail::engine<unsigned int, unsigned long, pcg_detail::xsh_rr_mixin<unsigned int, unsigned long>, true, pcg_detail::unique_stream<unsigned long>, pcg_detail::default_multiplier<unsigned long> >&, std::gamma_distribution<double>::param_type const&) ()
#2  0x0000000000634b8d in tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, sampleRoundNonCollapsedMultithreaded_(std::vector<std::pair<TranscriptGroup const, TGValue>, std::allocator<std::pair<TranscriptGroup const, TGValue> > >&, std::vector<bool, std::allocator<bool> >&, std::vector<unsigned long, std::allocator<unsigned long> >&, std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> >&, Eigen::Matrix<double, -1, 1, 0, -1, 1>&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> >&, std::vector<unsigned int, std::allocator<unsigned int> >&)::{lambda(tbb::blocked_range<unsigned long> const&)#2}, tbb::auto_partitioner const>::execute() ()
#3  0x00007fcb8d667fdd in ?? () from /usr/lib/x86_64-linux-gnu/libtbb.so.2
#4  0x00007fcb8d6651f0 in ?? () from /usr/lib/x86_64-linux-gnu/libtbb.so.2
#5  0x0000000000632eba in sampleRoundNonCollapsedMultithreaded_(std::vector<std::pair<TranscriptGroup const, TGValue>, std::allocator<std::pair<TranscriptGroup const, TGValue> > >&, std::vector<bool, std::allocator<bool> >&, std::vector<unsigned long, std::allocator<unsigned long> >&, std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> >&, Eigen::Matrix<double, -1, 1, 0, -1, 1>&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> >&, std::vector<unsigned int, std::allocator<unsigned int> >&) ()
#6  0x000000000063936f in bool CollapsedGibbsSampler::sample<ReadExperiment>(ReadExperiment&, SalmonOpts&, std::function<bool (std::vector<double, std::allocator<double> > const&)>&, unsigned int) ()
#7  0x000000000065d783 in salmonQuantify(int, char**) ()
#8  0x000000000057dbcf in main ()
Detaching from program: /home/ryan/bin/salmon, process 29332
DarwinAwardWinner commented 7 years ago

And here's the log of the Salmon run that I got the backtrace from:

Version Info: This is the most recent **development version** of Salmon.
### salmon (mapping-based) v0.7.3
### [ program ] => salmon
### [ command ] => quant
### [ index ] => { /home/ryan/references/hg38/Salmon_index_hg38.analysisSet_knownGene }
### [ libType ] => { SR }
### [ unmatedReads ] => { fastq_files/SRR2454069.fq.gz }
### [ threads ] => { 8 }
### [ seqBias ] => { }
### [ gcBias ] => { }
### [ useVBOpt ] => { }
### [ dumpEq ] => { }
### [ dumpEqWeights ] => { }
### [ geneMap ] => { /home/ryan/references/hg38/Salmon_index_hg38.analysisSet_knownGene/genemap.txt }
### [ output ] => { salmon_temp/REF/SRR2454069 }
### [ auxDir ] => { aux_info }
### [ numGibbsSamples ] => { 10 }
Logs will be written to salmon_temp/REF/SRR2454069/logs
[2016-12-15 15:58:50.157] [jointLog] [info] parsing read library format
[2016-12-15 15:58:50.157] [jointLog] [info] There is 1 library.
[2016-12-15 15:58:50.189] [jointLog] [info] Loading Quasi index
[2016-12-15 15:58:50.189] [jointLog] [info] Loading 32-bit quasi index
[2016-12-15 15:58:50.189] [stderrLog] [info] Loading Suffix Array
[2016-12-15 15:58:50.513] [stderrLog] [info] Loading Transcript Info
[2016-12-15 15:58:50.599] [stderrLog] [info] Loading Rank-Select Bit Array
[2016-12-15 15:58:50.661] [stderrLog] [info] There were 182608 set bits in the bit array
[2016-12-15 15:58:50.677] [stderrLog] [info] Computing transcript lengths
[2016-12-15 15:58:50.677] [stderrLog] [info] Waiting to finish loading hash
[2016-12-15 15:58:50.677] [stderrLog] [info] Done loading index
[2016-12-15 15:58:50.677] [jointLog] [info] done
[2016-12-15 15:58:50.677] [jointLog] [info] Index contained 182608 targets
[2016-12-15 15:58:51.587] [jointLog] [warning] Fragment GC bias correction is currently *experimental* in single-end libraries.  Please use this option with caution.

processed 16500000 fragments
hits: 44017772; hits per frag:  2.67057

[2016-12-15 16:01:44.937] [jointLog] [info] Computed 119318 rich equivalence classes for further processing
[2016-12-15 16:01:44.937] [jointLog] [info] Counted 12227080 total reads in the equivalence classes
[2016-12-15 16:01:44.948] [jointLog] [info] Mapping rate = 72.5194%

[2016-12-15 16:01:44.948] [jointLog] [info] finished quantifyLibrary()
[2016-12-15 16:01:44.949] [jointLog] [info] Starting optimizer
[2016-12-15 16:01:45.059] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2016-12-15 16:01:45.075] [jointLog] [info] iteration = 0 | max rel diff. = 261.892
[2016-12-15 16:01:45.248] [jointLog] [info] iteration 11, adjusting effective lengths to account for biases
[2016-12-15 16:11:15.738] [jointLog] [info] Computed expected counts (for bias correction)
[2016-12-15 16:11:15.739] [jointLog] [info] processed bias for 0.0% of the transcripts
[2016-12-15 16:13:07.074] [jointLog] [info] processed bias for 10.0% of the transcripts
[2016-12-15 16:14:57.019] [jointLog] [info] processed bias for 20.0% of the transcripts
[2016-12-15 16:16:40.365] [jointLog] [info] processed bias for 30.0% of the transcripts
[2016-12-15 16:18:25.798] [jointLog] [info] processed bias for 40.0% of the transcripts
[2016-12-15 16:20:13.944] [jointLog] [info] processed bias for 50.0% of the transcripts
[2016-12-15 16:21:52.350] [jointLog] [info] processed bias for 100.0% of the transcripts
[2016-12-15 16:21:53.854] [jointLog] [info] iteration = 100 | max rel diff. = 0.250674
[2016-12-15 16:22:12.498] [jointLog] [info] iteration = 200 | max rel diff. = 0.251947
[2016-12-15 16:22:14.261] [jointLog] [info] iteration = 300 | max rel diff. = 0.377281
[2016-12-15 16:22:15.769] [jointLog] [info] iteration = 400 | max rel diff. = 0.121203
[2016-12-15 16:22:17.427] [jointLog] [info] iteration = 500 | max rel diff. = 0.0203027
[2016-12-15 16:22:18.761] [jointLog] [info] iteration = 583 | max rel diff. = 0.00813273
[2016-12-15 16:22:18.773] [jointLog] [info] Finished optimizer
[2016-12-15 16:22:18.773] [jointLog] [info] writing output

[2016-12-15 16:22:19.744] [jointLog] [info] Starting Gibbs Sampler 1 week
DarwinAwardWinner commented 7 years ago

Also, I'm leaving this process running, so I can reattach to it and do more things to it.

rob-p commented 7 years ago

Whoaa! Ok. This is something. It's hanging in a place where we have no client (i.e., Salmon) threading code. It's inside an Intel TBB for loop. But, according to gdb, the TBB libraries that are executing at this point are system ones (/usr/lib/x86_64-linux-gnu/libtbb.so.2). Also interesting is that the full backtrace is into a call to the logarithm function, which shouldn't block. Could you spin up another Salmon process but with Salmon's copy of TBB in the LD_LIBRARY_PATH before the system one, just to make sure that's not the issue? It's likely not, but it's worth being sure. Whatever ends up causing this, I have a feeling it will be something very strange.

DarwinAwardWinner commented 7 years ago

Hmm. That's weird. It looks like the build process built against its own copy of TBB, but then at run time, it uses the system copy. All I did was copy the salmon executable into ~/bin:

$ which salmon
/home/ryan/bin/salmon
$ ldd `which salmon`
        linux-vdso.so.1 =>  (0x00007ffd8d9c0000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f20010a9000)
        libtbb.so.2 => /usr/lib/x86_64-linux-gnu/libtbb.so.2 (0x00007f2000e6b000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f2000c49000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2000a41000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2000737000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2000521000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2000158000)
        /lib64/ld-linux-x86-64.so.2 (0x000055aeef1e1000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1ffff53000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1fffbd1000)
$ ldd ~/src/salmon/build/src/salmon
        linux-vdso.so.1 =>  (0x00007fffdc2d7000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f990745d000)
        libtbb.so.2 => /home/ryan/src/salmon/build/src/../../external/install/lib/libtbb.so.2 (0x00007f990722f000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f990700d000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f9906e05000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9906afb000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f99068e5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f990651c000)
        /lib64/ld-linux-x86-64.so.2 (0x0000564b94030000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9906317000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9905f95000)
$ md5sum `which salmon`  ~/src/salmon/build/src/salmon
90831f998ff897969da448043c590f61  /home/ryan/bin/salmon
90831f998ff897969da448043c590f61  /home/ryan/src/salmon/build/src/salmon

I don't know enough about how dynamic linking works to explain this.

Anyway, I did export LD_LIBRARY_PATH=/home/ryan/src/salmon/external/install/lib and now I'm running while true; do salmon ...; done again.

DarwinAwardWinner commented 7 years ago

Well, I was just in the middle of writing a comment saying "it's been running for 6 hours with no hang, I don't think it's gonna happen", and then it just hung. Here's the backtrace from gdb:

$ gdb -ex "attach $(pgrep salmon | head -1)" -ex bt -ex detach -ex quit
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 29153
[New LWP 29155]
[New LWP 29156]
[New LWP 29157]
[New LWP 18084]
[New LWP 18085]
[New LWP 18086]
[New LWP 18087]
[New LWP 18088]
[New LWP 18089]
[New LWP 18090]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000000000578b90 in __log_finite@plt ()
#0  0x0000000000578b90 in __log_finite@plt ()
#1  0x0000000000637ccc in double std::gamma_distribution<double>::operator()<pcg_detail::engine<unsigned int, unsigned long, pcg_detail::xsh_rr_mixin<unsigned int, unsigned long>, true, pcg_detail::unique_stream<unsigned long>, pcg_detail::default_multiplier<unsigned long> > >(pcg_detail::engine<unsigned int, unsigned long, pcg_detail::xsh_rr_mixin<unsigned int, unsigned long>, true, pcg_detail::unique_stream<unsigned long>, pcg_detail::default_multiplier<unsigned long> >&, std::gamma_distribution<double>::param_type const&) ()
#2  0x0000000000634b8d in tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, sampleRoundNonCollapsedMultithreaded_(std::vector<std::pair<TranscriptGroup const, TGValue>, std::allocator<std::pair<TranscriptGroup const, TGValue> > >&, std::vector<bool, std::allocator<bool> >&, std::vector<unsigned long, std::allocator<unsigned long> >&, std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> >&, Eigen::Matrix<double, -1, 1, 0, -1, 1>&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> >&, std::vector<unsigned int, std::allocator<unsigned int> >&)::{lambda(tbb::blocked_range<unsigned long> const&)#2}, tbb::auto_partitioner const>::execute() ()
#3  0x00007f20171ca492 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f1fd8dc0c00, parent=..., child=<optimized out>)
    at ../../src/tbb/custom_scheduler.h:469
#4  0x00007f20171c85a0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x7f1fd8dc0c00, first=..., next=@0x7f1fd8db7d38: 0x7f1fd8db7340)
    at ../../src/tbb/scheduler.cpp:649
#5  0x0000000000632eba in sampleRoundNonCollapsedMultithreaded_(std::vector<std::pair<TranscriptGroup const, TGValue>, std::allocator<std::pair<TranscriptGroup const, TGValue> > >&, std::vector<bool, std::allocator<bool> >&, std::vector<unsigned long, std::allocator<unsigned long> >&, std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> >&, Eigen::Matrix<double, -1, 1, 0, -1, 1>&, std::vector<double, std::allocator<double> > const&, std::vector<double, std::allocator<double> >&, std::vector<unsigned int, std::allocator<unsigned int> >&) ()
#6  0x000000000063936f in bool CollapsedGibbsSampler::sample<ReadExperiment>(ReadExperiment&, SalmonOpts&, std::function<bool (std::vector<double, std::allocator<double> > const&)>&, unsigned int) ()
#7  0x000000000065d783 in salmonQuantify(int, char**) ()
#8  0x000000000057dbcf in main ()
Detaching from program: /home/ryan/bin/salmon, process 29153
DarwinAwardWinner commented 7 years ago

Anyway, I have both hung processes suspended in a shell, so I can continue inspecting them as required.

DarwinAwardWinner commented 7 years ago

Hmm. It seems the backtrace is not always the same. Here's a zip file of 100 random backtraces, generated by:

for i in `seq 100`; do gdb -ex "attach $(pgrep salmon | head -1)" -ex bt -ex detach -ex quit &> ~/Dropbox/salmon-gdb-bt-ldlib-$i.txt; sleep 0.01; done

salmon-gdb-bt-ldlib.zip

rob-p commented 7 years ago

Ok, I'm thinking more and more it has something to do with TBB (https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/393172). I'm investigating this.

rob-p commented 7 years ago

In other news --- after many many continuous tries, I was finally able to reproduce the issue! I've never been so happy to reproduce a bug. Now to find out if it's our fault or TBB's fault.

rob-p commented 7 years ago

Ok, so using your while true; do salmon; done paradigm, I was able to reproduce the issue on your original dataset after a few runs each time. This has made working on the issue much easier. After a lot of debugging and walking through the stack, I'm almost certain the issue is a "rare" deadlock bug in the Intel TBB library. They mention having fixed this in the latest release. So, I've updated our CMake file to fetch that release. After upgrading to this version of TBB, I've let the data set run all night long (~8 hours so far) and have not seen the hanging behavior. I'll let it keep running for a while, but could you try pulling the latest commit from develop and see if that resolves the problem for you? You'll need to make sure you clean the build directory and remove the previous CMakeCache.txt files so that it will pull in and build the new TBB. You'll also need to make sure that these TBB libraries are the ones that are being used by Salmon. Please let me know if this fixes the hanging for you as well (and I'll let you know if I see it again).

DarwinAwardWinner commented 7 years ago

Ok, now it hangs using 2 cores (200% CPU), and when I did the 100 backtraces thing with gdb, I got the following 4 unique backtraces with the indicated frequencies:

Since it's now running on 2 cores, I also did a backtrace of the other thread, which produces backtraces more reminiscent of the original report, like this: salmon-gdb-bt-ldlib-thread2-1.txt. They're more varied, so once again here's a zip of all 100 of them: thread2-100-backtraces.zip

DarwinAwardWinner commented 7 years ago

Verification that I'm running with the right version of TBB:

$ ls ~/src/salmon/external/tbb* -d
/home/ryan/src/salmon/external/tbb-2017_U3  /home/ryan/src/salmon/external/tbb-2017_U3.tgz
$ ldd `which salmon`
        linux-vdso.so.1 =>  (0x00007ffc739fd000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5745a93000)
        libtbb.so.2 => /home/ryan/src/salmon/external/install/lib/libtbb.so.2 (0x00007f5745864000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f5745642000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f574543a000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f5745130000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f5744f1a000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5744b51000)
        /lib64/ld-linux-x86-64.so.2 (0x000055d36c7b0000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f574494c000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f57445ca000)
rob-p commented 7 years ago

Ok, damn. Mine's been running for ~11.5 hours now with no hang. This is definitely a TBB issue. I wonder why it only happens in the Gibbs sampler though. Perhaps I'll just re-engineer that code to stop using tbb and just use a plain old thread pool.

DarwinAwardWinner commented 7 years ago

For the record, it ran successfully 4 times before hanging on the 5th time. That's more or less in line with what was happening before.

rob-p commented 7 years ago

Ok, trying something rather desperate. Can you try commit 8977adf7f2ded7ec5ba7096448e70bebaf762d2f (just pushed). The gibbs will run a bit slower, since I moved sampling from the gamma distribution outside of a parallel for. I just want to isolate the hang to that specific parallel_for loop. If this resolves the problem, I can think of another way to speed up that sampling.