Open IvoryC opened 4 years ago
(copied from slack)
jyoun144
While downloading 675 samples from the Rifaximin dataset (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SAMN07277992&o=acc_s%3Aa), I intermittently observed the following segmentation fault error:
2020-05-11T09:16:36 fasterq-dump.2.9.4 err: cmn_iter.c cmn_read_uint8_array( #451057 ).VCursorCellDataDirect() -> RC(rcNS,rcFile,rcReading,rcTransfer,rcIncomplete)
2020-05-11T09:16:36 fasterq-dump.2.9.4 err: row #451057 : READ.len(302) != QUALITY.len(0)
2020-05-11T09:16:36 fasterq-dump.2.9.4 fatal: SIGNAL - Segmentation fault
Per the following reference, the segmentation fault was resolved by sra-tools version 2.10.3. I was able to load sra-tools version 2.10.5 on the copperhead cluster, and I no longer observed the segmentation fault.
https://github.com/ncbi/sra-tools/wiki
We do still need to give BioLockJ a mechanism to recover from transient errors. If sra-tools 2.10 is a really solid fix for this particular case, then maybe we can get away with relying on the user-restart recovery option, plus a bit of code that checks the version of fasterq-dump and prints a warning if it is older than 2.10.3.
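A minimal sketch of that version check, in Python for illustration only (BioLockJ itself would need its own implementation). The names `MIN_VERSION`, `parse_version` and `is_version_ok` are hypothetical. It assumes the version can be pulled out of a string shaped like the `fasterq-dump.2.9.4` prefix in the log lines above; how the module obtains that string is left open.

```python
import re
import warnings

# Oldest release reported (above) to resolve the segmentation fault.
MIN_VERSION = (2, 10, 3)


def parse_version(text):
    """Return the first 'X.Y.Z' found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_version_ok(text):
    """Warn and return False if the version is unknown or older than MIN_VERSION."""
    version = parse_version(text)
    if version is None:
        warnings.warn("Could not determine the fasterq-dump version.")
        return False
    if version < MIN_VERSION:
        found = ".".join(str(part) for part in version)
        warnings.warn(
            f"fasterq-dump {found} is older than 2.10.3; "
            "intermittent segmentation faults have been observed."
        )
        return False
    return True
```

Comparing tuples of ints avoids the string-comparison trap where "2.9.4" would sort after "2.10.3". As the logs below show, a passing check does not rule out the other transient failures.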
...dang... The process was going so well for me with sra-tools v2.10.5 ... until....
. . .
spots read : 30,496,171
reads read : 60,992,342
reads written : 60,992,342
spots read : 19,649,118
reads read : 39,298,236
reads written : 39,298,236
2020-06-16T15:15:53 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #8972904 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-06-16T15:15:54 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #16030251 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-06-16T15:15:54 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #19451930 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-06-16T15:16:34 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #2587664 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-06-16T15:16:35 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #12835168 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-06-16T15:16:35 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #6429927 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
fasterq-dump (PID 30944) quit with error code 3
And ... again...
spots read : 19,605,923
reads read : 39,211,846
reads written : 39,211,846
spots read : 9,267,700
reads read : 18,535,400
reads written : 18,535,400
spots read : 4,387
reads read : 8,774
reads written : 8,774
spots read : 5,587
reads read : 11,174
reads written : 11,174
2020-06-16T20:00:35 fasterq-dump.2.10.5 err: connection failed while opening file within cryptographic module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra73/SRR/008996/SRR9212714'
2020-06-16T20:00:35 fasterq-dump.2.10.5 err: cmn_iter.c cmn_get_acc_type( 'SRR9212714', 'SEQUENCE', 'NAME' ).VDBManagerOpenDBRead() -> RC(rcKrypto,rcFile,rcOpening,rcConnection,rcFailed)
fasterq-dump (PID 10402) quit with error code 3
spots read : 54,998
reads read : 109,996
reads written : 109,996
spots read : 53,588
reads read : 107,176
reads written : 107,176
spots read : 49,342
reads read : 98,684
reads written : 98,684
spots read : 45,647
reads read : 91,294
reads written : 91,294
2020-06-16T20:58:52 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_uint8_array( #13357 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-06-16T20:58:52 fasterq-dump.2.10.5 err: row #13357 : READ.len(522) != QUALITY.len(0) (F)
2020-06-16T20:58:52 fasterq-dump.2.10.5 fatal: SIGNAL - Segmentation fault
fasterq-dump (PID 14461) quit with error code 1
exit 1 this time.
Got through a whole lot... and then....
spots read : 26,846,770
reads read : 53,693,540
reads written : 53,693,540
spots read : 74,662
reads read : 149,324
reads written : 149,324
spots read : 22,693,254
reads read : 45,386,508
reads written : 45,386,508
2020-06-17T00:31:21 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #10988311 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-06-17T00:31:22 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_uint8_array( #2974965 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted)
2020-06-17T00:31:22 fasterq-dump.2.10.5 err: row #2974965 : READ.len(297) != QUALITY.len(0) (F)
2020-06-17T00:31:22 fasterq-dump.2.10.5 fatal: SIGNAL - Segmentation fault
fasterq-dump (PID 4636) quit with error code 1
The SraDownload module has a major weakness: it relies on this unreliable thing called the internet, and on a remote server that we have no control over. Sometimes something goes wrong.
I think I've gotten status code [3], and it has been an intermittent error (i.e., try again and it's fine). On the cluster, SraDownload made this log:
and this error:
It downloaded some files successfully, then hit a problem with one (and shut down).
I think a possible solution might be to have the module create a bash function to run its lines, so it has control over the initial handling of the error. If the module is followed in the pipeline by another instance of the SraDownload module, then when it gets an exit status of [3] or [139] it will just return 0, meaning "everything is fine". Then the next module runs, and it's the SraDownload module again. That instance checks which files already exist, determines which ones still need to be downloaded, and runs scripts to download just those (i.e., just the ones that failed the first attempt). If the SraDownload module sees that the next module in the pipeline is NOT another SraDownload module, then it writes its bash function so that all non-0 exits are returned.
A pipeline can be configured to have the SraDownload module several times:
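The exit-status decision described above can be sketched as follows. This is Python for illustration, not the bash function the module would actually write, and `RETRYABLE_EXIT_CODES` and `effective_exit_status` are hypothetical names. The retryable set holds only the two codes named in this comment; the logs earlier in the thread also show a segmentation fault reported as exit code 1, so the set would probably need tuning.

```python
# Exit codes treated as transient, taken from the proposal above (assumption:
# this set is incomplete; the logs also show a segfault surfacing as exit 1).
RETRYABLE_EXIT_CODES = {3, 139}


def effective_exit_status(exit_code, next_is_sra_download):
    """Map a download script's exit code to the status the module reports.

    A retryable failure is reported as 0 only when another SraDownload
    instance follows and can pick up the missing files; otherwise every
    non-zero code is passed through unchanged.
    """
    if exit_code == 0:
        return 0
    if next_is_sra_download and exit_code in RETRYABLE_EXIT_CODES:
        return 0
    return exit_code
```

For example, `effective_exit_status(3, True)` gives 0, while `effective_exit_status(3, False)` gives 3 and stops the pipeline.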
SRA1 might download everything, and then SRA2 and SRA3 will just do nothing. Or, SRA1 might fail a few times, but just return 0 and let SRA2 deal with it. Anything that still fails to download by the end of SRA3 will cause an actual error that stops the pipeline.