BioLockJ-Dev-Team / sheepdog_testing_suite

Test suite for BioLockJ development team.

SraDownload module needs to recover from failed downloads #212

Open IvoryC opened 4 years ago

IvoryC commented 4 years ago

The SraDownload module has a major weakness: it relies on this unreliable thing called the internet, and on a remote server that we have no control over. Sometimes something goes wrong.

I think I've gotten status code [3], and it has been an intermittent error (i.e., try again and it's fine). On the cluster, SraDownload made this log:

00_SraDownload $cat log/00.0_SraDownload.log 
spots read      : 136,608
reads read      : 273,216
reads written   : 273,216
spots read      : 104,822
reads read      : 209,644
reads written   : 209,644
spots read      : 144,431
reads read      : 288,862
reads written   : 288,862
2020-04-30T18:39:58 fasterq-dump.2.9.4 fatal: SIGNAL - Segmentation fault 
/scratch/ieclabau/pipelines/monkey_4_2020Apr30/00_SraDownload/script/00.0_SraDownload.sh: line 26: 23143 Segmentation fault      (core dumped) ${1}

and this error:

$cat biolockjFailed 
ERROR TYPE:    DirectModuleException
ERROR MESSAGE: SCRIPT FAILED: 00.0_SraDownload.sh_Failures | Line #38 failure status code [ 139 ]:  fasterq-dump -O /projects/afodor_research/ieclabau/afodor/data/SRP139357 SRR6979112

It downloaded some files successfully, then hit a problem with one (and shut down).

I think a possible solution might be to have the module create a bash function to run its lines, so it has control over the initial handling of the error. If the module is followed in the pipeline by another instance of the SraDownload module, then when it gets an exit status of [3] or [139] it will just return 0, meaning "everything is fine". Then the next module runs, and it's the SraDownload module again. That instance checks which files already exist, determines which ones still need to be downloaded, and runs scripts to download just those (i.e., just the ones that failed the first attempt). If the SraDownload module sees that the next module in the pipeline is NOT another SraDownload module, then it writes its bash function so that all non-0 exits are returned.
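A minimal sketch of what such a bash function could look like. The names `run_tolerant`, `TOLERATED_CODES`, and `RETRY_MODULE_FOLLOWS` are hypothetical (nothing in BioLockJ defines them), and the sketch assumes the module would set `RETRY_MODULE_FOLLOWS` when it writes the script:

```shell
#!/usr/bin/env bash
# Sketch only: run_tolerant, TOLERATED_CODES and RETRY_MODULE_FOLLOWS are
# hypothetical names, not existing BioLockJ script variables.

# Exit codes treated as transient; 3 and 139 are the ones reported in this issue.
TOLERATED_CODES="3 139"

# Assumption: the module writes "true" here when the next module in the
# pipeline is another SraDownload instance, and "false" otherwise.
RETRY_MODULE_FOLLOWS="${RETRY_MODULE_FOLLOWS:-false}"

# Run a command. Report success for a tolerated failure only when a retry
# module follows; otherwise pass the original exit status through unchanged.
run_tolerant() {
    "$@"
    local status=$?
    if [ "$status" -eq 0 ]; then
        return 0
    fi
    if [ "$RETRY_MODULE_FOLLOWS" = "true" ]; then
        local code
        for code in $TOLERATED_CODES; do
            if [ "$status" -eq "$code" ]; then
                echo "Tolerating exit status $status from: $*" >&2
                return 0
            fi
        done
    fi
    return "$status"
}

# Example call (not executed here):
# run_tolerant fasterq-dump -O /path/to/dir SRR6979112
```

With `RETRY_MODULE_FOLLOWS=false` the wrapper is transparent, so the last SraDownload instance in a pipeline would still fail on any non-0 exit.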
A pipeline can be configured to have the SraDownload module several times:

#BioModule biolockj.module.getData.sra.SraDownload AS SRA1
#BioModule biolockj.module.getData.sra.SraDownload AS SRA2
#BioModule biolockj.module.getData.sra.SraDownload AS SRA3
#BioModule biolockj.module.classifier.r16s.RdpClassifier
sra.destinationDir=/path/to/dir
sra.sraAccList=SraAccList.txt

SRA1 might download everything, and then SRA2 and SRA3 will just do nothing. Or, SRA1 might fail a few times, but just return 0 and let SRA2 deal with it. Anything that still fails to download by the end of SRA3 will cause an actual error that stops the pipeline.
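The "check which files already exist" step for SRA2 and SRA3 could be sketched roughly as follows. This assumes the accession list has one accession per line and that a finished download leaves `<accession>.fastq` or `<accession>_<n>.fastq` in the destination directory; the function names are hypothetical. A file left behind by an interrupted download would fool this check, so a real implementation would need some completeness marker.

```shell
#!/usr/bin/env bash
# Sketch only: has_fastq and missing_accessions are hypothetical helpers.

# has_fastq DEST_DIR ACCESSION
# Succeeds if DEST_DIR holds ACCESSION.fastq or ACCESSION_<n>.fastq.
has_fastq() {
    local f
    for f in "$1/$2.fastq" "$1/$2"_*.fastq; do
        # An unmatched glob stays literal, so -e simply fails for it.
        [ -e "$f" ] && return 0
    done
    return 1
}

# missing_accessions ACC_LIST DEST_DIR
# Prints every accession from ACC_LIST that has no fastq file in DEST_DIR.
missing_accessions() {
    local acc_list=$1 dest_dir=$2 acc
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue   # skip blank lines
        if ! has_fastq "$dest_dir" "$acc"; then
            printf '%s\n' "$acc"
        fi
    done < "$acc_list"
}

# Example call (not executed here), using the values from the config above:
# missing_accessions SraAccList.txt /path/to/dir
```

If the function prints nothing, the instance has nothing left to do and can finish immediately.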

IvoryC commented 4 years ago

(copied from Slack) jyoun144
While downloading 675 samples from the Rifaximin dataset (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SAMN07277992&o=acc_s%3Aa), I intermittently observed the following segmentation fault error:

12:20
2020-05-11T09:16:36 fasterq-dump.2.9.4 err: cmn_iter.c cmn_read_uint8_array( #451057 ).VCursorCellDataDirect() -> RC(rcNS,rcFile,rcReading,rcTransfer,rcIncomplete)
2020-05-11T09:16:36 fasterq-dump.2.9.4 err: row #451057 : READ.len(302) != QUALITY.len(0)
2020-05-11T09:16:36 fasterq-dump.2.9.4 fatal: SIGNAL - Segmentation fault

12:22
Per the following reference, the segmentation fault was resolved by sra-tools version 2.10.3. I was able to load sra-tools version 2.10.5 on the copperhead cluster, and I no longer observed the segmentation fault. https://github.com/ncbi/sra-tools/wiki

IvoryC commented 4 years ago

We do still need to give BioLockJ a mechanism to recover from transient errors. If sra-tools 2.10 is a really solid fix for this particular case, then maybe we can get away with relying on the user-restart recovery option and have a bit of code that checks the version of fasterq-dump and prints a warning if it is older than 2.10.3.
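A rough sketch of that version check, written as bash for illustration. The function names are hypothetical, and the extraction shown in the trailing comment assumes `fasterq-dump --version` prints a dotted `x.y.z` version somewhere in its output:

```shell
#!/usr/bin/env bash
# Sketch only: version_lt and warn_if_old_fasterq_dump are hypothetical helpers.

# version_lt A B
# Succeeds if dotted version A (x.y.z) is strictly lower than B.
version_lt() {
    local IFS=.
    local -a a=($1) b=($2)   # split each version on dots
    local i x y
    for i in 0 1 2; do
        x=${a[i]:-0}
        y=${b[i]:-0}
        if [ "$x" -lt "$y" ]; then return 0; fi
        if [ "$x" -gt "$y" ]; then return 1; fi
    done
    return 1   # versions are equal
}

# warn_if_old_fasterq_dump FOUND [MINIMUM]
# Prints a warning to stderr when FOUND is older than MINIMUM (default 2.10.3).
warn_if_old_fasterq_dump() {
    local found=$1 minimum=${2:-2.10.3}
    if version_lt "$found" "$minimum"; then
        echo "WARNING: fasterq-dump $found is older than $minimum;" \
             "intermittent segmentation faults were reported with 2.9.4." >&2
    fi
    return 0
}

# Example use (not executed here; assumes the version output format noted above):
# found=$(fasterq-dump --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
# warn_if_old_fasterq_dump "$found"
```

The comparison is numeric per field, so 2.9.4 correctly sorts below 2.10.3, which a plain string comparison would get wrong.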

IvoryC commented 4 years ago

...dang... The process was going so well for me with sra-tools v2.10.5 ... until....

. . .
spots read      : 30,496,171
reads read      : 60,992,342
reads written   : 60,992,342
spots read      : 19,649,118
reads read      : 39,298,236
reads written   : 39,298,236
2020-06-16T15:15:53 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #8972904 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
2020-06-16T15:15:54 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #16030251 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
2020-06-16T15:15:54 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #19451930 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
2020-06-16T15:16:34 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #2587664 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
2020-06-16T15:16:35 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #12835168 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
2020-06-16T15:16:35 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #6429927 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
fasterq-dump (PID 30944) quit with error code 3

IvoryC commented 4 years ago

And ... again...

spots read      : 19,605,923
reads read      : 39,211,846
reads written   : 39,211,846
spots read      : 9,267,700
reads read      : 18,535,400
reads written   : 18,535,400
spots read      : 4,387
reads read      : 8,774
reads written   : 8,774
spots read      : 5,587
reads read      : 11,174
reads written   : 11,174
2020-06-16T20:00:35 fasterq-dump.2.10.5 err: connection failed while opening file within cryptographic module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra73/SRR/008996/SRR9212714'
2020-06-16T20:00:35 fasterq-dump.2.10.5 err: cmn_iter.c cmn_get_acc_type( 'SRR9212714', 'SEQUENCE', 'NAME' ).VDBManagerOpenDBRead() -> RC(rcKrypto,rcFile,rcOpening,rcConnection,rcFailed)
fasterq-dump (PID 10402) quit with error code 3

IvoryC commented 4 years ago

spots read      : 54,998
reads read      : 109,996
reads written   : 109,996
spots read      : 53,588
reads read      : 107,176
reads written   : 107,176
spots read      : 49,342
reads read      : 98,684
reads written   : 98,684
spots read      : 45,647
reads read      : 91,294
reads written   : 91,294
2020-06-16T20:58:52 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_uint8_array( #13357 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
2020-06-16T20:58:52 fasterq-dump.2.10.5 err: row #13357 : READ.len(522) != QUALITY.len(0) (F) 
2020-06-16T20:58:52 fasterq-dump.2.10.5 fatal: SIGNAL - Segmentation fault 
fasterq-dump (PID 14461) quit with error code 1

Exit code 1 this time.

IvoryC commented 4 years ago

Got through a whole lot... and then....

spots read      : 26,846,770
reads read      : 53,693,540
reads written   : 53,693,540
spots read      : 74,662
reads read      : 149,324
reads written   : 149,324
spots read      : 22,693,254
reads read      : 45,386,508
reads written   : 45,386,508
2020-06-17T00:31:21 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_String( #10988311 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
2020-06-17T00:31:22 fasterq-dump.2.10.5 err: cmn_iter.c cmn_read_uint8_array( #2974965 ).VCursorCellDataDirect() -> RC(rcPS,rcCondition,rcWaiting,rcTimeout,rcExhausted) 
2020-06-17T00:31:22 fasterq-dump.2.10.5 err: row #2974965 : READ.len(297) != QUALITY.len(0) (F) 
2020-06-17T00:31:22 fasterq-dump.2.10.5 fatal: SIGNAL - Segmentation fault 
fasterq-dump (PID 4636) quit with error code 1