galaxyproject / usegalaxy-playbook

Ansible Playbook for usegalaxy.org
Academic Free License v3.0

NCBI fasterq_dump tool + fastq_dump fail at usegalaxy.org and usegalaxy.eu #297

Closed (jennaj closed this issue 3 years ago)

jennaj commented 4 years ago

Problem:

Both the prior and the most current versions fail. OK at usegalaxy.eu.

Tested with the tool form's sample accessions plus others (included in various tutorials). The data downloads fine from EBI SRA either way it is retrieved: directly with the Get Data tool, or by pasting the FTP URL into the Upload tool.

Tools:

toolshed.g2.bx.psu.edu/repos/iuc/sra_tools/fasterq_dump
toolshed.g2.bx.psu.edu/repos/iuc/sra_tools/fastq_dump

Tests:

Data is tagged to clarify each test condition

The bug report and the error message in the history "info" content vary, but all failures seem to trace back to a problem with "prefetch".

Fixed default configuration

2020-04-24T17:18:54 prefetch.2.10.3: 1) Downloading 'SRR925743'...
2020-04-24T17:18:54 prefetch.2.10.3:  Downloading via https...
2020-04-24T17:20:58 prefetch.2.10.3:  https download succeed

ping @natefoo @davebx

jennaj commented 4 years ago

Update: Does not impact all accessions

Workaround for end-users:

If your accessions are failing, try using the alternative fastq SRA retrieval tool Get Data > EBI SRA.

FAQ: https://galaxyproject.org/support/loading-data/#get-data-ebi-sra
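
For users comfortable with URLs, the fastq FTP links that the Upload tool needs can also be looked up directly. A minimal sketch using ENA's filereport endpoint (the endpoint and field names are assumptions based on ENA's public API documentation, not something described in this issue):

# hedged sketch: list the EBI/ENA fastq FTP links for an accession,
# which can then be pasted into Galaxy's Upload tool as URLs
curl -s 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR925743&result=read_run&fields=fastq_ftp' \
  | tr ';' '\n'     # paired-end links come back semicolon-separated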

jennaj commented 3 years ago

The above was putatively fixed for a while, but there are new problems now. One looks like the prior problem; maybe it was never really fully addressed.

  1. One is a config issue at org that needs to be fixed
  2. Second is a tool issue (bad path) that presents at both servers/both tools

usegalaxy.org >> job handler issues for fasterq_dump
usegalaxy.org + usegalaxy.eu >> path problem with both fasterq_dump and fastq_dump. Doesn't fail on all data, just some, including the example accession on the tool forms: SRR925743

version 2.10.8+galaxy0 for both (most current)

fasterq_dump/org

Job Information
Encountered an unhandled exception while caching job destination dynamic rule.

fastq_dump/org + eu

stderr
2021-01-20T23:40:30 prefetch.2.10.8 int: connection failed while opening file within cryptographic module - Cannot KNSManagerMakeClientRequest: https://sra-download.ncbi.nlm.nih.gov/traces/refseq/NT_167214.1

^^ Weird path. Also, the SRA archive for this run can be uploaded by URL and the reads extracted from it (using these same tools). Source: https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR925743/SRR925743.1
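
The same URL-upload-then-extract workaround can be reproduced outside Galaxy; a sketch, assuming sra-tools 2.10.x is installed locally and using the source URL above (output directory name is just illustrative):

# hedged sketch of the workaround: fetch the archive, then extract reads from the local file
mkdir -p reads
curl -L -o SRR925743.sra 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR925743/SRR925743.1'
fasterq-dump --split-files -O reads ./SRR925743.sra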

Test histories:
https://usegalaxy.org/u/jen-galaxyproject/h/test-ncbi-sra
https://usegalaxy.eu/u/jenj/h/test-ncbi-sra

ping @natefoo for org. @bgruening, is this a papercut, a tool config issue, or a tool bug? Should I open a ticket against the tool's repo? Related to NCBI reorganizing data into the cloud, right?

mvdbeek commented 3 years ago

Jen, what are the path issues? SRA has broken accessions; if the tool were broken, all accessions would be broken. So if you could test a known working accession, that would help us.

Encountered an unhandled exception while caching job destination dynamic rule.

This is indeed a Galaxy config issue.

jennaj commented 3 years ago

Thanks, @mvdbeek -- the config issues were fixed by @natefoo about 5 hours ago. I probably talked too fast in the meeting. I have a test rerun to close that part out (still queued -- the clusters are busy). So, that is not the root problem.

Instead, there is a problem with the tools at both servers, and it doesn't impact all accessions. It seems to be limited to some human accessions, not any non-human ones (that I have found or that users have reported), and what is going wrong shows up in the stdout. I can't figure out what the difference is between the failing and succeeding accessions, but it does look like a bug. I'm not sure whether it is a wrapper problem or not.

Could you give feedback? If it is a wrapper problem, I can ticket it at the IUC repo -- or if you just want me to do that anyway, I can. If it turns out to be something we won't address, and we just accept that some accessions fail, we'll still need to pick a new example accession for the tool form, update the wrapper, and get it all updated/installed everywhere.

The original tests all used the example accession on the tool form (SRR925743, human, our example), which has always worked before. I tested other accessions and looked through bug reports. This has been seen before, but it didn't impact our example accession until now. That's why I am digging a bit more.

The stderr reported back is odd. The command line uses the correct accession.

2021-01-20T23:40:30 prefetch.2.10.8 int: connection failed while opening file within cryptographic module - Cannot KNSManagerMakeClientRequest: https://sra-download.ncbi.nlm.nih.gov/traces/refseq/NT_167214.1

The stdout in the failed fastq_dump jobs shows the exact failure point. Successful jobs are different. Failed jobs seem to be pulling the human genome chromosomes as a "dependency". I thought that was fixed... but something similar seems to be going on again. And now our tool form's example accession (human) also has that problem. Users are reporting that the example accession errors when they self-test (along with their original problematic accessions), and they are confused.

fasterq_dump does not report stdout, but if one tool fails for the direct data retrieval option, the other does too. For the SRA archive read extraction option, success/failure does differ between the two tools, even when using that same accession (SRR925743, human, our example), but it is consistent between the two servers (so this is not a server config issue -- or if it is, both have the same problem).

Failed jobs report this, then a bunch more log output where the human genome chromosomes start to download, which of course fails. Tool version is 2.10.8+galaxy0.

2021-01-21T20:02:27 prefetch.2.10.8: 'SRR925766' has 93 unresolved dependencies

Successful jobs report this. Tool version is 2.10.8+galaxy0.

2021-01-21T17:21:10 prefetch.2.10.8: 'SRR11772204' has 0 unresolved dependencies

Full reports below. In short, some accessions -- only human, but not ALL human -- report those "unresolved dependencies".
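
A quick way to triage a batch of these logs is to look for the prefetch dependency lines directly; a sketch, assuming the job stdout has been saved locally (prefetch.log is a placeholder filename):

# hedged triage sketch: how many reference dependencies did prefetch report,
# and how many of the per-contig downloads failed?
grep -c 'unresolved dependencies' prefetch.log
grep 'ncbi-acc:' prefetch.log | grep -c 'failed to download'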


More details if needed

I ran more tests at each server, both tools, using the other example accession from the tool form (ERR343809, not-human) and one of the accessions in this tutorial: https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/sars-cov-2/tutorial.html (SRR11772204, not-human). Both accessions worked, both tools, both servers.

So -- the only difference I noticed at that point is that the failed accessions were human and the successes were not. Given the current stderr & stdout, some false dependency seems to be triggering an (attempted) download of the human genome when certain human accessions are queried.

To see if I could narrow it down, I tested whether the original submission source is a factor (SRR vs ERR, human vs not), with both tools at both servers. I also tossed in a rerun of the user-reported SRR (SRR925766, human) and a different human SRR than the one on the tool form (SRR13043505, human). I won't break it all down here since the results are the same so far between the two tools & servers, and all datasets are tagged.

The results are consistent -- some human-only accessions have an odd dependency triggered. I think that's a problem.

  1. (ERR3197112, human) -- 1 success, 3 still executing
  2. (SRR11184223, not human) -- works
  3. different human SRR (SRR13043505, human) -- works
  4. rerun prior failed user-reported human SRR (SRR925766, human) -- 2 failed, 2 executing
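
For anyone who wants to rerun this matrix outside Galaxy, a minimal sketch (it assumes sra-tools 2.10.x is on PATH; the per-accession log filenames are just illustrative):

# hedged reproduction sketch of the test matrix above, run outside Galaxy
for acc in SRR925743 SRR925766 SRR13043505 SRR11772204 ERR3197112 SRR11184223; do
  if prefetch "$acc" > "${acc}.prefetch.log" 2>&1; then
    echo "$acc OK"
  else
    echo "$acc FAILED (see ${acc}.prefetch.log)"
  fi
done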

Same tools, but extracting from an SRA archive instead of directly fetching from NCBI -- not sure yet

Downloading the SRA archive and extracting fastq from it, with both tools at both servers, might have some of the same problems. Most jobs are still running, but one has already failed and one succeeded. It looks like fastq_dump can pull the reads out but fasterq_dump cannot.

mvdbeek commented 3 years ago

Given the current stderr & stdout, some false dependency seems to be trigging an (attempted) download of the human genome when certain human accessions are queried.

How did you reach this conclusion? I don't see anything related to that. If some accessions work, I'm going to say the tool is working fine. I did submit a job on .org that failed. The job ran on Stampede, with --partition=long --nodes=1 --account=TG-MCB140147 --ntasks=68 --time=72:00:00. That is completely over-allocated; this tool doesn't need 72 hours or 68 tasks, and if it takes that long, something else is wrong. I doubt you'll see any performance improvement beyond 4 cores (which are only used for pigz), and 72 hours of downloading from the SRA is crazy. A conservative estimate of 20 MB/s puts this at more than 5 terabytes of data one could download in that span. And that's highly compressed; in reality that would be even larger fastq.gz files.
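
For reference, the arithmetic behind that estimate (a sustained 20 MB/s for the full 72-hour walltime) as a back-of-the-envelope shell check:

# sanity check of the 72 h at 20 MB/s figure quoted above
echo $(( 20 * 3600 * 72 )) MB    # prints 5184000 MB, i.e. roughly 5.2 TB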

mvdbeek commented 3 years ago

The same accession that failed I was able to download with the same tool mounted using .org's cvmfs, so something's up with main's config.

jennaj commented 3 years ago

Thanks for reviewing @mvdbeek

The same accession that failed I was able to download with the same tool mounted using .org's cvmfs, so something's up with main's config. (marius)

Both .org and .eu have the same problem, for the same accessions, and only human.

When the tool is opened from the tool panel, v 2.10.8+galaxy0 loads with no warning that it isn't the most current version; I'd used that for testing. The versions menu does have 2.10.8+galaxy1 and 2.10.8+galaxy2 listed. Which is really the most current?

The toolshed has "2.10.8+galaxy0" as the most current version for all.

https://toolshed.g2.bx.psu.edu/repository?repository_id=cb2d31dfab58ee88

The last change in the tool repo was 5 months ago, and this commit looks like it isn't bumping the version up, but down. Maybe this is the root of the problem?

https://github.com/galaxyproject/tools-iuc/commit/0ed1a310852a5ffa99706b8907bee43806706dc6
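
One way to double-check what that commit actually changed is to review its diff locally; a sketch (nothing here beyond the commit hash already cited above):

# hedged sketch: inspect the referenced tools-iuc commit for version-string changes
git clone https://github.com/galaxyproject/tools-iuc.git
cd tools-iuc
git show 0ed1a310852a5ffa99706b8907bee43806706dc6    # review the diff for version/suffix changes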


Given the current stderr & stdout, some false dependency seems to be triggering an (attempted) download of the human genome when certain human accessions are queried. (jen)
How did you reach this conclusion? I don't see anything related to that. (marius)

Failed jobs report this, then a bunch more log where the human genome chroms start to download, but that fails of course. Tool version is 2.10.8+galaxy0.

2021-01-21T20:02:27 prefetch.2.10.8: 'SRR925766' has 93 unresolved dependencies
2021-01-20T22:59:25 prefetch.2.10.8: 'SRR925743' has 93 unresolved dependencies

Successful jobs report this. Tool version is 2.10.8+galaxy0.

2021-01-21T17:21:10 prefetch.2.10.8: 'SRR11772204' has 0 unresolved dependencies

Full logs with the "unresolved dependencies" highlighted:

Rerun (failed job) stdout using version 2.10.8+galaxy0 run with (SRR925766, human): https://gist.github.com/jennaj/95d8df7abd369fc4b93db278d316fece

Current (failed job) stdout using version 2.10.8+galaxy0 run with (SRR925743, human, our example): https://gist.github.com/jennaj/66831d515013018c706638a96bbddddf

Current (success job) stdout using version 2.10.8+galaxy0 run with (SRR13043505, human): https://gist.github.com/jennaj/e58eafa48a27481fa64a9934d7dbe8de

mvdbeek commented 3 years ago

These are NCBI issues. Unresolved dependencies are not Galaxy dependencies; these are probably contigs that couldn't be downloaded. I'd report this to NCBI.

jennaj commented 3 years ago

Actually, I just tried using version 2.10.8+galaxy2 at both servers. Seems to be working.

But the GUI is reporting that it is not the most current tool version, at both servers.

these are probably contigs that couldn't be downloaded. I'd report this to NCBI.

We don't want or need those contigs downloaded, just the accession queried.
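
To see which reference sequences an accession declares as dependencies (and which prefetch would therefore try to resolve), sra-tools ships an align-info utility; a sketch, with the caveat that the output format varies by toolkit version:

# hedged sketch: list the reference dependencies prefetch tries to resolve for this run
align-info SRR925743    # aligned (cSRA) runs report their reference accessions here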

Maybe this is just a tool version issue in the repo, cascading into the ToolShed, and then the public servers?

(Four screenshots from 2021-01-22 showing the tool versions at both servers)

jennaj commented 3 years ago

OK, 2.10.8+galaxy0 is the most current version. @mvdbeek I agree now that it is probably a problem at NCBI. The human genome contigs should not be pulled in as dependencies when downloading human read accessions.

2021-01-22T17:45:21 prefetch.2.10.7: 1) Downloading 'SRR925743'...
2021-01-22T17:45:22 prefetch.2.10.7:  Downloading via HTTPS...
2021-01-22T17:48:05 prefetch.2.10.7:  HTTPS download succeed
2021-01-22T17:48:05 prefetch.2.10.7: 1.2) Downloading 'SRR925743.vdbcache'...
2021-01-22T17:48:05 prefetch.2.10.7:  Downloading via HTTPS...
2021-01-22T17:48:07 prefetch.2.10.7:  HTTPS download succeed
2021-01-22T17:48:07 prefetch.2.10.7:  'SRR925743.vdbcache' is valid
2021-01-22T17:48:07 prefetch.2.10.7: 1.2) 'SRR925743.vdbcache' was downloaded successfully
2021-01-22T17:48:07 prefetch.2.10.7: 1) 'SRR925743' was downloaded successfully

2021-01-22T17:48:34 prefetch.2.10.7: 'SRR925743' has 93 unresolved dependencies

2021-01-22T17:48:34 prefetch.2.10.7: 2) Downloading 'ncbi-acc:CM000663.1?vdb-ctx=refseq'...
2021-01-22T17:48:34 prefetch.2.10.7:  Downloading via HTTPS...
2021-01-22T17:48:36 prefetch.2.10.7:  HTTPS download succeed
2021-01-22T17:48:36 prefetch.2.10.7: 2) 'ncbi-acc:CM000663.1?vdb-ctx=refseq' was downloaded successfully
2021-01-22T17:48:36 prefetch.2.10.7: 3) Downloading 'ncbi-acc:CM000664.1?vdb-ctx=refseq'...
2021-01-22T17:48:36 prefetch.2.10.7:  Downloading via HTTPS...
2021-01-22T17:48:38 prefetch.2.10.7:  HTTPS download succeed
2021-01-22T17:48:38 prefetch.2.10.7: 3) 'ncbi-acc:CM000664.1?vdb-ctx=refseq' was downloaded successfully
2021-01-22T17:48:38 prefetch.2.10.7: 4) Downloading 'ncbi-acc:CM000665.1?vdb-ctx=refseq'...
2021-01-22T17:48:38 prefetch.2.10.7:  Downloading via HTTPS...
2021-01-22T17:48:40 prefetch.2.10.7:  HTTPS download succeed
2021-01-22T17:48:40 prefetch.2.10.7: 4) 'ncbi-acc:CM000665.1?vdb-ctx=refseq' was downloaded successfully
2021-01-22T17:48:40 prefetch.2.10.7: 5) Downloading 'ncbi-acc:CM000666.1?vdb-ctx=refseq'...
2021-01-22T17:48:40 prefetch.2.10.7:  Downloading via HTTPS...
2021-01-22T17:48:50 prefetch.2.10.7:  HTTPS download failed
2021-01-22T17:48:50 prefetch.2.10.7: 5) failed to download ncbi-acc:CM000666.1?vdb-ctx=refseq

2.10.7+galaxy2 fails but in a different way. Probably expected.

org

Prefetch attempt 1 of 3 exited with code 1
Prefetch attempt 2 of 3 exited with code 1
Prefetch attempt 3 of 3 exited with code 1
There are 0 fastq

eu

2021-01-22T17:47:22 prefetch.2.10.7: 1) Downloading 'SRR925743'...
2021-01-22T17:47:22 prefetch.2.10.7:  Downloading via HTTPS...
2021-01-22T17:55:48 prefetch.2.10.7:  HTTPS download failed
2021-01-22T17:55:48 prefetch.2.10.7: 1) failed to download SRR925743

I'm not sure how to communicate issues with 2.10.8+galaxy0 to users -- and we had a bug report today at ORG about one of the sars-cov-2 accessions not downloading (2.10.8+galaxy0), with an error similar to the 2.10.7+galaxy2 errors.

Tool ID | toolshed.g2.bx.psu.edu/repos/iuc/sra_tools/fasterq_dump/2.10.8+galaxy0
Tool Version | 2.10.8+galaxy0
Job PID or DRM id | XXXXX
Job Tool Version | This sra toolkit installation has not been configured. Before continuing, please run: vdb-config --interactive For more information, see https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/

stderr
ls: cannot access *.fastq: No such file or directory
ls: cannot access *.fastq: No such file or directory
ls: cannot access *.fastq: No such file or directory
ls: cannot access *.fastq: No such file or directory

stdout
2021-01-22T13:19:03 fasterq-dump.2.10.8 err: connection failed while opening file within cryptographic module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra74/SRR/011393/SRR11667145'
2021-01-22T13:19:03 fasterq-dump.2.10.8 err: invalid accession 'SRR11667145'
Prefetch attempt 1 of 3 exited with code 1
2021-01-22T13:19:31 fasterq-dump.2.10.8 err: connection failed while opening file within cryptographic module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra74/SRR/011393/SRR11667145'
2021-01-22T13:19:31 fasterq-dump.2.10.8 err: invalid accession 'SRR11667145'
Prefetch attempt 2 of 3 exited with code 1
2021-01-22T13:19:43 fasterq-dump.2.10.8 err: connection failed while opening file within cryptographic module - error with https open 'https://sra-download.ncbi.nlm.nih.gov/traces/sra74/SRR/011393/SRR11667145'
2021-01-22T13:19:43 fasterq-dump.2.10.8 err: invalid accession 'SRR11667145'
Prefetch attempt 3 of 3 exited with code 1
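
As an aside, the "This sra toolkit installation has not been configured" message from sra-tools 2.10.x can usually be cleared non-interactively by writing a minimal user config instead of running vdb-config --interactive; a sketch (the /LIBS/GUID key and the use of uuidgen are assumptions based on common sra-tools practice, not something from this thread):

# hedged sketch: minimal non-interactive sra-tools configuration
mkdir -p ~/.ncbi
printf '/LIBS/GUID = "%s"\n' "$(uuidgen)" > ~/.ncbi/user-settings.mkfg
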
jennaj commented 3 years ago

Well, I reran both SRR925743 (human, one of the tool form examples) and SRR11667145 (non-human, sars-cov-2) at org + eu servers this morning. Both tools were successful using the latest versions (2.10.8+galaxy0).

Maybe NCBI SRA was having issues that are now resolved, or it may come up again whenever their service is busy. A similar case comes up with EBI SRA at times (retrieval fails with a weird connection or not-found type of error). We cannot control either -- the solution is to rerun at a later time. Closing this out.

jennaj commented 3 years ago

Resolved

Appears to be related to the prior issues. Success in the test history 1/25. The end user was instructed to rerun at GHelp.


Problem reported again, with human data, in this GHelp Q&A: https://help.galaxyproject.org/t/empty-collections-resulting-from-ncbi-sra-tool-with-accession-list-input/5247

A test using their accessions, both a single accession and a short list, is running here: https://usegalaxy.eu/u/jenj/h/srr8307998

If that fails, we can reopen this until it is actually resolved.

natefoo commented 3 years ago

@mvdbeek:

The job ran on stampede, with --partition=long --nodes=1 --account=TG-MCB140147 --ntasks=68 --time=72:00:00. That is completely over-allocated, this tool doesn't need 72 hours or 68 tasks. If it takes that long something else is wrong. I doubt you'll see any performance improvement beyond 4 cores (which are only used for pigz), and 72 hours of downloading from the SRA is crazy. A conservative estimate of 20MB/s puts this at more than 5 terabyte of data one is able to download in that span. And that's highly compressed, in reality that'd be even larger fastq.gz files.

68 cores is just the default since Stampede 2 allocates entire nodes. This can of course be decreased if that many cores slow the tool down. However, 72 hours was intentional in 38a029ed4954b2c2b1584b8df3b4e18584f47ea2, but I have no idea what the context was. It's probably somewhere on Slack or Gitter.
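
If the allocation does get revisited, a trimmed-down submission might look something like this sketch (the partition name, task count, and walltime are placeholders, and fasterq_dump_job.sh stands in for the Galaxy-generated job script; the real destination lives in this playbook's job configuration):

# hedged sketch only: a smaller Slurm request for this class of tool
sbatch --partition=normal --nodes=1 --ntasks=4 --time=24:00:00 fasterq_dump_job.sh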

natefoo commented 3 years ago

somewhere on Slack

2020-10-02

17:14 @nekrut: @natefoo -> can you increase wall time for fasterq-dump?

2020-10-04

14:31 @natefoo: @nekrut sorry for the wait, I have bumped it from 36 to 72 hours
14:31 @natefoo: A bit concerning it's taking that long though