galaxyproject / usegalaxy-playbook

Ansible Playbook for usegalaxy.org
Academic Free License v3.0

DexSeq (both versions at org) fail with: (33, "HTTP server doesn't seem to support byte ranges. Cannot resume.") #171

Open jennaj opened 5 years ago

jennaj commented 5 years ago

Failure is always (33, "HTTP server doesn't seem to support byte ranges. Cannot resume."), paired with "Remote job server indicated a problem running or monitoring this job." (Error code 33 is libcurl's CURLE_RANGE_ERROR.)

Test history with prior successful runs (uses tutorial data). The other, earlier errors in that history were tests of how the wrapper worked; none of them look like this error, and the usage issue that caused them is (was!) known. https://usegalaxy.org:/u/jen/h/differential-exon-usage-tutorial-starting-data-plus-dexseq

  1. Data 30-31: Original success, v 1.20.1, roundup
  2. Data 32-33: Rerun of 1, fails, stampede (roundup not available right now)
  3. Data 34-35: Rerun of 1, fails, jetstream
  4. Data 36-37: Rerun of 1, upgrade to v 1.24.0.0, fails, jetstream
  5. Data 38: Rerun of 4 with "visualize" set to "no" (no HTML output), fails, jetstream

Galaxy Tool ID: toolshed.g2.bx.psu.edu/repos/iuc/dexseq/dexseq/1.24.0.0

@natefoo @davebx This is new, but it has come up in the past in a few bug reports (across tools). It always seems to be a server-side issue.

natefoo commented 5 years ago

This will always run on the local (roundup) cluster unless explicitly instructed otherwise. Is the problem only with non-roundup destinations?

By any chance does it work with different inputs? Even trying the same inputs re-uploaded would be interesting -- I can try this if you know it works with other inputs. Something very strange is going on here. The file it's choking on, at least in my test run of the tool (the input dataset with hid 3; other inputs might be the problem on other runs, though it does successfully stage some inputs before this one), is 1518867 bytes:

galaxy_main=> select file_size from dataset where id = 21527052;
 file_size 
-----------
   1518867
(1 row)

This is true on disk:

$ ls -l /galaxy-repl/main/files/021/527/dataset_21527052.dat
-rw-r--r-- 1 g2main G-803372 1518867 Oct  2  2017 /galaxy-repl/main/files/021/527/dataset_21527052.dat

And when fetched as a job input (also shows that byte-range requests are supported):

$ curl 'https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat' -v > /tmp/dataset_21527052.dat
< HTTP/1.1 200 OK
< Server: nginx/1.12.2
< Date: Wed, 24 Oct 2018 18:05:52 GMT
< Content-Type: application/json
< Content-Length: 1518867
< Last-Modified: Mon, 02 Oct 2017 20:12:01 GMT
< Connection: keep-alive
< Vary: Accept-Encoding
< cache-control: max-age=0,no-cache,no-store
< ETag: "59d29d91-172d13"
< Strict-Transport-Security: max-age=15552000; includeSubdomains
< Accept-Ranges: bytes
* Curl_http_done: called premature == 0
* Connection #0 to host galaxy-web-04.galaxyproject.org left intact
$ ls -l /tmp/dataset_21527052.dat
-rw-r--r-- 1 nate nate 1518867 Oct 24 14:05 /tmp/dataset_21527052.dat
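
For good measure, the range behavior can also be exercised explicitly. A quick Python/pycurl sketch (the _job_files query parameters are elided in the URL below, just as in the curl command above, so they would need to be filled in before running):

import io
import pycurl

# Placeholder URL: the real _job_files query string (elided above) must be filled in.
url = ("https://galaxy-web-04.galaxyproject.org/_job_files"
       "?...&path=/galaxy-repl/main/files/021/527/dataset_21527052.dat")

buf = io.BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, ["Range: bytes=0-1023"])  # request only the first KiB
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.perform()
# 206 means the range was honored; 200 means the server ignored it and sent the whole file.
print(c.getinfo(pycurl.RESPONSE_CODE), len(buf.getvalue()))
c.close()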

Pulsar fetches it to the end but for some reason thinks it failed and tries to resume:

2018-10-24 17:54:51,952 DEBUG [pulsar.managers.staging.pre][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] Staging input 'dataset_21527052.dat' via FileAction[url=https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input path=/galaxy-repl/main/files/021/527/dataset_21527052.dataction_type=remote_transfer] to /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat
2018-10-24 17:54:51,952 INFO  [pulsar.client.transport.curl][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] transfer of https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input will resume at 1518867 bytes
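
For context, resuming at an offset equal to the full file size amounts to sending "Range: bytes=1518867-", which is unsatisfiable for a 1518867-byte resource -- hence the 416 that follows. A minimal pycurl sketch of just that request (placeholder URL and output path, not the actual Pulsar staging code):

import pycurl

# Placeholder URL: the real _job_files query string is elided above.
url = ("https://galaxy-web-04.galaxyproject.org/_job_files"
       "?...&path=/galaxy-repl/main/files/021/527/dataset_21527052.dat")
offset = 1518867  # size of the already-complete local copy

c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.RESUME_FROM, offset)       # libcurl sends "Range: bytes=1518867-"
with open("/tmp/dataset_21527052.dat", "ab") as fh:
    c.setopt(pycurl.WRITEDATA, fh)         # the 416 error body gets written here too
    c.perform()
print(c.getinfo(pycurl.RESPONSE_CODE))     # 416 when the local file is already complete
c.close()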

It does fetch a little more:

2018-10-24 17:54:52,168 INFO  [pulsar.managers.util.retry][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] Failed to execute action[Staging input 'dataset_21527052.dat' via FileAction[url=https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input path=/galaxy-repl/main/files/021/527/dataset_21527052.dataction_type=remote_transfer] to /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat], retrying in 2.0 seconds.
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib/python2.7/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib/python2.7/site-packages/pulsar/managers/staging/pre.py", line 17, in <lambda>
    action_executor.execute(lambda: action.write_to_path(path), "action[%s]" % description)
  File "/srv/pulsar/main/venv/lib/python2.7/site-packages/pulsar/client/action_mapper.py", line 436, in write_to_path
    get_file(self.url, path)
  File "/srv/pulsar/main/venv/lib/python2.7/site-packages/pulsar/client/transport/curl.py", line 97, in get_file
    raise Exception(message)
Exception: Failed to get_file properly for url https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input, remote server returned status code of 416.
2018-10-24 17:54:54,282 INFO  [pulsar.client.transport.curl][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] transfer of https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input will resume at 1519080 bytes
2018-10-24 17:54:54,587 INFO  [pulsar.managers.util.retry][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] Failed to execute action[Staging input 'dataset_21527052.dat' via FileAction[url=https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input path=/galaxy-repl/main/files/021/527/dataset_21527052.dataction_type=remote_transfer] to /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat], retrying in 4.0 seconds.
...

It retries until it times out but the size stays at 1519080, which is also true on disk:

$ ls -l /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat
-rw-------. 1 g2main G-803372 1519080 Oct 24 17:54 /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat

The extra 213 bytes (1519080 - 1518867) at the end of the file are this output from nginx:

<html>
<head><title>416 Requested Range Not Satisfiable</title></head>
<body bgcolor="white">
<center><h1>416 Requested Range Not Satisfiable</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>

@jmchilton any idea what's going on here? If nothing else we should probably have staging not write to the file if the response is not in the 200 range.
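
For what it's worth, a minimal sketch of that kind of guard, assuming the staging transport keeps using pycurl (illustrative only, with hypothetical names, not the actual Pulsar code): with FAILONERROR set, libcurl treats HTTP >= 400 as a transfer failure and does not hand the error body to the write target, so a 416 page can no longer be appended to an otherwise complete input file.

import pycurl

def fetch_to_path(url, path, offset=0):
    # Illustrative download helper: refuse to write HTTP error bodies to the input file.
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.FAILONERROR, True)     # perform() raises pycurl.error on HTTP >= 400
    if offset:
        c.setopt(pycurl.RESUME_FROM, offset)
    try:
        with open(path, "ab") as fh:
            c.setopt(pycurl.WRITEDATA, fh)
            c.perform()
        status = c.getinfo(pycurl.RESPONSE_CODE)
        if not 200 <= status < 300:        # extra check; FAILONERROR can miss edge cases
            raise Exception("transfer of %s returned status code %s" % (url, status))
    finally:
        c.close()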

jennaj commented 5 years ago

I can reproduce the problem with both the tutorial data and a user's full-size data. I can share the latter internally to keep it private -- let me know.

I am going to run the tutorial data at EU and see what happens there. I'll also download/upload the tutorial inputs and rerun against all clusters (including roundup this time).

Not quite sure of the scope yet -- whether it's really just Main, a particular cluster, the wrapper, or Galaxy itself.

natefoo commented 5 years ago

If you're seeing the same behavior with completely unrelated datasets, I wouldn't worry about spending the time to re-upload and retest. As long as it works whenever the default job resources are used, this isn't a showstopper, right? I can remove the resource selector from this tool until we sort it out (although I can't imagine how it could be tool-specific).

jennaj commented 5 years ago

@natefoo I think that would be a good idea (remove the resource selector). If the jobs only succeed on roundup (true for both the test data and the user data), we can limit execution to that cluster and figure out the rest later (if needed). I'd like to leave this ticket open but mark it as lower priority so we don't lose it. It can be tracked in project-1 (this repo) in a distinct list.

jennaj commented 5 years ago

@natefoo following up: the resource selector is still on the tool. Should I retest on the other clusters? It might work now; I still have the test history. Let me know.

natefoo commented 5 years ago

Yeah, might as well try a retest.

jennaj commented 5 years ago

Ok, queued to run. Will check back and close this out if successful.

jennaj commented 5 years ago

Looks like it's still failing on jetstream/stampede. The roundup run is still queued; I expect it to work there. I upgraded to the latest version of the tool just to make sure that wasn't a factor. I'm guessing we want to strip out the job resource selector?

(Screenshot attached: Screen Shot 2019-04-22 at 2.14.06 PM)

natefoo commented 5 years ago

A comment from @jennaj about the same error with another tool, where the user had erroneously selected the same input twice, led me to what I suspect is the answer here: if a job has any duplicated inputs, Pulsar tries to stage them twice and fails. I checked a few of the jobs in Jen's test history and they use the same inputs twice (correctly, I assume).
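
If that's right, the failure doesn't depend on DEXSeq at all: any resume-style download run twice against the same destination path reproduces it, because the second pass finds the file already complete, asks to resume at end-of-file, and gets the 416 seen above. A rough sketch under that assumption (hypothetical helper, not the actual Pulsar staging code):

import os
import pycurl

def stage_input(url, dest):
    # Hypothetical resume-style staging, loosely modeled on a curl-based transport.
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    if offset:
        # A second pass for a duplicated input lands here with offset == full file size;
        # the server answers 416 and, without a status guard, the error body is appended
        # to the already-complete file (1518867 -> 1519080 bytes above).
        c.setopt(pycurl.RESUME_FROM, offset)
    with open(dest, "ab") as fh:
        c.setopt(pycurl.WRITEDATA, fh)
        c.perform()
    status = c.getinfo(pycurl.RESPONSE_CODE)
    c.close()
    return status

# A job that selects the same dataset twice stages the same path twice:
# stage_input(url, "inputs/dataset_21527052.dat")  # first pass: 200, full file on disk
# stage_input(url, "inputs/dataset_21527052.dat")  # second pass: resume at EOF -> 416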

@jmchilton any chance I can get you to take a look? I can get to it in a bit if you are too busy to look at it though.

jmchilton commented 5 years ago

I'll resolve this this week; I'll self-assign it as my weekly main bug.

jmchilton commented 5 years ago

xref https://github.com/galaxyproject/pulsar/pull/175

jennaj commented 5 years ago

Retested, same history, same failures. Fix not on main yet.