jennaj opened 5 years ago
This will always run on the local (roundup) cluster unless explicitly instructed otherwise. Is the problem only with non-roundup destinations?
By any chance does it work with different inputs? Even trying the same inputs re-uploaded would be interesting; I can try this if you know it works with other inputs. Something very strange is going on here. The file it's choking on (the input dataset with hid 3, at least in my test run of the tool; other inputs might be problematic on other runs, and it does successfully stage some inputs before this one) is 1518867 bytes:
galaxy_main=> select file_size from dataset where id = 21527052;
file_size
-----------
1518867
(1 row)
This is true on disk:
$ ls -l /galaxy-repl/main/files/021/527/dataset_21527052.dat
-rw-r--r-- 1 g2main G-803372 1518867 Oct 2 2017 /galaxy-repl/main/files/021/527/dataset_21527052.dat
And when fetched as a job input (also shows that byte-range requests are supported):
$ curl 'https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat' -v > /tmp/dataset_21527052.dat
< HTTP/1.1 200 OK
< Server: nginx/1.12.2
< Date: Wed, 24 Oct 2018 18:05:52 GMT
< Content-Type: application/json
< Content-Length: 1518867
< Last-Modified: Mon, 02 Oct 2017 20:12:01 GMT
< Connection: keep-alive
< Vary: Accept-Encoding
< cache-control: max-age=0,no-cache,no-store
< ETag: "59d29d91-172d13"
< Strict-Transport-Security: max-age=15552000; includeSubdomains
< Accept-Ranges: bytes
* Curl_http_done: called premature == 0
* Connection #0 to host galaxy-web-04.galaxyproject.org left intact
$ ls -l /tmp/dataset_21527052.dat
-rw-r--r-- 1 nate nate 1518867 Oct 24 14:05 /tmp/dataset_21527052.dat
Pulsar fetches it to the end but for some reason thinks it failed and tries to resume:
2018-10-24 17:54:51,952 DEBUG [pulsar.managers.staging.pre][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] Staging input 'dataset_21527052.dat' via FileAction[url=https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input path=/galaxy-repl/main/files/021/527/dataset_21527052.dat action_type=remote_transfer] to /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat
2018-10-24 17:54:51,952 INFO [pulsar.client.transport.curl][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] transfer of https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input will resume at 1518867 bytes
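That resume offset is the smoking gun: the local copy is already the full 1518867 bytes, so resuming sends `Range: bytes=1518867-`, a range starting past the last byte of the resource, which per RFC 7233 the server must answer with 416. A minimal sketch of that arithmetic (hypothetical helper, not Pulsar's actual code):

```python
# Hypothetical sketch (not Pulsar's transport code): decide what Range
# header a resume should send. Per RFC 7233, requesting "bytes=N-" on an
# N-byte resource is unsatisfiable and yields 416 -- exactly the case
# here, where the local copy already matches the remote size.

def resume_range_header(local_size, remote_size):
    """Return the Range header for resuming, or None if nothing is left."""
    if local_size >= remote_size:
        return None  # already complete; a ranged request would get a 416
    return "bytes=%d-" % local_size

print(resume_range_header(0, 1518867))        # bytes=0-
print(resume_range_header(1000, 1518867))     # bytes=1000-
print(resume_range_header(1518867, 1518867))  # None
```

The missing check is the `local_size >= remote_size` branch: a transfer that already has all the bytes should be treated as complete, not resumed.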
It does fetch a little more:
2018-10-24 17:54:52,168 INFO [pulsar.managers.util.retry][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] Failed to execute action[Staging input 'dataset_21527052.dat' via FileAction[url=https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input path=/galaxy-repl/main/files/021/527/dataset_21527052.dat action_type=remote_transfer] to /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat], retrying in 2.0 seconds.
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib/python2.7/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib/python2.7/site-packages/pulsar/managers/staging/pre.py", line 17, in <lambda>
    action_executor.execute(lambda: action.write_to_path(path), "action[%s]" % description)
  File "/srv/pulsar/main/venv/lib/python2.7/site-packages/pulsar/client/action_mapper.py", line 436, in write_to_path
    get_file(self.url, path)
  File "/srv/pulsar/main/venv/lib/python2.7/site-packages/pulsar/client/transport/curl.py", line 97, in get_file
    raise Exception(message)
Exception: Failed to get_file properly for url https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input, remote server returned status code of 416.
2018-10-24 17:54:54,282 INFO [pulsar.client.transport.curl][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] transfer of https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input will resume at 1519080 bytes
2018-10-24 17:54:54,587 INFO [pulsar.managers.util.retry][[manager=jetstream_iu]-[action=preprocess]-[job=21099750]] Failed to execute action[Staging input 'dataset_21527052.dat' via FileAction[url=https://galaxy-web-04.galaxyproject.org/_job_files? ... &path=/galaxy-repl/main/files/021/527/dataset_21527052.dat&file_type=input path=/galaxy-repl/main/files/021/527/dataset_21527052.dat action_type=remote_transfer] to /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat], retrying in 4.0 seconds.
...
It retries until it times out, but the size stays at 1519080, which is also true on disk:
$ ls -l /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat
-rw-------. 1 g2main G-803372 1519080 Oct 24 17:54 /jetstream/scratch0/main/jobs/21099750/inputs/dataset_21527052.dat
The extra bytes at the end of the file are this nginx error page:
<html>
<head><title>416 Requested Range Not Satisfiable</title></head>
<body bgcolor="white">
<center><h1>416 Requested Range Not Satisfiable</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>
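The arithmetic checks out: 1519080 - 1518867 = 213 bytes, which is exactly the size of the error page quoted above if its lines end with CRLF (an assumption here, though nginx's built-in error pages do use `\r\n`). A quick check:

```python
# Reconstruct the nginx error page quoted above, assuming CRLF line
# endings, and compare its size against the 213 extra bytes observed
# on disk (1519080 - 1518867).
page_lines = [
    "<html>",
    "<head><title>416 Requested Range Not Satisfiable</title></head>",
    '<body bgcolor="white">',
    "<center><h1>416 Requested Range Not Satisfiable</h1></center>",
    "<hr><center>nginx/1.12.2</center>",
    "</body>",
    "</html>",
]
page = "".join(line + "\r\n" for line in page_lines)
print(len(page))          # 213
print(1519080 - 1518867)  # 213
```

So every failed retry appended nothing new; the one 213-byte growth is the single error body that got written to the destination file.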
@jmchilton any idea what's going on here? If nothing else we should probably have staging not write to the file if the response is not in the 200 range.
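The suggested guard could be as simple as classifying the response status before touching the destination file (a hypothetical sketch, not Pulsar's actual transport API):

```python
# Sketch of the guard suggested above: decide what to do with a staging
# response *before* writing anything to the destination file.
# Hypothetical helper, not Pulsar's actual code.

def classify_response(status, resuming):
    """Return 'write', 'done', or 'error' for a staging HTTP response."""
    if 200 <= status < 300:
        return "write"   # full body (200) or partial content (206)
    if status == 416 and resuming:
        return "done"    # range past EOF: the local copy is already complete
    return "error"       # anything else: retry, but never write the body

print(classify_response(200, resuming=False))  # write
print(classify_response(206, resuming=True))   # write
print(classify_response(416, resuming=True))   # done
print(classify_response(500, resuming=False))  # error
```

With this in place a spurious resume would terminate cleanly as "done" instead of appending the server's error page and then looping on an ever-unsatisfiable range.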
I can reproduce the problem with both tutorial data and with a user's full-size data. I can share that internally to keep it private, lmk.
I am going to run the tutorial data at EU and see what happens there. I'll also download/upload the tutorial inputs and rerun against all clusters (including roundup this time).
Not quite sure of the scope -- whether it's really just Main, a certain cluster, the tool wrapper, or Galaxy itself.
If you're seeing the same behavior with completely unrelated datasets, I wouldn't worry about spending the time to reupload and retest. As long as it works whenever the default job resources are used, this isn't a showstopper, right? I can remove the resource selector from this tool until we sort it out (although I can't imagine how it's tool-specific).
@natefoo I think that would be a good idea (remove the resource selector). If the jobs only succeed on roundup (true for test data and user data), we can limit the execution to that cluster and figure out the rest later (if needed). I'd like to leave this ticket open but track that it is a lower priority so we don't lose it. Can track in project-1 (this repo) in a distinct list.
@natefoo following up: the resource selector is still on the tool. Should I retest on the other clusters? Might work now; I still have the test history. Lmk
Yeah, might as well try a retest.
Ok, queued to run. Will check back and close this out if successful.
Looks to still be failing at jetstream/stampede. Roundup still queued, expect it to work there. Upgraded to the latest version of the tool just to make sure that wasn't a factor. Guessing we want to strip out the job resource selector?
A comment from @jennaj about the same error with another tool where the user had erroneously selected the same input twice led me to what I suspect is the answer here: If a job has any duplicated inputs, Pulsar tries to stage them twice and fails. I checked a few of the jobs in Jen's test history and they use the same inputs twice (correctly I assume).
@jmchilton any chance I can get you to take a look? I can get to it in a bit if you are too busy to look at it though.
I'll resolve this this week; I'll self-assign it as my weekly Main bug.
Retested, same history, same failures. Fix not on main yet.
Failure is always
(33, "HTTP server doesn't seem to support byte ranges. Cannot resume.")
paired with: Remote job server indicated a problem running or monitoring this job.
Test history that contains prior successful runs (the other, earlier errors in the history were tests of how the wrapper worked; none are like this error, and the usage issue behind those is (was!) known; uses tutorial data): https://usegalaxy.org:/u/jen/h/differential-exon-usage-tutorial-starting-data-plus-dexseq
Galaxy Tool ID: toolshed.g2.bx.psu.edu/repos/iuc/dexseq/dexseq/1.24.0.0
@natefoo @davebx This is new but has come up in the past in a few bug reports (across tools). Seems like it's always a server-side issue.