common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0

downloads remote inputs via HTTP(S) #466

Closed mr-c closed 7 years ago

mr-c commented 7 years ago

Expected Behavior

a URI should be accepted for inputs with type: File

http://www.commonwl.org/v1.0/CommandLineTool.html#File

Actual Behavior

No such file or directory: '/home/michael/src/2017-cloud-workflows-misc/http://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam'

Workflow Code

cwltool https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl \
 --bam_input https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam

Full Traceback

/home/michael/src/2017-cloud-workflows-misc/env/bin/cwltool 1.0.20170712193248
https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl:3:1: unrecognized extension field `http://purl.org/dc/terms/creator`.  Did you include a $schemas section?
[job Dockstore.cwl] initializing from https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl
[job Dockstore.cwl] {
    "bam_input": {
        "class": "File", 
        "location": "https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam", 
        "basename": "rna.SRR948778.bam", 
        "nameroot": "rna.SRR948778", 
        "nameext": ".bam"
    }, 
    "mem_gb": 0
}
Got workflow error
Traceback (most recent call last):
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/main.py", line 270, in single_job_executor
    for r in jobiter:
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/draft2tool.py", line 323, in job
    builder.pathmapper = self.makePathMapper(reffiles, builder.stagedir, **make_path_mapper_kwargs)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/draft2tool.py", line 204, in makePathMapper
    return PathMapper(reffiles, kwargs["basedir"], stagedir)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 180, in __init__
    self.setup(dedup(referenced_files), basedir)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 228, in setup
    self.visit(fob, stagedir, basedir, copy=fob.get("writable"), staged=True)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 217, in visit
    self.visitlisting(obj.get("secondaryFiles", []), stagedir, basedir, copy=copy, staged=staged)
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/schema_salad/sourceline.py", line 152, in __exit__
    raise self.makeError(six.text_type(exc_value))
ValidationException: params.yaml:3:5: [Errno 2] No such file or directory: '/home/michael/src/2017-cloud-workflows-misc/http://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam'
Workflow error, try again with --debug for more information:
params.yaml:3:5: [Errno 2] No such file or directory:
                 '/home/michael/src/2017-cloud-workflows-misc/http://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam'
Traceback (most recent call last):
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/main.py", line 886, in main
    **vars(args))
  File "/home/michael/src/2017-cloud-workflows-misc/env/local/lib/python2.7/site-packages/cwltool/main.py", line 285, in single_job_executor
    raise WorkflowException(Text(e))
WorkflowException: params.yaml:3:5: [Errno 2] No such file or directory: '/home/michael/src/2017-cloud-workflows-misc/http://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/rna.SRR948778.bam'

tetron commented 7 years ago

Strictly speaking, this isn't a regression because it was never implemented for file inputs in the first place, only for document loading. But I agree it should fetch remote HTTP resources. The main challenge is that there are some caching issues to work out if you don't want to pull large inputs on every run.

mr-c commented 7 years ago

@tetron Thanks for the clarification. I thought it was implemented from the beginning.

denis-yuen commented 7 years ago

re: file caching

possible inspiration? https://dockstore.org/docs/advanced-features#input-file-cache

mr-c commented 7 years ago

Thank you for the pointer, @denis-yuen. Yes, we should reuse cwltool's cachedir feature here.
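
To make the caching idea concrete, here is a minimal sketch of a download cache keyed on the URL. It is not how cwltool's cachedir feature or Dockstore's input file cache is implemented; `cached_download`, its `opener` parameter, and the on-disk layout are hypothetical names chosen for illustration.

```python
import hashlib
import os
import shutil
import urllib.request


def cached_download(url, cachedir, opener=urllib.request.urlopen):
    """Return a local copy of url, downloading only on a cache miss.

    Hypothetical helper: the cache key is simply the SHA-256 of the URL.
    """
    os.makedirs(cachedir, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = os.path.join(cachedir, key)
    if os.path.exists(target):
        return target  # cache hit: no network access at all
    partial = target + ".part"
    with opener(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)  # stream in chunks, not all in memory
    os.replace(partial, target)  # publish the file only once it is complete
    return target
```

Keying on the URL alone means a changed remote file would go unnoticed. A real implementation would probably also need some freshness check, which is one of the caching issues mentioned above.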

standage commented 7 years ago

Is this related, or should I open a separate thread?

$ cwl-runner --validate https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl
/usr/local/bin/cwl-runner 1.0.20170713151519
Tool definition failed initialization:
(u'https://github.com/CancerCollaboratory/dockstore-tool-bamstats/raw/develop/Dockstore.cwl', AttributeError("'HTTPResponse' object has no attribute 'chunked'",))

mr-c commented 7 years ago

@standage Not related (and I can't reproduce with either 1.0.20170713151519 or the latest dev, 1.0.20170714133745). Can you open a separate issue with the output of pip freeze?

standage commented 7 years ago

I can't reproduce either. :-)

I'll just chalk it up to transient environment config weirdness.

kapilkd13 commented 7 years ago

Hi @mr-c, I am thinking of two ways to do this in PathMapper: https://github.com/common-workflow-language/cwltool/blob/master/cwltool/pathmapper.py#L219

  1. While creating a MapperEnt object, we can download the input over HTTP(S) into a temp file and use its path as the resolved path, creating something like path(httplink)->(temppath, targetPath).
  2. When creating the MapperEnt object, download the HTTP file content, assign it to the resolved path, and set the type to CreateFile, marking it as an input on the fly.

Is there a better way/position to do this? Personally, I like the first one, as it allows us to later implement caching over the downloaded file.

tetron commented 7 years ago

I think option 1 is the right one. CreateFile is for file literals and stores the data directly in memory, which won't work if the data is large. For comparison, arvados-cwl-runner does something similar, although for uploading local files to the server rather than downloading locally, but the principle is the same:

https://github.com/curoverse/arvados/blob/master/sdk/cwl/arvados_cwl/pathmapper.py#L136
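
As a rough illustration of option 1, the sketch below streams an HTTP(S) location to a temporary file on disk and returns that path as the resolved path, so large inputs never sit in memory. It is not cwltool's PathMapper code; `is_remote`, `download_to_temp`, `resolve_location`, and the `opener` parameter are hypothetical names used only for this example.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlsplit


def is_remote(location):
    """Return True if location is an http:// or https:// URI."""
    return urlsplit(location).scheme in ("http", "https")


def download_to_temp(url, opener=urllib.request.urlopen):
    """Stream url into a temporary file and return the file's path."""
    # Keep the extension so tools that inspect the file name still work.
    suffix = os.path.splitext(urlsplit(url).path)[1]
    with opener(url) as response, tempfile.NamedTemporaryFile(
        delete=False, suffix=suffix
    ) as out:
        shutil.copyfileobj(response, out)  # copied in chunks, not all at once
        return out.name


def resolve_location(location, basedir, opener=urllib.request.urlopen):
    """Return a local path for a File location (hypothetical helper)."""
    if is_remote(location):
        return download_to_temp(location, opener)
    # Local inputs keep the existing behaviour: resolve against basedir.
    return os.path.join(basedir, location)
```

Checking the URI scheme before joining with the base directory is what would avoid the `/home/michael/.../http://github.com/...` path in the traceback above. The temporary file is left on disk for the caller to stage and clean up.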

kapilkd13 commented 7 years ago

@mr-c Can we close this?

mr-c commented 7 years ago

Yep! To get an issue to automatically close when a PR is merged, end the Pull Request description with Closes: #NNN