common-workflow-language / schema_salad

Semantic Annotations for Linked Avro Data
https://www.commonwl.org/v1.2/SchemaSalad.html
Apache License 2.0
72 stars 62 forks source link

In some scenarios `fetch_text` returns empty string #465

Open jmfernandez opened 2 years ago

jmfernandez commented 2 years ago

This issue is related to schema_salad.fetcher.DefaultFetcher.fetch_text, and it is a bit difficult to reproduce. I have found it this late afternoon, when I tried next cwltool commands:

cwltool --debug --validate https://raw.githubusercontent.com/inab/Wetlab2Variations/eosc-life/cwl-workflows/workflows/workflow.cwl
cwltool --debug --print-deps https://raw.githubusercontent.com/inab/Wetlab2Variations/eosc-life/cwl-workflows/workflows/workflow.cwl

The issue is difficult to reproduce as it has appeared for me only from wireless. But I was unable to reproduce it from my workstation, for instance. When it fails the output is similar to next, but the intermediate stack trace is variable:

(.37) jmfernandez@olympo[10]:/tmp/ik> cwltool --debug --validate https://raw.githubusercontent.com/inab/Wetlab2Variations/eosc-life/cwl-workflows/workflows/workflow.cwl
INFO /tmp/ik/.37/bin/cwltool 3.1.20211020155521
ERROR I'm sorry, I couldn't load this CWL file.
The error was: 
Traceback (most recent call last):
  File "/tmp/ik/.37/lib/python3.7/site-packages/cwltool/main.py", line 1067, in main
    tool = make_tool(uri, loadingContext)
  File "/tmp/ik/.37/lib/python3.7/site-packages/cwltool/load_tool.py", line 470, in make_tool
    tool = loadingContext.construct_tool_object(processobj, loadingContext)
  File "/tmp/ik/.37/lib/python3.7/site-packages/cwltool/workflow.py", line 51, in default_make_tool
    return Workflow(toolpath_object, loadingContext)
  File "/tmp/ik/.37/lib/python3.7/site-packages/cwltool/workflow.py", line 106, in __init__
    step, index, loadingContext, loadingContext.prov_obj
  File "/tmp/ik/.37/lib/python3.7/site-packages/cwltool/workflow.py", line 150, in make_workflow_step
    return WorkflowStep(toolpath_object, pos, loadingContext, parentworkflowProv)
  File "/tmp/ik/.37/lib/python3.7/site-packages/cwltool/workflow.py", line 234, in __init__
    self.embedded_tool = load_tool(toolpath_object["run"], loadingContext)
  File "/tmp/ik/.37/lib/python3.7/site-packages/cwltool/load_tool.py", line 486, in load_tool
    loadingContext, workflowobj, uri = fetch_document(argsworkflow, loadingContext)
  File "/tmp/ik/.37/lib/python3.7/site-packages/cwltool/load_tool.py", line 133, in fetch_document
    loadingContext.loader.fetch(fileuri, content_types=CWL_CONTENT_TYPES),
  File "schema_salad/ref_resolver.py", line 985, in fetch
StopIteration

Digging in cwltool and later in schema_salad I found that, sometimes next block returns what it is expected, sometimes it returns an empty string:

https://github.com/common-workflow-language/schema_salad/blob/5d45ae6cdc4b9896d6e305b4372cb8aef3e42238/schema_salad/fetcher.py#L59-L75

It seems some weird behavior in requests library, where it does not materialize the full response of the request when the property is read (but I'm guessing, as I'm not a frequent user of such library). I guess that because I did a quick'n'dirty test, substituting requests library calls by urllib.request ones (see code below), and it started working:

        if scheme in ["http", "https"] and self.session is not None:
            try:
                headers = {}
                if content_types:
                    headers["Accept"] = ", ".join(content_types) + ", */*;q=0.8"
                req = urllib.request.Request(url, headers=headers)
                #resp = self.session.get(url, headers=headers)
                #resp.raise_for_status()
                resp = urllib.request.urlopen(req)
            except Exception as e:
                raise ValidationException(f"Error fetching {url}: {e}") from e
            if content_types and "content-type" in resp.headers:
                content_type = resp.headers["content-type"].split(";")[:1][0]
                if content_type not in content_types:
                    _logger.warning(
                        f"While fetching {url}, got content-type of "
                        f"'{content_type}'. Expected one of {content_types}."
                    )
            raw_content = resp.read()
            resp.close()

            encoding = resp.headers.get_content_charset()
            return raw_content.decode('utf-8' if encoding is None else encoding)
            # return resp.text

Hope this helps to anyone else having this issue!

mr-c commented 2 years ago

@jmfernandez Thanks for the report, it explains some recent failures I'm seeing (also using raw.github links): https://github.com/common-workflow-language/cwltool/runs/4051426410?check_suite_focus=true#step:8:3833

Either a dependency of requests got updated, or there was a change in GitHub's infrastructure..