common-workflow-language / schema_salad

Semantic Annotations for Linked Avro Data
https://www.commonwl.org/v1.2/SchemaSalad.html
Apache License 2.0

Unexpected Error due to Duplicate `Content-Type` Headers in `cwltool` #747

Open suecharo opened 11 months ago

suecharo commented 11 months ago

When I run the following command with the latest version of cwltool:

$ cwltool https://sandbox.zenodo.org/record/1016630/files/trimming_and_qc.cwl --help

I encounter the error below:

While fetching https://sandbox.zenodo.org/record/1016630/files/trimming_and_qc.cwl, got content-type of 'application/octet-stream, application/octet-stream'. Expected one of ['text/plain', 'application/json', 'text/vnd.yaml', 'text/yaml', 'text/x-yaml', 'application/x-yaml', 'application/octet-stream'].

I thought this might be related to the fix I provided in the past at:

https://github.com/common-workflow-language/cwltool/pull/1622

However, upon closer inspection, I noticed that the content-type is duplicated: application/octet-stream, application/octet-stream.

I fetched the actual headers using curl, and observed:

$ curl -D - https://sandbox.zenodo.org/record/1016630/files/trimming_and_qc.cwl
...
Content-Type: application/octet-stream
...
Content-Type: application/octet-stream
...

It seems there are two Content-Type lines.
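For illustration, here is a minimal sketch of how two separate `Content-Type` lines can end up as the single comma-joined value seen in the error message. It uses only the Python standard library. The raw header bytes are a made-up stand-in for the Zenodo response, and joining repeated fields with `", "` is an assumption about what the HTTP client stack does, not something confirmed in this thread.

```python
import io
from http.client import parse_headers

# Hypothetical raw response headers with Content-Type repeated,
# mimicking the curl output above.
raw = (
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Length: 1151\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"\r\n"
)

message = parse_headers(io.BytesIO(raw))

# get_all() returns every value recorded for a repeated field.
values = message.get_all("Content-Type")

# Assumed behaviour: a client that exposes one string per field
# combines the repeated values with a comma.
combined = ", ".join(values)

print(values)
print(combined)
```

The combined string no longer equals any single entry in the expected list, which would explain why an exact comparison fails.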

I suspect the code around: https://github.com/common-workflow-language/schema_salad/blob/e16612a7cf2d6cd9138aafdec20958452be3b611/schema_salad/fetcher.py#L75

might be related to this issue, but I'm not sure about the exact solution. Could you please look into this?

mr-c commented 11 months ago

Thanks for the report!

It seems that repeating HTTP header fields is valid.

I would rename content_type to received_content_types and also add a .split(",") to make it a list. https://github.com/common-workflow-language/schema_salad/blob/e16612a7cf2d6cd9138aafdec20958452be3b611/schema_salad/fetcher.py#L79

Then we can check whether there is no intersection between the two sets/lists (content_types.isdisjoint(received_content_types)) and throw the error as before if so.
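A rough sketch of that suggestion, not the actual schema_salad code: the helper names `parse_content_types` and `is_acceptable` are hypothetical, and stripping parameters such as `; charset=utf-8` is an extra assumption that was not part of the proposal.

```python
from typing import Iterable, List


def parse_content_types(header_value: str) -> List[str]:
    """Split a possibly repeated Content-Type value into bare media types."""
    media_types = []
    for part in header_value.split(","):
        # Assumption: drop parameters like "; charset=utf-8" and normalise case.
        media_type = part.split(";")[0].strip().lower()
        if media_type:
            media_types.append(media_type)
    return media_types


def is_acceptable(header_value: str, content_types: Iterable[str]) -> bool:
    """Return True if at least one received type is among the expected ones."""
    received_content_types = parse_content_types(header_value)
    return not set(content_types).isdisjoint(received_content_types)


expected = [
    "text/plain",
    "application/json",
    "text/vnd.yaml",
    "text/yaml",
    "text/x-yaml",
    "application/x-yaml",
    "application/octet-stream",
]
print(is_acceptable("application/octet-stream, application/octet-stream", expected))
print(is_acceptable("text/html", expected))
```

With this shape, the caller would raise the existing error only when `is_acceptable` returns False.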

suecharo commented 11 months ago

Thank you, Mr. @mr-c. Should I create the PR? (Tazro seems to want this fix done sooner rather than later.)

mr-c commented 11 months ago

@suecharo yes, that would be great. Thank you

suecharo commented 11 months ago

My Environment

Steps to Reproduce

Run the following command:

$ docker run -it --rm -v "$PWD":"$PWD" -w="$PWD" quay.io/commonwl/cwltool:3.1.20220628170238 https://zenodo.org/api/files/2422dda0-1bd9-4109-aa44-53d55fd934de/download-sra.cwl --help
INFO /usr/local/bin/cwltool 3.1
While fetching https://zenodo.org/api/files/2422dda0-1bd9-4109-aa44-53d55fd934de/download-sra.cwl, got content-type of 'application/octet-stream, application/octet-stream'. Expected one of ['text/plain', 'application/json', 'text/vnd.yaml', 'text/yaml', 'text/x-yaml', 'application/x-yaml', 'application/octet-stream']

Development and Testing

Setting up schema_salad in a virtual environment:

# === build and install ===
$ git clone --depth 1 https://github.com/suecharo/schema_salad && cd schema_salad
$ which python3
/usr/bin/python3
$ python3 -m venv .
$ source ./bin/activate
(schema_salad) $ which python3
/home/suecharo/git/github.com/suecharo/schema_salad/bin/python3
(schema_salad) $ readlink $(which python3)
/usr/bin/python3
(schema_salad) $ which pip
/home/suecharo/git/github.com/suecharo/schema_salad/bin/pip
(schema_salad) $ readlink $(which pip)

(schema_salad) $ pip install -e .
...
Successfully installed CacheControl-0.13.1 certifi-2023.7.22 charset-normalizer-3.3.0 filelock-3.12.4 idna-3.4 importlib-resources-6.1.0 isodate-0.6.1 mistune-2.0.5 msgpack-1.0.7 mypy-extensions-1.0.0 pyparsing-3.1.1 rdflib-7.0.0 requests-2.31.0 ruamel.yaml-0.17.33 ruamel.yaml.clib-0.2.7 schema-salad-0.1.dev1258+ge16612a six-1.16.0 urllib3-2.0.6

(schema_salad) $ pip list
Package             Version              Editable project location
------------------- -------------------- ---------------------------------------------------
CacheControl        0.13.1
certifi             2023.7.22
charset-normalizer  3.3.0
filelock            3.12.4
idna                3.4
importlib-resources 6.1.0
isodate             0.6.1
mistune             2.0.5
msgpack             1.0.7
mypy-extensions     1.0.0
pip                 22.0.2
pyparsing           3.1.1
rdflib              7.0.0
requests            2.31.0
ruamel.yaml         0.17.33
ruamel.yaml.clib    0.2.7
schema-salad        0.1.dev1258+ge16612a /home/suecharo/git/github.com/suecharo/schema_salad
setuptools          59.6.0
six                 1.16.0
urllib3             2.0.6

(schema_salad) $ ls ./bin/
activate      activate.fish  csv2rdf      normalizer  pip3     python   python3.10  rdfgraphisomorphism  rdfs2dot          schema-salad-tool
activate.csh  Activate.ps1   doesitcache  pip         pip3.10  python3  rdf2dot     rdfpipe              schema-salad-doc

Installing cwltool in the same virtual environment, using the editable schema_salad install:

# cwl-utils
(schema_salad) $ git clone --depth 1 https://github.com/common-workflow-language/cwl-utils.git
(schema_salad) $ cd cwl-utils
(schema_salad) $ vim ./requirements.txt
# Edit schema_salad version to the editable one.
(schema_salad) $ pip install -e .
...
Successfully installed cwl-upgrader-1.2.9 cwl-utils-0.29 packaging-23.2

# cwltool
(schema_salad) $ git clone --depth 1 https://github.com/common-workflow-language/cwltool.git
(schema_salad) $ cd cwltool
(schema_salad) $ vim ./setup.py
# Edit schema_salad and cwl-utils version to the editable one.
(schema_salad) $ pip install -e .
...
Successfully installed argcomplete-3.1.2 coloredlogs-15.0.1 cwltool-3.1 humanfriendly-10.0 lxml-4.9.3 networkx-3.1 prov-1.5.1 psutil-5.9.5 pydot-1.4.2 python-dateutil-2.8.2 shellescape-3.8.1

(schema_salad) $ pip list
Package             Version              Editable project location
------------------- -------------------- -------------------------------------------------------------
argcomplete         3.1.2
CacheControl        0.13.1
certifi             2023.7.22
charset-normalizer  3.3.0
coloredlogs         15.0.1
cwl-upgrader        1.2.9
cwl-utils           0.29                 /home/suecharo/git/github.com/suecharo/schema_salad/cwl-utils
cwltool             3.1                  /home/suecharo/git/github.com/suecharo/schema_salad/cwltool
filelock            3.12.4
humanfriendly       10.0
idna                3.4
importlib-resources 6.1.0
isodate             0.6.1
lxml                4.9.3
mistune             2.0.5
msgpack             1.0.7
mypy-extensions     1.0.0
networkx            3.1
packaging           23.2
pip                 22.0.2
prov                1.5.1
psutil              5.9.5
pydot               1.4.2
pyparsing           3.1.1
python-dateutil     2.8.2
rdflib              7.0.0
requests            2.31.0
ruamel.yaml         0.17.33
ruamel.yaml.clib    0.2.7
schema-salad        0.1.dev1258+ge16612a /home/suecharo/git/github.com/suecharo/schema_salad
setuptools          59.6.0
shellescape         3.8.1
six                 1.16.0
urllib3             2.0.6

Before attempting to fix the issue, I ran the following command to confirm that the issue is reproducible.

(schema_salad) $ cwltool https://sandbox.zenodo.org/record/1016630/files/trimming_and_qc.cwl --help
INFO /home/suecharo/git/github.com/suecharo/schema_salad/bin/cwltool 3.1
usage: https://sandbox.zenodo.org/record/1016630/files/trimming_and_qc.cwl [-h] --fastq_1 FASTQ_1
                                                                           --fastq_2 FASTQ_2
                                                                           [--nthreads NTHREADS]
                                                                           [job_order]

The error was not reproducible. :thinking: This made me suspect that the error is specific to the container quay.io/commonwl/cwltool:3.1.20220628170238, which is probably using an older version of schema_salad.

I added print statements to fetcher.py to investigate further:

try:
    headers = {}
    if content_types:
        headers["Accept"] = ", ".join(content_types) + ", */*;q=0.8"
    resp = self.session.get(url, headers=headers)
    resp.raise_for_status()
except Exception as e:
    raise ValidationException(f"Error fetching {url}: {e}") from e

# === added ===
print("=== resp.headers ===")
print(resp.headers)
print("=== resp.headers['content-type'] ===")
print(resp.headers["content-type"])

Then I ran the following command again:

(schema_salad) $ cwltool https://sandbox.zenodo.org/record/1016630/files/trimming_and_qc.cwl --help
INFO /home/suecharo/git/github.com/suecharo/schema_salad/bin/cwltool 3.1
=== resp.headers ===
{'Server': 'nginx', 'Date': 'Thu, 05 Oct 2023 02:33:36 GMT', 'Content-Length': '1151', 'Content-Disposition': 'attachment; filename=trimming_and_qc.cwl', 'Accept-Ranges': 'none, bytes', 'Set-Cookie': 'session=9779c6ebbc5f63d_651e2080.LCWpzVkPaLmiYEY9UkKCpqimCS8; Expires=Sun, 05-Nov-2023 02:33:36 GMT; Secure; HttpOnly; Path=/', 'OC-Checksum': 'MD5:415878c78ed8265bd7367099cf2254f7', 'Content-Security-Policy': "default-src 'none';", 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Permitted-Cross-Domain-Policies': 'none', 'X-Frame-Options': 'sameorigin', 'X-XSS-Protection': '1; mode=block', 'ETag': '"md5:415878c78ed8265bd7367099cf2254f7"', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '59', 'X-RateLimit-Reset': '1696473276', 'Retry-After': '59', 'Strict-Transport-Security': 'max-age=0', 'Referrer-Policy': 'strict-origin-when-cross-origin'}
=== resp.headers['content-type'] ===
ERROR I'm sorry, I couldn't load this CWL file, try again with --debug for more information.
The error was: 'content-type'

The output showed that the headers returned by the requests library contained no content-type entry at all, so the lookup resp.headers["content-type"] failed with a KeyError. :thinking:
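One way to keep such a lookup from crashing is to treat the header as optional. The sketch below is self-contained and uses a plain dict; `get_header` is a hypothetical helper, and how schema_salad should react to a missing header (accept, reject, or refetch) is not settled in this thread.

```python
from typing import Mapping, Optional


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Case-insensitive header lookup that returns None instead of raising KeyError."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


# Trimmed copy of the headers printed above; note there is no Content-Type key.
response_headers = {"Server": "nginx", "Content-Length": "1151"}

content_type = get_header(response_headers, "content-type")
if content_type is None:
    print("no content-type header in this response")
else:
    print(content_type)
```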

suecharo commented 11 months ago

@mr-c, in summary, the error I originally encountered will likely be resolved by updating the cwltool container. However, while debugging I noticed that in this case the response returned by the requests library has no content-type header. Just wanted to report this to you.

mr-c commented 11 months ago

@suecharo I tested your example with the latest cwltool and schema_salad dev branches, and I get the original error that you reported. Then I tried again in a clean virtualenv and I received the new error about the missing content-type header!

Looking into it, I think that when we get a cached response the content-type header is missing. Delete the ~/.cache/salad directory and try again; when I did that, I got the original error back.

suecharo commented 11 months ago

https://github.com/common-workflow-language/schema_salad/pull/754 created.