HHS / meshrdf

Code and documentation for the release of MeSH in RDF format
https://hhs.github.io/meshrdf/

RDF mesh2021.nt.gz HTTPS download compression #193

Closed dhimmel closed 2 years ago

dhimmel commented 2 years ago

The *.nt.gz RDF downloads served from HTTPS URLs (rather than FTP) seem to return either gzip-compressed data or plain text, depending on how the client requests them.

Example URL: https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2021/mesh2021.nt.gz, but this seems to affect the .nt.gz files for all years. When I open mesh2021.nt.gz in my browser or with Python's fsspec.open, I get plain text.

# this command returns gzipped data
$ curl --silent https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2021/mesh2021.nt.gz | head --bytes=50
��amesh2021.nt���亶����)�kq%O �ѡ����

# this command returns plain text data
$ curl --silent 'https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2021/mesh2021.nt.gz' \
  --compressed | head --bytes=50
<http://id.nlm.nih.gov/mesh/2021/A01.111> <http://

Here are the docs for curl --compressed:

(HTTP) Request a compressed response using one of the algorithms libcurl supports, and save the uncompressed document. If this option is used and the server sends an unsupported encoding, curl will report an error.

Opening this mostly as an informational issue in case anyone else hits this. Will follow up with a solution.

dhimmel commented 2 years ago

Plain text is also being returned by Python requests:

>>> import requests
>>> response = requests.get("https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2021/mesh2021.nt.gz")
>>> response.content[:30]
b'<http://id.nlm.nih.gov/mesh/20'
>>> response.headers["Content-Encoding"]
'gzip'

I think the Content-Encoding header specifying gzip is causing requests and other libraries to decompress the response content, such that the user receives plain text.

From the requests docs:

The gzip and deflate transfer-encodings are automatically decoded for you.
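Since the same URL can yield either gzip bytes or already-decoded text depending on the client, one defensive option is to check the payload itself rather than trust the .gz extension. A minimal sketch, assuming the whole payload is already in memory as bytes; ensure_decompressed is a hypothetical helper name, not part of requests or fsspec:

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def ensure_decompressed(data: bytes) -> bytes:
    """Return plain bytes whether or not the input is gzip-compressed.

    Hypothetical helper: it sniffs the gzip magic number instead of
    relying on the file extension or the Content-Encoding header.
    """
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    # The client (e.g. requests) has already decoded the body.
    return data
```

This only suits payloads small enough to hold in memory; for a file the size of mesh2021.nt.gz a streaming approach is likely the better fit.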

danizen commented 2 years ago

When accessing a larger file online through requests, consider using the "stream=True" argument as described at https://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow.

This also mentions the difference between response.content and response.raw:

Alternatively, you can read the undecoded body from the underlying urllib3 urllib3.HTTPResponse at Response.raw.
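A rough sketch of that approach: stream the response and copy Response.raw to disk so the bytes stay undecoded. It assumes the server sends the gzip file unchanged as the body, as the first curl command above suggests. save_raw_body and download_gz are hypothetical helper names, and the 60-second timeout is an arbitrary choice.

```python
import shutil


def save_raw_body(response, dest_path):
    """Copy the undecoded body of a streamed requests.Response to dest_path.

    Reading from response.raw bypasses the automatic gzip decoding that
    response.content applies, so the bytes on disk match what was sent.
    """
    with open(dest_path, "wb") as fh:
        shutil.copyfileobj(response.raw, fh)


def download_gz(url, dest_path):
    """Hypothetical wrapper: stream url to dest_path without decoding it."""
    import requests  # third-party; imported here so save_raw_body needs only the stdlib

    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        save_raw_body(response, dest_path)
```

With stream=True the body is fetched as it is read, so the whole file never has to sit in memory at once.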

dhimmel commented 2 years ago

When accessing a larger file online through requests, consider using the "stream=True" argument

Thanks @danizen for the tip! I tried requests.get with stream=True and response.raw.decode_content = True, but hit an error on CI.

The thing is that Python's rdflib can take 30 minutes to load mesh2021.nt.gz, which is a long time to keep the connection open. Anyway, it made more sense to download the file to a temporary directory and then read it from there, avoiding the long-lived connection. Changed in https://github.com/related-sciences/nxontology-data/pull/1/commits/01c8fea202686802f72a3f3df8367e95e6f5b97c.
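The download-then-read pattern might look roughly like the following. This is a standard-library sketch, not the code from the linked commit. It uses urllib.request, which does not decode Content-Encoding on its own, and it assumes the bytes written to disk are gzip-compressed. count_lines is a hypothetical name, and line counting stands in for the slow rdflib parse.

```python
import gzip
import shutil
import tempfile
import urllib.request
from pathlib import Path


def count_lines(url):
    """Download url to a temporary directory, then read it from disk.

    Hypothetical stand-in for a slow parse: the network connection is
    closed before any processing of the file begins.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = Path(tmpdir) / "download.nt.gz"
        # Step 1: fetch the raw bytes; urllib.request leaves them undecoded.
        with urllib.request.urlopen(url) as response, open(local_path, "wb") as fh:
            shutil.copyfileobj(response, fh)
        # Step 2: read the local copy at whatever pace the parser needs.
        with gzip.open(local_path, "rt", encoding="utf-8") as lines:
            return sum(1 for line in lines if line.strip())
```

The temporary directory is removed when the function returns, so nothing is left behind on a CI runner.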

danizen commented 2 years ago

Makes sense. Python is very efficient for writing code quickly, and then running it slowly :). That said, I write a lot of it.