dhimmel closed this issue 2 years ago
Plain text is also being returned by Python requests:
>>> import requests
>>> response = requests.get("https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2021/mesh2021.nt.gz")
>>> response.content[:30]
b'<http://id.nlm.nih.gov/mesh/20'
>>> response.headers["Content-Encoding"]
'gzip'
I think the Content-Encoding header specifying gzip is causing requests and other libraries to decompress the response content, such that the user receives plain text.
From the requests docs:
The gzip and deflate transfer-encodings are automatically decoded for you.
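To make the behaviour in that quote concrete, here is a minimal sketch of what transparent decoding amounts to. It uses only the standard library and is not the actual requests or urllib3 implementation; the helper name `decode_body` and the sample triple are made up for illustration.

```python
import gzip
import zlib


def decode_body(headers: dict, body: bytes) -> bytes:
    """Undo a gzip/deflate Content-Encoding (hypothetical helper, not requests' code)."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # Assumes the zlib-wrapped form of deflate.
        return zlib.decompress(body)
    return body


# The server stores a .gz file and ALSO labels the response as gzip-encoded.
sample_text = b"<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
stored_file = gzip.compress(sample_text)
headers = {"Content-Encoding": "gzip"}

decoded = decode_body(headers, stored_file)
print(stored_file[:2])  # gzip magic number: what is stored on the server
print(decoded[:30])     # plain text: what the client ends up seeing
```

Because the header describes the stored file's own compression as a transfer detail, a client that honours it hands back the uncompressed text even though the URL ends in .gz.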
When accessing a larger file online through requests, consider using the "stream=True" argument as described at https://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow.
This also mentions the difference between response.content and response.raw:
Alternatively, you can read the undecoded body from the underlying urllib3 urllib3.HTTPResponse at Response.raw.
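A rough sketch of that approach, assuming the server sends the stored .gz bytes unchanged and only labels them with Content-Encoding: gzip. The helper names `save_raw_body` and `download_raw` and the 60-second timeout are assumptions for illustration. As far as I know, requests leaves decoding switched off for reads from Response.raw unless decode_content is set, so the copied bytes should still be gzip.

```python
import shutil
from pathlib import Path


def save_raw_body(response, dest: Path) -> Path:
    """Copy the undecoded body from response.raw to dest (hypothetical helper)."""
    with open(dest, "wb") as fh:
        # response.raw is file-like; copyfileobj reads it in chunks.
        shutil.copyfileobj(response.raw, fh)
    return dest


def download_raw(url: str, dest: Path) -> Path:
    """Stream url to dest without undoing Content-Encoding (hypothetical helper)."""
    import requests  # third-party; imported lazily so save_raw_body has no hard dependency

    # stream=True defers reading the body, so it never has to fit in memory.
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        return save_raw_body(response, dest)
```

Calling `download_raw(url, Path("mesh2021.nt.gz"))` should then leave a file on disk that `gzip.open` can read, in contrast to `response.content`, which would already be plain text.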
When accessing a larger file online through requests, consider using the "stream=True" argument
Thanks @danizen for the tip! I ended up trying requests.get with stream=True and response.raw.decode_content = True, but ended up hitting an error on CI.
The thing is that Python's rdflib can take 30 minutes to load mesh2021.nt.gz, which is a long time to keep the connection open. Anyways, I thought it just made sense to download the file locally to a temporary directory and then read it in from there to avoid the long-lived connection. Changed in https://github.com/related-sciences/nxontology-data/pull/1/commits/01c8fea202686802f72a3f3df8367e95e6f5b97c.
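The download-then-read idea can be sketched as follows. This is not necessarily what the linked commit does: it swaps in urllib.request from the standard library instead of requests, and the function names and timeout are assumptions. `count_lines` is only a stand-in for a slow parser such as rdflib.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download_to_dir(url: str, directory: str) -> Path:
    """Download url into directory and return the local path (hypothetical helper)."""
    dest = Path(directory) / url.rsplit("/", 1)[-1]
    # urllib.request does not undo Content-Encoding, so bytes are written as sent.
    with urllib.request.urlopen(url, timeout=60) as response, open(dest, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return dest


def count_lines(path: Path) -> int:
    """Stand-in for a slow parser: read the local gzip file line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return sum(1 for _ in fh)
```

Used inside a `tempfile.TemporaryDirectory()` block, the connection is held open only for the transfer itself, and the slow parsing step reads from local disk afterwards.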
Makes sense. Python is very efficient for writing code quickly, and then running it slowly :). That said, I write a lot of it.
The *.nt.gz RDF downloads served from HTTPS URLs (rather than FTP) seem to selectively return gzip compressed data and plain text.
Example URL at https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2021/mesh2021.nt.gz, but this seems to affect the .nt.gz files for all years. When I open mesh2021.nt.gz in my browser or with Python's fsspec.open, I'm getting plain text.
Here's the docs for curl --compressed:
Opening this mostly as an informational issue in case anyone else hits this. Will follow up with a solution.
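For anyone who ends up with a local file that may be either gzip or plain text depending on the client used, one possible workaround is to check the first two bytes for the gzip magic number before opening it. This is only a sketch; `open_maybe_gzip` is a hypothetical helper and it assumes UTF-8 text.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open path as text whether or not its bytes are really gzip-compressed."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```

The file extension is ignored on purpose, since a file named mesh2021.nt.gz may already have been decompressed by the client that fetched it.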