benbalter / sitemap-parser

Ruby Gem to parse sitemaps.org compliant sitemaps
MIT License

Support gzipped sitemap.xml #13

Closed okkez closed 4 years ago

benbalter commented 8 years ago

@okkez can this be accomplished natively within typhoeus?

okkez commented 8 years ago

We cannot control the response using the Accept-Encoding header, because some servers always return a compressed response. So we need to detect whether the response is compressed in order to handle it properly, and we cannot know whether it is compressed before fetching it.

benbalter commented 8 years ago

because some servers always return a compressed response.

Does that conform to the HTTP spec? I don't think we should be optimizing the gem for non-conformant servers.

So we need to detect whether the response is compressed in order to handle it properly

Have you tried using the Typhoeus flag on a server that always sends the gzip'd response? I'd be surprised if it didn't handle both cases.

okkez commented 8 years ago

Does that conform to the HTTP spec?

Yes. For example:

$ curl -L --head https://gunosy.com/sitemaps/sitemap.xml.gz
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 58629
Content-Type: application/octet-stream
Date: Thu, 06 Oct 2016 02:53:01 GMT
ETag: "57f578b9-e505"
Last-Modified: Wed, 05 Oct 2016 22:03:37 GMT
Server: nginx/1.8.1
Connection: keep-alive

Typhoeus does not inflate a response automatically when the Content-Encoding header is missing. Likewise, libcurl (and the curl command) does not decompress a response automatically without a Content-Encoding header.

So we need to handle compressed responses by hand.
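A minimal sketch of that manual handling, assuming we detect gzip by its two magic bytes (0x1f 0x8b) rather than trusting the headers (the helper name is hypothetical, not part of the gem):

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: detect a gzipped body by its magic bytes
# (0x1f 0x8b) and inflate it by hand, since neither Typhoeus nor
# libcurl will do so when the Content-Encoding header is missing.
def maybe_gunzip(body)
  if body.bytes.first(2) == [0x1f, 0x8b]
    Zlib::GzipReader.new(StringIO.new(body)).read
  else
    body
  end
end
```

Plain XML passes through untouched, so the same code path works for both kinds of servers.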

BTW, Google supports compressed responses with a Content-Encoding header:

$ curl -L --head --compressed https://www.google.com/sitemap.xml
HTTP/2 200 
vary: Accept-Encoding
content-encoding: gzip
content-type: text/xml
date: Thu, 06 Oct 2016 03:07:36 GMT
expires: Thu, 06 Oct 2016 03:07:36 GMT
cache-control: private, max-age=0
last-modified: Thu, 22 Sep 2016 17:32:46 GMT
x-content-type-options: nosniff
server: sffe
x-xss-protection: 1; mode=block
alt-svc: quic=":443"; ma=2592000; v="36,35,34,33,32"

$ curl -L --head  https://www.google.com/sitemap.xml
HTTP/2 200 
vary: Accept-Encoding
content-type: text/xml
date: Thu, 06 Oct 2016 03:07:49 GMT
expires: Thu, 06 Oct 2016 03:07:49 GMT
cache-control: private, max-age=0
last-modified: Thu, 22 Sep 2016 17:32:46 GMT
x-content-type-options: nosniff
server: sffe
x-xss-protection: 1; mode=block
alt-svc: quic=":443"; ma=2592000; v="36,35,34,33,32"
accept-ranges: none

qiita.com (an engineering tips site) always returns a compressed response, but with a Content-Encoding header:

$ curl --compressed --head -L http://qiita.com/sitemap/sitemap.xml.gz
HTTP/1.1 301 Moved Permanently
Content-Length: 178
Content-Type: text/html
Date: Thu, 06 Oct 2016 03:11:00 GMT
Location: https://qiita-cloud.s3-ap-northeast-1.amazonaws.com/sitemap/sitemap.xml.gz
Server: nginx
Connection: keep-alive

HTTP/1.1 200 OK
x-amz-id-2: ASC436iNLLKN/SAC72WvsWrWJt5cqy0Nbwt91IhVz377sO9xlv/9H7/yVn+Ycre/oGDdaqcz8pc=
x-amz-request-id: 899FFC9641592C20
Date: Thu, 06 Oct 2016 03:11:01 GMT
Last-Modified: Thu, 06 Oct 2016 00:38:42 GMT
ETag: "c33e7afc2b92b9adb3f28999158dd728"
Content-Encoding: gzip
Accept-Ranges: bytes
Content-Type: text/xml
Content-Length: 312
Server: AmazonS3

$ curl --head -L http://qiita.com/sitemap/sitemap.xml.gz
HTTP/1.1 301 Moved Permanently
Content-Length: 178
Content-Type: text/html
Date: Thu, 06 Oct 2016 03:11:05 GMT
Location: https://qiita-cloud.s3-ap-northeast-1.amazonaws.com/sitemap/sitemap.xml.gz
Server: nginx
Connection: keep-alive

HTTP/1.1 200 OK
x-amz-id-2: JHeb1BjLUTzMqrK4qOwvYT0zB2wNjKNcEwhmNj9fzSdTCZFdau9JZM0xXRpLUSlLHeak52vslUk=
x-amz-request-id: 0FF75D74EE0023B2
Date: Thu, 06 Oct 2016 03:11:06 GMT
Last-Modified: Thu, 06 Oct 2016 00:38:42 GMT
ETag: "c33e7afc2b92b9adb3f28999158dd728"
Content-Encoding: gzip
Accept-Ranges: bytes
Content-Type: text/xml
Content-Length: 312
Server: AmazonS3
benbalter commented 8 years ago

So it sounds like the issue is that the server is not setting the proper Content-Encoding header? Is that common? Is that out of spec?

okkez commented 8 years ago

This (a server response that does not include the proper headers) is common, and yes, it is out of spec.

http://www.sitemaps.org/protocol.html does not define HTTP response headers at all.

okkez commented 7 years ago

Any progress?

benbalter commented 7 years ago

@okkez The sitemap protocol describes the document, not the transport. The transport is based on the HTTP spec, which does specify how to handle encoding. If servers are ignoring request headers and sending gzip'd responses blindly, I don't think it's appropriate to make this plugin out of spec to compensate. Two wrongs don't make a right.

ikapelyukhin commented 7 years ago

The sitemap protocol says:

If you would like, you may compress your Sitemap
files using gzip to reduce your bandwidth requirement;
however the sitemap file once uncompressed must be
no larger than 50MB.

So, the protocol allows you to create a sitemap file and then compress it -- it's not the web server that does the compression. To the web server it is an application/octet-stream file (the server doesn't know what's in the file or whether it is encoded). Seems to me like this pull request should be merged :smile:

benbalter commented 7 years ago

the protocol allows you to create a sitemap file and then to compress it

Would that not then be sitemap.xml.gz?

ikapelyukhin commented 7 years ago

@benbalter it would be. But coincidentally it would also be served as application/octet-stream; a file-extension check would also work, if that's what you mean.
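The extension check mentioned above could be as simple as the following sketch (the helper name is hypothetical; Content-Type alone is inconclusive here, since S3 and nginx both report application/octet-stream for unrecognized files):

```ruby
# Hypothetical check: decide whether a fetched sitemap needs gunzipping
# based on the URL's extension rather than the Content-Type header.
def gzipped_sitemap_url?(url)
  url.end_with?(".gz")
end
```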

stanger commented 6 years ago

Is there any movement towards merging this pull request?

benbalter commented 6 years ago

Is there any movement towards merging this pull request?

I'm not comfortable blindly treating any file as a gzip'd file when the server doesn't follow the HTTP protocol (the burden should be on the server, not the consumer). But if the server returns a proper gzip header, I believe Typhoeus should be able to handle it by simply adding compress: true to https://github.com/benbalter/sitemap-parser/blob/master/lib/sitemap-parser.rb#L8.

stanger commented 6 years ago

It turns out this fix didn't resolve the issue I was having anyway. Gzip at the HTTP transport layer wasn't the problem; it was that the sitemaps listed in the sitemapindex are stored as gzip files on the server (see http://allrecipes.com/gsindex.xml).

dkniffin commented 4 years ago

@benbalter I know this is an old PR, but I want to chime in and advocate for this being merged. I've got a gzipped file uploaded to S3, and for whatever reason, it does not return the Content-encoding header. AFAIK, I have no control over what headers S3 returns for this (though if someone else knows better, please let me know). However, it does return the Content-Type: application/gzip header.

I tried the compress option you mentioned and it didn't work. Looking at the Typhoeus docs, it looks like their way of supporting gzip is via accept_encoding: "gzip", but adding that didn't work for me either. I even tried going straight to curl on the command line by running curl -sH 'Accept-Encoding: gzip' --compressed https://my-site.com/sitemap.xml.gz and that didn't work either, even though all the docs I've found on curl say it should unzip it. Just for reference, this does work: curl https://my-site.com/sitemap.xml.gz | gunzip - 🤷‍♂️

However, I did have to make one tweak to this PR: Zlib::Inflate.inflate(response.body) had to be changed to Zlib.gunzip(response.body). Not sure if that's something that changed in Zlib since this PR was opened or what, but it didn't work with inflate.
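The difference described above can be reproduced directly: Zlib::Inflate.inflate expects a raw zlib (RFC 1950) stream and raises on a gzip-wrapped (RFC 1952) one, while Zlib.gunzip (available since Ruby 2.4) understands the gzip header and trailer:

```ruby
require "zlib"

data = Zlib.gzip("<urlset/>")

# Zlib::Inflate.inflate expects a zlib stream; gzip data starts with
# different magic bytes (0x1f 0x8b), so the header check fails.
begin
  Zlib::Inflate.inflate(data)
rescue Zlib::DataError => e
  puts "inflate failed: #{e.message}"
end

# Zlib.gunzip parses the gzip wrapper correctly.
puts Zlib.gunzip(data)
```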

benbalter commented 4 years ago

Based on where the discussion ended, I believe this pull request is conflating two distinct issues:

  1. Gzip as transport-layer compression, which makes the HTTP stream smaller over the wire
  2. Gzip as a file format, which makes the content smaller on disk (and, as a result, smaller over the wire as well).

The library may need to handle both: using compress: true to support transport-layer compression as a nice performance bonus, and unzipping the file once downloaded to support gzip'd sitemaps at the application layer. That said, gunzipping arbitrary files can pose some security risks.
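One way to mitigate the decompression-bomb risk alluded to above is to enforce the sitemap protocol's 50MB uncompressed cap while inflating. A sketch, with hypothetical helper and constant names:

```ruby
require "zlib"
require "stringio"

MAX_SITEMAP_BYTES = 50 * 1024 * 1024 # 50MB cap from the sitemap protocol

# Hypothetical guard against decompression bombs: read the gzip stream
# in chunks and abort once the inflated size exceeds the protocol's cap.
def gunzip_with_limit(body, limit: MAX_SITEMAP_BYTES)
  out = +""
  Zlib::GzipReader.wrap(StringIO.new(body)) do |gz|
    while (chunk = gz.read(16_384))
      out << chunk
      raise "sitemap exceeds #{limit} bytes uncompressed" if out.bytesize > limit
    end
  end
  out
end
```

Reading in fixed-size chunks means a malicious archive is rejected early instead of being fully inflated into memory first.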

That's a long way to say, if someone is interested in picking up this pull request where it left off, I'd be glad to review it, but am not able to add the feature myself at this time.

dkniffin commented 4 years ago

@benbalter I've created a new PR here: https://github.com/benbalter/sitemap-parser/pull/19