[BUG] 0.9.2 seems to not quite handle 206's properly

palfrey commented 8 months ago

Describe the bug

I haven't seen this on 0.9.1, but now seeing on 0.9.2. For some sites, especially https://www.theguardian.com/ I'm getting logs like newspaper.network:network.py:192 get_html_status(): bad status code 206 on URL and it looks like basically a 206 is getting hit with a gzip-encoded response and it's not going back and pulling the rest of the content.

Sometimes I get a newspaper.exceptions.ArticleBinaryDataException when it's clearly not an actual binary page, it's just a partially retrieved page that's failing zlib stuff because it's only got half the page.

To Reproduce Annoyingly, I don't have an easy repo of this. I've got it semi-reliable, but only within a large test sequence with pytest and VCR.py, and I'm trying to get something more reliable that's a single file I can provide.

Expected behavior Download just works with gzip-encoded pages, even if they do a 206 part way through.

System information

OS: Linux
Python version: 3.11.5
Library version: 0.9.2

AndyTheFactory commented 7 months ago

Hi i was not so familiar with the way 206 is used and implemented, had to research a little bit.

yes, it's hard to find a test case. that is why i created quickly a simple server that delivers 206 responses

Gzip encoded and Ranged server

I tested it with a cnn article (cnn_article.html from tests data)

the Warning logged should not influence the result. The only problem is that you will get only the partial response parsed. If the Html is too aggressive cut (for instance if you set the limit to 10000 for the above mentioned article), you will not get any useful text.

There is the question why would 206 appear on non-binary content. As far as i could test, browsers do not request the rest of the partial content if they get 206 on the main html page. sure, streaming resources maybe. did not test

Regarding gzip encoded, it's weird that they would split the content after gzip, I did not find any references to such a practice .. What i found was that the chunk is gziped and sent. what you encounter seems to be rather an network error? could it be?

you can play around with the server and maybe you can simulate a case that is similar to what you encountered in the wild.

palfrey commented 7 months ago

I haven't been able to fully reproduce this one. What I do have is a a VCR.py cassette (https://gist.github.com/palfrey/f8556218fe86e57c1f507b8d65a3e311) that got recorded and then caused issues. Note that it somehow has both a GET with the 206 response and a partial data bit and another GET for the same URL with partial data. I have no idea what's causing that, but deleting the 206 responses from my stored data seems to solve things, and AFAIK this is only occurring in the test scenarios not prod, so it might be a vcr.py issue...

AndyTheFactory commented 7 months ago

ok, let's keep an eye on it. I will release 0.9.3 without any extra changes to address this. Working now on the last touch-ups

AndyTheFactory / newspaper4k

[BUG] 0.9.2 seems to not quite handle 206's properly #617