coleifer / micawber

a small library for extracting rich content from urls
http://micawber.readthedocs.org/
MIT License

Make sure the whole response data is read when using urllib2.urlopen #24

Closed by ahumeau 10 years ago

ahumeau commented 10 years ago

The read method is not guaranteed to return the whole response body from urlopen's result. Using readlines instead does the trick.
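For illustration, here is a minimal sketch of a defensive alternative: keep calling read until it returns an empty result, so a short read cannot truncate the body. It is written for Python 3, and `read_all` and `ShortReadResponse` are hypothetical names invented for this example, not part of micawber or urllib2. Whether urlopen's result really returns short reads is exactly what this thread is trying to establish, so the stub only simulates that behavior.

```python
def read_all(response, chunk_size=8192):
    """Read from a file-like object until EOF, tolerating short reads."""
    chunks = []
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        chunks.append(chunk)
    return b"".join(chunks)


class ShortReadResponse:
    """Hypothetical stub whose read() returns at most 3 bytes per call."""

    def __init__(self, payload):
        self._payload = payload
        self._pos = 0

    def read(self, size=-1):
        # Deliberately ignore the requested size to simulate short reads.
        chunk = self._payload[self._pos:self._pos + 3]
        self._pos += len(chunk)
        return chunk


payload = b'{"title": "example"}'
body = read_all(ShortReadResponse(payload))
```

With this stub, a single read() call yields only the first three bytes, while `read_all` collects the full payload.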

coleifer commented 10 years ago

Can you provide more information on why read() does not read all of the response?

coleifer commented 10 years ago

It looks like the important part of read() is implemented as follows:

            while True:
                try:
                    data = self._sock.recv(rbufsize)
                except error, e:
                    if e.args[0] == EINTR:
                        continue
                    raise
                if not data:
                    break
                buf.write(data)
            return buf.getvalue()

I'm not seeing how this would avoid reading everything.
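To make the loop above easier to experiment with, here is a rough Python 3 rendering run against a stub socket. `StubSocket` and `read_until_eof` are hypothetical names for this sketch, and the stub is an assumption about how recv might behave (chunks, an interrupted call, then an empty result), not a capture of real network traffic.

```python
import errno
import io


class StubSocket:
    """Hypothetical socket stand-in: replays preset events, then returns b''."""

    def __init__(self, events):
        self._events = list(events)

    def recv(self, bufsize):
        # bufsize is ignored; each preset chunk is returned whole.
        if not self._events:
            return b""
        event = self._events.pop(0)
        if isinstance(event, Exception):
            raise event
        return event


def read_until_eof(sock, rbufsize=8192):
    """Same shape as the quoted loop: retry on EINTR, stop on empty data."""
    buf = io.BytesIO()
    while True:
        try:
            data = sock.recv(rbufsize)
        except OSError as e:
            if e.errno == errno.EINTR:
                continue  # interrupted system call, try again
            raise
        if not data:
            break  # peer closed the connection
        buf.write(data)
    return buf.getvalue()


sock = StubSocket([b'{"a"', OSError(errno.EINTR, "interrupted"), b": 1}"])
result = read_until_eof(sock)
```

Under these assumptions the loop only stops once recv returns an empty result, so both chunks end up in the buffer despite the interruption in between.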

ahumeau commented 10 years ago

Hi, sorry, I was offline these past few days. The only reference I found regarding this issue is this Stack Overflow entry. Still, I did run into the problem: bootstrap_embedly failed when decoding the incomplete JSON data. Using readlines or requests "solved" the issue.
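One way to tell a truncated body apart from malformed JSON is to compare the number of bytes received against the Content-Length header before decoding. The sketch below assumes Python 3; `decode_json_body` is a hypothetical helper, not a micawber function, and the header may be absent (for example with chunked transfer encoding), in which case the check is skipped.

```python
import json


def decode_json_body(body, content_length=None):
    """Decode a JSON body, raising a clear error if it looks truncated."""
    if content_length is not None:
        expected = int(content_length)
        if len(body) < expected:
            raise ValueError(
                "truncated body: got %d of %d bytes" % (len(body), expected)
            )
    return json.loads(body.decode("utf-8"))


complete = decode_json_body(b'{"a": 1}', content_length="8")

try:
    decode_json_body(b'{"a":', content_length="8")
    truncated_error = None
except ValueError as exc:
    truncated_error = str(exc)
```

A failure of this kind would point at the read path rather than at the provider's JSON, which is the information requested in the follow-up question.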

coleifer commented 10 years ago

Is there a particular site or domain that is giving you problems? Can you give me any code to try to replicate the issue? I'm definitely curious what might be going on.