Closed lrrosa closed 4 years ago
Just ran into this — we need to handle it.
Still having the same issue even after the change from get_file() to download_file().
Traceback (most recent call last):
File "./yahoo.py", line 637, in
@d235j @lrrosa - check the dossy branch in my dossy/yahoo-group-archiver fork. When Yahoo! returns "Found malware in the request data", it skips that file and carries on with the rest.
Thanks, I'll cherry-pick that into master. I've been finding it awkward to tell apart the different cases in which a 400 can be returned.
Hi
I fixed this as follows:
Put this at the beginning of: yahoogroupsapi.py
from urlparse import urlparse
Then go to the function: "download_file" and replace it by:
    def download_file(self, url, f=None, **args):
        with self.http_context(self.ww):
            retries = 5
            malware_check = False
            while True:
                r = self.s.get(url, stream=True, verify=VERIFY_HTTPS, **args)
                if r.status_code == 400 and retries > 0:
                    self.logger.info("Got 400 error for %s, will sleep and retry %d times", url, retries)
                    # Search the body for 'malware' only on the first 400 response
                    if not malware_check:
                        malware_check = True
                        if ('malware' in r.text):
                            self.logger.warning("Malware was found in file: %s, aborting", url)
                            # Note: "<>" is Python 2 only syntax; Python 3 needs "!=" or "is not None"
                            if f <> None:
                                file_name = os.path.basename(urlparse(url).path)
                                full_path = os.path.join(os.getcwd(), file_name)
                                self.logger.warning("Deleting downloaded file: %s", full_path)
                                try:
                                    f.close()
                                    os.remove(full_path)
                                except Exception as excep:
                                    # Just in case. Not really a big problem: the file will be empty anyhow
                                    self.logger.warning("Failed: deleting downloaded file: %s", full_path)
                                    self.logger.warning("Exception: %s", str(excep))
                            return
                    retries -= 1
                    time.sleep(5)
                    continue
                r.raise_for_status()
                break
            if f is None:
                return r.content
            for chunk in r.iter_content(chunk_size=4096):
                f.write(chunk)
Please note that I used this:
    if not malware_check:
        malware_check = True
        if ('malware' in r.text):
instead of this:
    if not malware_check and 'malware' in r.text:
        malware_check = True
because I don't want the script to search for 'malware' every time a 400 code is returned. With the second form, malware_check only becomes True when the text is found, so every other 400 response would be searched again, which may slow things down.
I added a pull request: https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/74
Best regards Josef
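For anyone on Python 3, the retry-then-skip logic above can be sketched on its own, outside the class. This is only a minimal sketch, not the code from the pull request: `download_with_malware_skip`, the `get` callable, and the `sleep` parameter are hypothetical names introduced here so the logic can be exercised without a network connection.

```python
import time

# Assumption: the 400 body contains this substring, as quoted in this thread.
MALWARE_MARKER = "malware"


def download_with_malware_skip(get, url, retries=5, delay=5.0, sleep=time.sleep):
    """Return the response body, or None if the server flags the file as malware.

    `get` is any callable shaped like requests.Session.get: it takes a URL and
    returns an object with status_code, text, content and raise_for_status().
    """
    malware_checked = False
    while True:
        r = get(url)
        if r.status_code == 400 and retries > 0:
            # Inspect the body only on the first 400; later retries skip the search.
            if not malware_checked:
                malware_checked = True
                if MALWARE_MARKER in r.text:
                    return None  # caller should skip this file and move on
            retries -= 1
            sleep(delay)
            continue
        r.raise_for_status()  # raises if the last attempt still failed
        return r.content
```

Returning None instead of raising keeps the malware case distinct from a 400 that persists after all retries, which still ends in an exception.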
Sorry, I made a mistake in the initial pull request. The following line:
    f.close()
was commented out. It is needed because on Windows, trying to delete a file that is still open fails.
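The close-before-delete order can be shown in isolation. This is a rough sketch; `discard_partial_file` is a hypothetical helper name, not something in the archiver.

```python
import os


def discard_partial_file(f, path):
    """Close f, then delete path. Returns True if the file was removed."""
    try:
        f.close()        # must come first: Windows refuses to delete an open file
        os.remove(path)
        return True
    except OSError:
        # Not fatal for the archiver: the leftover file is empty anyway.
        return False
```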
The malware error message should have a fixed byte length, so that could be used to run the string search only on responses of a matching size.
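A length pre-check along those lines might look like the sketch below. The function name `is_malware_response` is made up for illustration, and the default expected length is an assumption: it is just the length of the message quoted in this thread, while the real 400 body may carry extra bytes, so it would need to be measured against an actual response first.

```python
# Message text as quoted in this thread; the real body may differ in framing.
MALWARE_MESSAGE = b"Found malware in the request data"


def is_malware_response(body, expected_length=len(MALWARE_MESSAGE)):
    """Cheap length comparison first; substring search only on a size match."""
    if len(body) != expected_length:
        return False
    return b"malware" in body
```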
@jmeile Did you notice the failed checks from the flake8 linter at line 84? They fail on Python > 3.5, and the 2.7 test reports that "<>" is deprecated and should be replaced by "!=" ("if f <> None:"). See https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/74/checks?check_run_id=276458699#step:5:0
Nevertheless, thanks for your patch. I hope that it gets merged soon because I encounter exactly this error message, too.
Ok, thanks for the hint. It's been a long time since I last used Python :-).
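For reference, a form that runs on both 2.7 and 3.x could look roughly like this. `local_file_name` is a hypothetical helper used only to show the comparison and the import fallback; the token value in the usage line is a placeholder.

```python
import os

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def local_file_name(url, f=None):
    """Return the file name from the URL path, or None when no file object was given."""
    if f is not None:  # valid on 2.7 and 3.x, unlike "f <> None"
        return os.path.basename(urlparse(url).path)
    return None
```

Calling `local_file_name("https://xa.yimg.com/df/pc-old/ddragon.zip?token=abc", f=some_file)` yields `"ddragon.zip"`, since the query string is not part of the parsed path.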
Another variation on this bug:
Traceback (most recent call last):
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 425, in _error_catcher
yield
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 507, in read
data = self._fp.read(amt) if not fp_closed else b""
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 457, in read
n = self.readinto(b)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 501, in readinto
n = self.fp.readinto(b)
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 56, in readinto
self.recorder.write_response(buff)
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 167, in write_response
self.response_out.write(buff)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 764, in write
rv = file.write(s)
ValueError: I/O operation on closed file.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "yahoo.py", line 784, in <module>
archive_files(yga)
File "yahoo.py", line 258, in archive_files
yga.download_file(path['downloadURL'], f)
File "c:\root\repositories\yahoo-group-archiver\yahoogroupsapi.py", line 90, in download_file
for chunk in r.iter_content(chunk_size=4096):
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\models.py", line 750, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 564, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 529, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 456, in _error_catcher
self._original_response.close()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 418, in close
self._close_conn()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 411, in _close_conn
fp.close()
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 65, in close
self.recorder.done()
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 181, in done
request = self._create_record(self.request_out, 'request')
File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 170, in _create_record
length = out.tell()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 752, in tell
return self._file.tell()
ValueError: I/O operation on closed file.
The archiver is trying to download a file for which Yahoo just returns "Found malware in the request data". After 5 retries on error 400, it crashes instead of continuing to the next file.
Traceback (most recent call last):
File "./yahoo.py", line 622, in
archive_files(yga)
File "./yahoo.py", line 227, in archive_files
archive_files(yga, subdir=pathURI)
File "./yahoo.py", line 220, in archive_files
yga.download_file(path['downloadURL'], f)
File "/root/yahoo-group-archiver/yahoogroupsapi.py", line 64, in download_file
r.raise_for_status()
File "/usr/lib/python2.7/site-packages/requests/models.py", line 844, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://xa.yimg.com/df/pc-old/ddragon.zip?token=
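One way to keep a single bad file from ending the whole run is to catch the error at the call site and move on. The sketch below is an assumption about how that could be structured, not the archiver's actual `archive_files`: `archive_each`, the `download` callable, and the `(name, url)` pairs are all hypothetical.

```python
import logging

logger = logging.getLogger("archiver")


def archive_each(download, files, skip_errors=(Exception,)):
    """Call download(url, name) for every (name, url) pair.

    A failure listed in skip_errors is logged and recorded instead of
    propagating, so the loop continues with the next file.
    Returns the names that could not be downloaded.
    """
    failed = []
    for name, url in files:
        try:
            download(url, name)
        except skip_errors as exc:
            logger.warning("Skipping %s: %s", name, exc)
            failed.append(name)
    return failed
```

With requests, passing `skip_errors=(requests.exceptions.HTTPError,)` would limit the skipping to the exception shown in the traceback above and let anything unexpected still surface.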