IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo Group's email archives, photo galleries and file contents using the non-public API
MIT License
93 stars 46 forks

Timeout after 5 retries on error 400 #38

Closed lrrosa closed 4 years ago

lrrosa commented 4 years ago

The archiver is trying to download a file for which Yahoo only returns "Found malware in the request data". After 5 retries on error 400 it crashes instead of continuing to the next file.

Traceback (most recent call last):
  File "./yahoo.py", line 622, in <module>
    archive_files(yga)
  File "./yahoo.py", line 227, in archive_files
    archive_files(yga, subdir=pathURI)
  File "./yahoo.py", line 220, in archive_files
    yga.download_file(path['downloadURL'], f)
  File "/root/yahoo-group-archiver/yahoogroupsapi.py", line 64, in download_file
    r.raise_for_status()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://xa.yimg.com/df/pc-old/ddragon.zip?token=

d235j commented 4 years ago

Just ran into this — we need to handle it.

lrrosa commented 4 years ago

Still having the same issue even after the change from get_file() to download_file().

Traceback (most recent call last):
  File "./yahoo.py", line 637, in <module>
    archive_files(yga)
  File "./yahoo.py", line 227, in archive_files
    archive_files(yga, subdir=pathURI)
  File "./yahoo.py", line 220, in archive_files
    yga.download_file(path['downloadURL'], f)
  File "/root/yahoo-group-archiver/yahoogroupsapi.py", line 79, in download_file
    r.raise_for_status()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://xa.yimg.com/df/pc-old/ddragon.zip?token=

dossy commented 4 years ago

@d235j @lrrosa - check the dossy branch in my dossy/yahoo-group-archiver fork. When Yahoo! returns "Found malware in the request data" for a file download, it skips that file and carries on with the next one.

IgnoredAmbience commented 4 years ago

Thanks, I'll cherry-pick that into master. I've been finding it awkward to distinguish the different cases in which a 400 can be returned.

jmeile commented 4 years ago

Hi

I fixed this as follows:

Put this at the beginning of yahoogroupsapi.py:

    from urlparse import urlparse

Then go to the function download_file and replace it with:

    def download_file(self, url, f=None, **args):
        with self.http_context(self.ww):
            retries = 5
            malware_check = False
            while True:
                r = self.s.get(url, stream=True, verify=VERIFY_HTTPS, **args)
                if r.status_code == 400 and retries > 0:
                    self.logger.info("Got 400 error for %s, will sleep and retry %d times", url, retries)
                    if not malware_check:
                        malware_check = True
                        if ('malware' in r.text):
                            self.logger.warning("Malware was found in file: %s, aborting", url)
                            if f <> None:
                                file_name = os.path.basename(urlparse(url).path)
                                full_path = os.path.join(os.getcwd(), file_name)
                                self.logger.warning("Deleting downloaded file: %s", full_path)
                                try:
                                    f.close()
                                    os.remove(full_path)
                                except Exception as excep:
                                    # Just in case; not a big problem, since the file will be empty anyway.
                                    self.logger.warning("Failed: deleting downloaded file: %s", full_path)
                                    self.logger.warning("Exception: %s", str(excep))
                            return
                    retries -= 1
                    time.sleep(5)
                    continue
                r.raise_for_status()
                break

            if f is None:
                return r.content

            for chunk in r.iter_content(chunk_size=4096):
                f.write(chunk)

Please note that I used this:

                    if not malware_check:
                        malware_check = True
                        if ('malware' in r.text):

instead of this:

                    if not malware_check and 'malware' in r.text:
                        malware_check = True

because I don't want the script to search for 'malware' every time a 400 is returned. That might slow things down.
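For illustration, the "check the body only on the first 400" idea can be pulled out into a small standalone helper. This is only a sketch, not the code from the pull request: the names `fetch_with_retries` and `MalwareBlocked` are made up here, and `get` is assumed to be any callable that returns an object with `status_code` and `text` (for example the `get` method of a requests session).

```python
import time


class MalwareBlocked(Exception):
    """Hypothetical exception: the server reported malware for this URL."""


def fetch_with_retries(get, url, retries=5, delay=5.0, sleep=time.sleep):
    """Call get(url) until it stops returning 400 or the retries run out."""
    body_checked = False
    while True:
        r = get(url)
        if r.status_code != 400:
            return r
        # Search the body only on the first 400. Assumption: the malware
        # notice does not change between attempts, so one check is enough.
        if not body_checked:
            body_checked = True
            if 'malware' in r.text:
                raise MalwareBlocked(url)
        if retries <= 0:
            # Out of retries: hand the last 400 back to the caller, who can
            # still call r.raise_for_status() on it.
            return r
        retries -= 1
        sleep(delay)
```

Raising a dedicated exception lets the caller decide what to do with the half-written file and then move on to the next download.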

I added a pull request: https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/74

Best regards Josef

jmeile commented 4 years ago

Sorry, I made a mistake in the initial pull request. The following line:

    f.close()

was commented out. It is needed because on Windows, deleting a file that is still open fails.
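A minimal sketch of that close-then-delete order, in case it helps. The helper name `discard_partial_file` is made up, and it assumes `f` was opened from a path, so that `f.name` holds the real location of the file (instead of rebuilding the path from the URL and the current directory).

```python
import os


def discard_partial_file(f):
    """Close an open file object, then delete the file it refers to.

    Closing first matters on Windows, where removing a file that is
    still open fails. Returns the path that was removed.
    """
    path = f.name  # assumption: f was created with open(path, ...)
    f.close()
    os.remove(path)
    return path
```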

marked commented 4 years ago

The malware error message should have a fixed byte length, so the string search could be limited to responses of exactly that size.
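Roughly like this, as a sketch. The function name is made up, and the actual size of Yahoo's malware response is not given in this thread, so `expected_length` is left as a parameter that would have to be measured from a real response.

```python
def looks_like_malware_response(body, expected_length=None):
    """Return True if `body` (bytes) appears to be the malware notice.

    If `expected_length` is given, bodies of any other size are rejected
    without being searched at all.
    """
    if expected_length is not None and len(body) != expected_length:
        return False
    return b"malware" in body
```

With requests, the size comparison could possibly be done against the `Content-Length` response header before reading the body, assuming Yahoo sends that header for these responses.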

joerg-knitter commented 4 years ago

@jmeile Did you notice that the flake8 linter checks fail on Python > 3.5 at line 84, and that the 2.7 test reports that "<>" is deprecated and should be replaced by "!="? ("if f <> None:") See https://github.com/IgnoredAmbience/yahoo-group-archiver/pull/74/checks?check_run_id=276458699#step:5:0

Nevertheless, thanks for your patch. I hope that it gets merged soon because I encounter exactly this error message, too.
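For reference, a small sketch of a check that runs on both Python 2.7 and Python 3. The function `describe` exists only for this illustration; the relevant part is the comparison itself.

```python
def describe(f):
    # `<>` was Python 2's alternative spelling of `!=` and is a syntax
    # error in Python 3. For None, an identity test is the usual idiom.
    if f is not None:
        return "write to file"
    return "return content"
```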

jmeile commented 4 years ago

OK, thanks for the hint. It's been a long time since I used Python :-).

IgnoredAmbience commented 4 years ago

Another variation on this bug:

Traceback (most recent call last):
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 425, in _error_catcher
    yield
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 507, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 457, in read
    n = self.readinto(b)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 501, in readinto
    n = self.fp.readinto(b)
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 56, in readinto
    self.recorder.write_response(buff)
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 167, in write_response
    self.response_out.write(buff)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 764, in write
    rv = file.write(s)
ValueError: I/O operation on closed file.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "yahoo.py", line 784, in <module>
    archive_files(yga)
  File "yahoo.py", line 258, in archive_files
    yga.download_file(path['downloadURL'], f)
  File "c:\root\repositories\yahoo-group-archiver\yahoogroupsapi.py", line 90, in download_file
    for chunk in r.iter_content(chunk_size=4096):
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 564, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\response.py", line 456, in _error_catcher
    self._original_response.close()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 418, in close
    self._close_conn()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 411, in _close_conn
    fp.close()
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 65, in close
    self.recorder.done()
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 181, in done
    request = self._create_record(self.request_out, 'request')
  File "C:\Users\name\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\warcio\capture_http.py", line 170, in _create_record
    length = out.tell()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\tempfile.py", line 752, in tell
    return self._file.tell()
ValueError: I/O operation on closed file.