KrishnaswamyLab / scprep

A collection of scripts and tools for loading, processing, and handling single cell data.
MIT License

`scprep.io.download.download_google_drive` failing due to google api changes #132

Open atong01 opened 2 years ago

atong01 commented 2 years ago

A report by Erica in the Help Slack shows that the scprep Google Drive downloads are breaking in the workshop notebooks. Example:

# download the data from Google Drive
scprep.io.download.download_google_drive("1QGkqL_FF7iveR1TLZ8HJKBANOmugBxlm",
                                         "retinal_bipolar.zip")
scprep.io.download.unzip("retinal_bipolar.zip")

Fails with

---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-3-f00e6d2b86fe> in <module>()
      2 scprep.io.download.download_google_drive("1QGkqL_FF7iveR1TLZ8HJKBANOmugBxlm",
      3                                          "retinal_bipolar.zip")
----> 4 scprep.io.download.unzip("retinal_bipolar.zip")

2 frames
/usr/lib/python3.7/zipfile.py in _RealGetContents(self)
   1323             raise BadZipFile("File is not a zip file")
   1324         if not endrec:
-> 1325             raise BadZipFile("File is not a zip file")
   1326         if self.debug > 1:
   1327             print(endrec)

BadZipFile: File is not a zip file

Looking more closely, the downloaded "zip file" is not a zip archive at all, but an HTML virus scan warning page from Google Drive.

!cat retinal_bipolar.zip

<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="xG6PSW5r7g0D-MRjK19yow">/* Copyright 2022 Google Inc. All Rights Reserved. */
.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=1QGkqL_FF7iveR1TLZ8HJKBANOmugBxlm">retinal_bipolar.zip</a> (92M)</span> is too large for Google to scan for viruses. Would you still like to download this file?</p><form id="downloadForm" action="https://docs.google.com/uc?export=download&amp;id=1QGkqL_FF7iveR1TLZ8HJKBANOmugBxlm&amp;confirm=t" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>
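A cheaper way to catch this than dumping the whole file is to inspect the payload before unzipping. The following is a minimal sketch using only the standard library; `looks_like_html` and `check_download` are hypothetical helper names, not part of scprep.

```python
import zipfile


def looks_like_html(path, n_bytes=512):
    """Return True if the first bytes of the file look like an HTML page."""
    with open(path, "rb") as handle:
        head = handle.read(n_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def check_download(path):
    """Classify a downloaded file as 'zip', 'html', or 'unknown'."""
    if zipfile.is_zipfile(path):
        return "zip"
    if looks_like_html(path):
        return "html"
    return "unknown"
```

Calling `check_download("retinal_bipolar.zip")` before `scprep.io.download.unzip` would let the notebook raise a clearer error than `BadZipFile` when Google Drive returns a warning page.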

A quick Stack Overflow search suggests that the Google Drive API may have changed:

seems like something has changed behind the scenes and the token stuff does not quite work anymore. However, simply always including confirm=1 as parameter seems to be a workaround. – [Mr Tsjolder](https://stackoverflow.com/users/4375377/mr-tsjolder), [Apr 8 at 14:11](https://stackoverflow.com/questions/38511444/python-download-files-from-google-drive-using-url#comment126879887_39225272)

On closer examination, the first response we get no longer carries any cookies, so the confirmation token comes back empty and we end up sending confirm=None. I suspect this passes the current tests because the test file is small enough for Google to scan for viruses, so no confirmation is needed.
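To illustrate the failure mode, here is a sketch of the cookie-based token lookup that is widely used for Google Drive downloads. I am assuming scprep's internal helper behaves similarly; `get_confirm_token` is a hypothetical name, a plain dict stands in for `response.cookies`, and the cookie name and value are made up for illustration.

```python
def get_confirm_token(cookies):
    """Return the value of the first `download_warning*` cookie, or None."""
    for key, value in cookies.items():
        if key.startswith("download_warning"):
            return value
    return None


# Assumed old behaviour: large files came back with a warning cookie.
old_cookies = {"download_warning_example": "AbCd"}
# Behaviour reported in this issue: the first response carries no cookies.
new_cookies = {}

print(get_confirm_token(old_cookies))  # AbCd
print(get_confirm_token(new_cookies))  # None
```

With no cookies the lookup falls through to `None`, which matches the confirm=None we observe on the second request.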

Changing the above to the following works. It skips the initial request and always sends "confirm=1". However, this may have unintended consequences, since we no longer check whether confirmation is actually required. Perhaps there is a better solution.

# download the data from Google Drive
import requests
import scprep

_GOOGLE_DRIVE_URL = "https://docs.google.com/uc?export=download"
_CHUNK_SIZE = 32768


def _GET_google_drive(id):
    # always send confirm=1 instead of requesting a token first
    with requests.Session() as session:
        params = {"id": id, "confirm": 1}
        response = session.get(_GOOGLE_DRIVE_URL, params=params, stream=True)
    return response


def _save_response_content(response, destination):
    if isinstance(destination, str):
        # destination is a path: open it and recurse with the file handle
        with open(destination, "wb") as handle:
            _save_response_content(response, handle)
    else:
        for chunk in response.iter_content(_CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                destination.write(chunk)


def download_google_drive(id, destination):
    response = _GET_google_drive(id)
    _save_response_content(response, destination)


download_google_drive("1QGkqL_FF7iveR1TLZ8HJKBANOmugBxlm", "retinal_bipolar.zip")
scprep.io.download.unzip("retinal_bipolar.zip")
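One possible alternative, sketched here and not tested against Google Drive: instead of hard-coding the confirm value, read it from the warning page itself. The HTML shown above contains a form with id `downloadForm` whose `action` URL carries `confirm=t`. The helper below (hypothetical name `extract_confirm_from_html`, standard library only) pulls that value out. It assumes the page keeps this form layout, which Google may change at any time.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class _DownloadFormParser(HTMLParser):
    """Record the `action` attribute of the form with id="downloadForm"."""

    def __init__(self):
        super().__init__()
        self.action = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == "downloadForm":
            self.action = attrs.get("action")


def extract_confirm_from_html(html):
    """Return the `confirm` query value from the warning page, or None."""
    parser = _DownloadFormParser()
    parser.feed(html)
    if parser.action is None:
        return None
    query = parse_qs(urlparse(parser.action).query)
    values = query.get("confirm")
    return values[0] if values else None


# Trimmed-down copy of the form from the warning page in this issue.
warning_page = (
    '<form id="downloadForm" '
    'action="https://docs.google.com/uc?export=download'
    '&amp;id=1QGkqL_FF7iveR1TLZ8HJKBANOmugBxlm&amp;confirm=t" method="post">'
    "</form>"
)
print(extract_confirm_from_html(warning_page))  # t
```

Note that the form in the warning page uses `method="post"`; whether a GET request with the extracted value is accepted is something I have not verified.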