Open JohnGiorgi opened 2 years ago
Could you check to make sure that the cnn_stories file successfully downloaded? It looks like there's a problem opening it, so my guess is something is wrong with the file.
I tried re-downloading `cnn_stories.tgz` by deleting `temp/*` and running `sh data/setup.sh` again, but I get the same error.
If I manually inspect the contents of the file, I can see it only managed to grab some HTML:
<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="TOeDQ4nep9OjaIugDC8ZVg">/* Copyright 2022 Google Inc. All Rights Reserved. */
.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ">cnn_stories.tgz</a> (151M)</span> is too large for Google to scan for viruses. 
Would you still like to download this file?</p><form id="downloadForm" action="https://docs.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&confirm=t&uuid=656a37f4-386b-46c5-ab03-a42da6c349fa" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>
I don't know how to avoid this. It looks like the `sacrerouge.common.util.download_file_from_google_drive` function is not working as expected. I see similar HTML content for `dailymail_stories.tgz`.
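For anyone hitting the same thing, a quick way to confirm that a downloaded `.tgz` is really a gzip archive (and not an HTML page saved under that name) is to check its first two bytes. This is only a minimal sketch using the standard library; `is_gzip_file` is a hypothetical helper name and is not part of sacrerouge.

```python
# Hypothetical helper (not part of sacrerouge): checks the gzip magic number.
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def is_gzip_file(path: str) -> bool:
    """Return True if the file at `path` begins with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

Running it on the downloaded `cnn_stories.tgz` should return `False` when Google Drive served the virus-scan warning page instead of the archive.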
Ok, I have seen this problem before. The library I use to download from Google Drive is not reliable.
Until I have time to fix it, a workaround is to download the respective files from here and put them in the `temp/summeval/raw/` directory. I believe the downloading code should see that the files already exist and skip trying to download them.
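The skip-if-present behaviour described above can be sketched roughly as follows. This is an illustration of the general pattern only, not the actual sacrerouge code; `download_if_missing` and the `download` callback are hypothetical names.

```python
import os
from typing import Callable


def download_if_missing(path: str, download: Callable[[str], None]) -> bool:
    """Call `download(path)` only when `path` does not exist yet.

    Returns True if the download ran, False if the existing file was kept.
    `download` is a hypothetical callback standing in for the real downloader.
    """
    if os.path.exists(path):
        # A manually placed file is left untouched.
        return False
    # Make sure the target directory exists before downloading into it.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    download(path)
    return True
```

Note that a check like this only looks at whether the file exists, so a previously saved HTML error page would also be skipped; that is why the bad files have to be deleted before rerunning.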
Gotcha, this gets me past the first Google Drive-related hurdle, thanks! Unfortunately, the GDrive requests then start getting blocked:
AssertionError: pnbert_out_lstm_pn_rl has unequal lines in its src, ref, and out files, likely because Google Drive began denying requests. Delete the bad files and rerun.
If I manually inspect the files, I can see that is the case:
<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>
Any idea how long to wait between runs before GDrive allows these requests to go through again?
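To find the bad files that the assertion message asks to delete, one option is to scan the download directory for files whose contents start like an HTML page. This is a rough sketch under the assumption that the blocked responses begin with `<html` or `<!DOCTYPE html`, as in the two examples pasted in this thread; `find_html_files` is a hypothetical helper, not something sacrerouge provides.

```python
import os
from typing import List

# Prefixes seen in the two HTML responses pasted in this thread (lowercased).
HTML_PREFIXES = (b"<!doctype html", b"<html")


def find_html_files(root: str) -> List[str]:
    """Return sorted paths under `root` whose contents start like an HTML page."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Read a small prefix only; ignore leading whitespace and case.
                head = f.read(64).lstrip().lower()
            if head.startswith(HTML_PREFIXES):
                bad.append(path)
    return sorted(bad)
```

The returned paths can be reviewed and deleted by hand before rerunning the setup script.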
Hopefully you were able to fix this 😬. Downloading files from Google Drive has always been a pain point, and I don't know of any way to download files from it more reliably. One day I will get around to fixing it...
Hi,
I am trying to re-create the data under the `data` subdirectory by following the instructions. With a Docker daemon running, I set up as follows:
and then run:
This process runs until it reaches the following step:
at which point it errors out:
Any ideas why untarring cnn_stories would fail? The issue appears to originate in `sacrerouge`.