setup.sh fails when untarring cnn_stories.tgz

JohnGiorgi commented 2 years ago

Hi,

I am trying to re-create the data under the data subdirectory by following the instructions. With a Docker daemon running, I set up as follows:

# Set up the environment
pyenv install miniconda3-3.7-4.12.0
conda create -n re-examining
conda activate re-examining

# Install re-examining
git clone https://github.com/CogComp/re-examining-correlations.git
cd re-examining-correlations
pip install -r requirements.txt
pip install sacrerouge
pip install repro

and then run

sh data/setup.sh

This process runs until it reaches the following step:

Untarring temp/summeval/raw/cnn_stories.tgz (it's pretty slow...)

at which point it errors out:

Traceback (most recent call last):
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/bin/sacrerouge", line 8, in <module>
    sys.exit(main())
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/__main__.py", line 8, in main
    args.func(args)
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/commands/setup_dataset.py", line 25, in run
    args.subfunc(args)
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/datasets/fabbri2020/subcommand.py", line 28, in run
    setup.setup(args.output_dir, args.force)
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/datasets/fabbri2020/setup.py", line 349, in setup
    setup_documents(cnn_tar, dailymail_tar, output_dir, force)
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/site-packages/sacrerouge/datasets/fabbri2020/setup.py", line 130, in setup_documents
    with tarfile.open(tar_path, 'r') as tar:
  File "/Users/johngiorgi/.pyenv/versions/miniconda3-3.7-4.12.0/lib/python3.7/tarfile.py", line 1580, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully

Any ideas why untarring cnn_stories would fail? The issue appears to originate in sacrerouge.

danieldeutsch commented 2 years ago

Could you check to make sure that the cnn_stories file successfully downloaded? It looks like there's a problem opening it, so my guess is something is wrong with the file.
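
A minimal sketch of such a check, using only the Python standard library. The helper name looks_like_tarball is hypothetical (it is not part of sacrerouge), and the HTML markers it looks for are an assumption about what a failed download tends to contain:

```python
import tarfile
from pathlib import Path


def looks_like_tarball(path):
    """Return True if `path` exists, is non-empty, and opens as a tar archive."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    # A failed download is often an HTML page saved under the archive's name.
    with p.open("rb") as f:
        head = f.read(512).lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return False
    # is_tarfile tries to open the file with every supported compression.
    return tarfile.is_tarfile(str(p))


if __name__ == "__main__":
    print(looks_like_tarball("temp/summeval/raw/cnn_stories.tgz"))
```

If this prints False, the file on disk is not a usable archive and the tarfile.ReadError above is expected.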

JohnGiorgi commented 2 years ago

I tried re-downloading cnn_stories.tgz by deleting temp/* and running sh data/setup.sh again, but I get the same error.

If I manually inspect the contents of the file, I can see it only managed to grab some HTML:

<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="TOeDQ4nep9OjaIugDC8ZVg">/* Copyright 2022 Google Inc. All Rights Reserved. */
.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ">cnn_stories.tgz</a> (151M)</span> is too large for Google to scan for viruses. 
Would you still like to download this file?</p><form id="downloadForm" action="https://docs.google.com/uc?export=download&amp;id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&amp;confirm=t&amp;uuid=656a37f4-386b-46c5-ab03-a42da6c349fa" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>

I don't know how to avoid this. It looks like the sacrerouge.common.util.download_file_from_google_drive function is not working as expected. I see similar HTML content for dailymail_stories.tgz.
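
For reference, the warning page above carries a "Download anyway" link in the action attribute of its downloadForm element. Below is a minimal standard-library sketch of pulling that URL out of a saved copy of the page. DownloadFormParser and extract_confirm_url are hypothetical names, not part of sacrerouge, and whether requesting the extracted URL actually returns the archive is an assumption that nothing in this thread confirms:

```python
from html.parser import HTMLParser
from typing import Optional


class DownloadFormParser(HTMLParser):
    """Records the action URL of the <form id="downloadForm"> element."""

    def __init__(self):
        super().__init__()
        self.action = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form" and attributes.get("id") == "downloadForm":
            # HTMLParser has already unescaped entities such as &amp;.
            self.action = attributes.get("action")


def extract_confirm_url(page: str) -> Optional[str]:
    """Return the form's action URL, or None if the form is absent."""
    parser = DownloadFormParser()
    parser.feed(page)
    return parser.action
```

Called on the text of the bad cnn_stories.tgz, this returns the docs.google.com/uc?export=download URL with the confirm=t parameter; on a page without the form it returns None.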

danieldeutsch commented 2 years ago

Ok, I have seen this problem before. The library I use to download from Google Drive is not reliable.

Until I have time to fix it, a workaround is to download the respective files from here and put them in the temp/summeval/raw/ directory. I believe the downloading code should see the files already exist and skip trying to download them.

JohnGiorgi commented 2 years ago

Gotcha, this gets me past the first Google Drive-related hurdle, thanks! Unfortunately, the GDrive requests then start getting blocked:

AssertionError: pnbert_out_lstm_pn_rl has unequal lines in its src, ref, and out files, likely because Google Drive began denying requests. Delete the bad files and rerun.

If I manually inspect the files, I can see that is the case:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>
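
To find the bad files that the assertion message says to delete, a rough sketch like the following could list every file under a directory whose contents start with HTML. The function name find_html_error_files is hypothetical, and the sketch assumes none of the legitimate data files are HTML themselves, so the list should be reviewed before deleting anything:

```python
from pathlib import Path

# Byte prefixes seen at the start of the two error pages quoted in this thread.
HTML_MARKERS = (b"<!doctype html", b"<html")


def find_html_error_files(root):
    """Return the files under `root` whose first bytes look like an HTML page."""
    bad = []
    for p in sorted(Path(root).rglob("*")):
        if not p.is_file():
            continue
        with p.open("rb") as f:
            head = f.read(256).lstrip().lower()
        if head.startswith(HTML_MARKERS):
            bad.append(p)
    return bad


if __name__ == "__main__":
    # Print candidates only; deletion is left to the user.
    for path in find_html_error_files("temp"):
        print(path)
```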

Any idea how long to wait between runs before GDrive allows these requests to go through again?

danieldeutsch commented 2 years ago

Hopefully you were able to fix this 😬. Downloading files from Google Drive has always been a pain point, and I don't know of any way to download them more reliably. One day I will get around to fixing it...
