UniversalDependencies / UD_English-GUMReddit


New pieces of GUMReddit unavailable by proxy? #3

Closed AngledLuffa closed 1 year ago

AngledLuffa commented 1 year ago

On a clean download of UD 2.11, I went to the UD_English-GUMReddit subdirectory and ran get_text.py. I encountered the following:

[john@localhost UD_English-GUMReddit]$ python3 get_text.py
Missing praw credentials detected! You cannot download reddit data using praw.
Can't find Google BigQuery json key file. You cannot download reddit data using bigquery
Missing access to bigquery and/or praw.
Do you want to try downloading reddit data from an available server?
Confirm: you are solely responsible for downloading reddit data and may only use it for non-commercial purposes:
[Y]es/[N]o> Y
Retrieving reddit data by proxy...
ERR: Missing document GUM_reddit_macroeconomics
ERR: Missing document GUM_reddit_pandas
ERR: Missing document GUM_reddit_escape
ERR: Missing document GUM_reddit_monsters
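(A rough sketch of the fallback order this log implies, for readers following along. The credential file names and the helper below are hypothetical; this is not the actual logic of get_text.py.)

import os

def choose_reddit_source():
    # Hypothetical credential locations; the real script may look elsewhere
    have_praw = os.path.exists("praw.ini")
    have_bigquery = os.path.exists("key.json")
    if have_praw:
        return "praw"
    if have_bigquery:
        return "bigquery"
    # Neither credential found: offer the proxy fallback seen in the log
    print("Missing access to bigquery and/or praw.")
    print("Do you want to try downloading reddit data from an available server?")
    answer = input("[Y]es/[N]o> ").strip().lower()
    if answer.startswith("y"):
        return "proxy"
    raise SystemExit("No way to obtain reddit data; aborting.")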
AngledLuffa commented 1 year ago

It did, however, work in a clean checkout of GUMReddit's main branch. Accordingly, I will just overwrite my local copy with the git version.

amir-zeldes commented 1 year ago

Hmm, could this be a permissions issue or something? I have no involvement in the production of the full UD 2.11 download; that's perhaps a question for @dan-zeman. But I can confirm that the script as found in the GitHub repo should and does work. If you figure out why it doesn't work from the full UD download and it's fixable via code that I control, then I'm happy to fix it, of course.

dan-zeman commented 1 year ago

The file get_text.py in my working copy of the GitHub repository is identical to the copy in the release package, and has identical permissions (-rw-r--r--). However, the script seems to expect the not-to-release folder to exist:

grep -n not-to-release get_text.py
491:    files = glob("not-to-release" + os.sep + "sources" + os.sep + "*.conllu")

I hope it is not surprising that this folder is not present in the release package :-)
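(That would also explain why the script prints an ERR line per document instead of crashing: glob() on a pattern under a nonexistent folder silently returns an empty list, so every expected document appears to be missing. A quick illustration using the same pattern as line 491:)

import os
from glob import glob

# In the release package, where not-to-release/ is absent, this is simply [];
# in a git checkout it lists the per-document source files.
files = glob("not-to-release" + os.sep + "sources" + os.sep + "*.conllu")
print(files)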

amir-zeldes commented 1 year ago

I see... yes, that makes sense. Still, users of the full UD release would naturally expect those individual document files to get reconstructed as well, right?

What I can do is simply check whether the folder exists and, if it doesn't, generate it by splitting the big files back into the individual documents. The script first reconstructs the documents and only then reconstitutes the train/dev/test sets, so it does need those per-document files to exist. I can take care of this.
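(A minimal sketch of that check-and-split step, assuming the released train/dev/test .conllu files sit next to the script and that documents are delimited by "# newdoc id = ..." comments, as is standard in CoNLL-U. The function names are hypothetical, not the actual fix:)

import os
from glob import glob

def ensure_sources(src_dir=os.path.join("not-to-release", "sources")):
    # Git checkout: the per-document files already exist, nothing to do
    if glob(os.path.join(src_dir, "*.conllu")):
        return
    # Release package: recreate the folder and split the big files
    os.makedirs(src_dir, exist_ok=True)
    for big_file in glob("*.conllu"):
        doc_name, doc_lines = None, []
        with open(big_file, encoding="utf-8") as f:
            for line in f:
                if line.startswith("# newdoc id = "):
                    if doc_name is not None:
                        write_doc(src_dir, doc_name, doc_lines)
                    doc_name = line.split("= ", 1)[1].strip()
                    doc_lines = []
                doc_lines.append(line)
        if doc_name is not None:
            write_doc(src_dir, doc_name, doc_lines)

def write_doc(src_dir, doc_name, doc_lines):
    out_path = os.path.join(src_dir, doc_name + ".conllu")
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(doc_lines)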

AngledLuffa commented 1 year ago

Thanks for digging into this!

amir-zeldes commented 1 year ago

OK, this should be done in dev and will be merged automatically to master on the next release.