Closed AngledLuffa closed 1 year ago
It did, however, work in a clean checkout of GUMReddit's main branch. Accordingly, I will just overwrite my local copy with the git version.
Hmm, could this be a permissions issue or something? I have no involvement in the production of the full UD 2.11 download, so that's maybe a question for @dan-zeman. But I can confirm that the script as found in the GitHub repo should and does work. If you figure out why it doesn't work from the full UD download and it's fixable via code that I control, then I'm happy to fix it, of course.
The file get_text.py in my working copy of the GitHub repository is identical with the copy in the release package, and has identical permissions (-rw-r--r--). However, the script seems to expect the existence of the not-to-release folder:
$ grep -n not-to-release get_text.py
491: files = glob("not-to-release" + os.sep + "sources" + os.sep + "*.conllu")
I hope it is not surprising that this folder is not present in the release package :-)
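For anyone hitting the same thing: glob does not raise when the directory in its pattern is missing, it just returns an empty list, so a missing folder only shows up later in the script. A minimal sketch of a check, assuming the same pattern as the grep hit above (find_sources and its base parameter are hypothetical names, not part of get_text.py):

```python
import os
from glob import glob


def find_sources(base="."):
    """Return the per-document .conllu files that the glob pattern looks for."""
    # Same relative pattern as in the grep output, anchored at `base`.
    pattern = os.path.join(base, "not-to-release", "sources", "*.conllu")
    return glob(pattern)


if __name__ == "__main__":
    files = find_sources()
    if not files:
        # glob returns [] both for a missing folder and for an empty one.
        print("No source documents found under not-to-release" + os.sep + "sources")
    else:
        print("Found", len(files), "source documents")
```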
I see... yes, that makes sense. Well, naturally users of the GH repo would expect those individual document files to also get reconstructed, right?
What I can do is simply check whether the folder exists and generate it if it doesn't, then split the big files into the documents. The script first reconstructs the documents before reconstituting the train/dev/test sets, so it does need those files to exist. I can take care of this.
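A rough sketch of what such a fallback could look like, not the actual change made to get_text.py. It assumes that the combined files follow the usual UD naming (*-ud-train.conllu and so on) and that each document starts with a "# newdoc id = ..." comment line; the function names ensure_sources and split_into_documents are made up for illustration.

```python
import os
import re
from glob import glob


def split_into_documents(conllu_text):
    """Split CoNLL-U text into {doc_id: text}, keyed by '# newdoc id = ...' lines."""
    docs = {}
    current = None
    for line in conllu_text.splitlines():
        match = re.match(r"# newdoc id = (\S+)", line)
        if match:
            current = match.group(1)
            docs[current] = []
        if current is not None:
            docs[current].append(line)
    # Each document ends with one blank line, as CoNLL-U sentences require.
    return {doc_id: "\n".join(lines).strip() + "\n\n" for doc_id, lines in docs.items()}


def ensure_sources(release_dir="."):
    """Create not-to-release/sources from the combined files if the folder is missing."""
    sources = os.path.join(release_dir, "not-to-release", "sources")
    if os.path.isdir(sources):
        return sources  # folder already exists, leave it untouched
    os.makedirs(sources)
    # Assumed naming: the train/dev/test files match *-ud-*.conllu.
    for path in sorted(glob(os.path.join(release_dir, "*-ud-*.conllu"))):
        with open(path, encoding="utf8") as f:
            text = f.read()
        for doc_id, doc_text in split_into_documents(text).items():
            out_path = os.path.join(sources, doc_id + ".conllu")
            with open(out_path, "w", encoding="utf8", newline="\n") as out:
                out.write(doc_text)
    return sources
```

Calling ensure_sources() before the glob at line 491 would mean the later reconstruction steps find the per-document files they expect, whether the script runs from the repo or from the release package.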
Thanks for digging into this!
OK, should be done in dev, will be merged automatically to master on next release.
On a clean download of UD 2.11, I went to the UD_English-GUMReddit subdirectory and ran get_text.py. I encountered the following: