kakeith closed this issue 5 years ago
Thanks for the feedback -- we'll update/improve the documentation to describe how to create arrays from tf_record objects. As for your question, you should call the script from the src folder directly as:
python -m PeerRead.dataset.array_from_dataset
We've updated the readme to clarify this now.
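For anyone wondering why the invocation matters, here is a minimal sketch that reproduces the difference in a throwaway directory. It assumes the layout implied in this thread (a top-level `bert` package sitting next to `PeerRead` inside `src`); the files it creates are empty stand-ins, not the real code. When a file is run directly, Python puts that file's directory first on `sys.path`, whereas `python -m` from `src` puts `src` itself there.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def build_demo_tree(root: Path) -> Path:
    """Create a throwaway src/ tree mimicking the layout assumed above."""
    src = root / "src"
    dataset_dir = src / "PeerRead" / "dataset"
    dataset_dir.mkdir(parents=True)
    (src / "bert").mkdir()
    # Empty stand-in for the top-level package that the script imports.
    (src / "bert" / "__init__.py").write_text("")
    (src / "PeerRead" / "__init__.py").write_text("")
    (dataset_dir / "__init__.py").write_text("")
    # Stand-in script: it only performs the import that failed in the issue.
    (dataset_dir / "array_from_dataset.py").write_text(
        "import bert\nprint('import ok')\n"
    )
    return src


def run_python(args, cwd: Path) -> subprocess.CompletedProcess:
    """Run the current interpreter without inherited import-path overrides."""
    env = {
        key: value
        for key, value in os.environ.items()
        if key not in ("PYTHONPATH", "PYTHONSAFEPATH")
    }
    return subprocess.run(
        [sys.executable, *args], cwd=cwd, env=env, capture_output=True, text=True
    )


def compare_invocations():
    """Return the results of (direct script run, module run from src)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = build_demo_tree(Path(tmp))
        # Direct run: sys.path starts with src/PeerRead/dataset, so `bert` is not found.
        direct = run_python(
            ["array_from_dataset.py"], cwd=src / "PeerRead" / "dataset"
        )
        # Module run from src: sys.path starts with src, so `bert` resolves.
        as_module = run_python(
            ["-m", "PeerRead.dataset.array_from_dataset"], cwd=src
        )
        return direct, as_module
```

Calling `compare_invocations()` and inspecting `direct.stderr` and `as_module.stdout` shows the two behaviours side by side, provided no other package named `bert` is installed in the environment.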
Note that this file creates tsv files to use as input to our baselines. In particular, the output is already tokenized and cleaned. Depending on your use case (i.e., if you want to do different pre-processing) you might be better off downloading the original PeerRead data and adapting our data cleaning scripts.
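As a rough sketch of how the resulting tsv files might be loaded, the snippet below reads a tab-separated file into per-column lists using only the standard library. It assumes the file has a header row and default quoting. The column names in the sample (`paper_id`, `text`, `label`) are hypothetical, since the actual columns are not listed in this thread.

```python
import csv
import io


def read_tsv_columns(handle):
    """Read a tab-separated stream with a header row into a dict of column lists."""
    reader = csv.DictReader(handle, delimiter="\t")
    # fieldnames is None for an empty stream, so fall back to no columns.
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


# Hypothetical sample data standing in for one of the generated tsv files.
sample = (
    "paper_id\ttext\tlabel\n"
    "1\tsome tokenized text\t0\n"
    "2\tmore tokenized text\t1\n"
)
columns = read_tsv_columns(io.StringIO(sample))
# Every value is read as a string; convert numeric columns explicitly.
labels = [int(value) for value in columns["label"]]
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample. A list such as `labels` can then be wrapped with `numpy.asarray` if arrays are needed.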
Great! Thank you so much for the quick response time!
I've been trying (unsuccessfully) for the last hour or so to transform your
dat/PeerRead/proc/arxiv-all.tf_record
into numpy arrays that one can actually read.
(1) I don't think your README currently points to your
array_from_dataset.py
file, which I had to dig around for a while to find.
(2) When I run
python array_from_dataset.py
from the
src/PeerRead/dataset/
folder, I get errors like
ModuleNotFoundError: No module named 'bert'
which makes me think I'm running it from the wrong folder. Where should I be running this from, or what command should I use?
Would you mind adding some documentation or cleaning up the code to make it easier to transform .tf_record files into numpy arrays? Thanks!
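For readers who only want to peek inside a .tf_record file, here is a dependency-free sketch of the TFRecord framing. It is not the repository's own method. It assumes the file is uncompressed and skips checksum verification. Each record is stored as a little-endian uint64 length, a 4-byte checksum of the length, the payload, and a 4-byte checksum of the payload. `frame_for_demo` is a hypothetical helper that exists only to exercise the reader.

```python
import io
import struct


def iter_tfrecord_payloads(handle):
    """Yield the raw payload bytes of each record in an uncompressed TFRecord stream."""
    while True:
        header = handle.read(12)  # 8-byte length + 4-byte checksum of the length
        if not header:
            return  # clean end of stream
        if len(header) < 12:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = handle.read(length)
        footer = handle.read(4)  # 4-byte checksum of the payload (not verified here)
        if len(payload) < length or len(footer) < 4:
            raise ValueError("truncated record body")
        yield payload


def frame_for_demo(payload: bytes) -> bytes:
    """Frame a payload with zeroed checksum fields, purely to test the reader.

    TensorFlow would reject these records because the checksums are not valid.
    """
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


stream = io.BytesIO(frame_for_demo(b"first") + frame_for_demo(b"second"))
payloads = list(iter_tfrecord_payloads(stream))
```

In a real file each payload is typically a serialized `tf.train.Example`. With TensorFlow installed it can be decoded via `tf.train.Example.FromString(payload)`, after which the feature names can be inspected. Which features this particular dataset stores is not stated in the thread.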