blei-lab / causal-text-embeddings

Software and data for "Using Text Embeddings for Causal Inference"
MIT License

Get numpy arrays out of .tf_record #1

Closed · kakeith closed this issue 5 years ago

kakeith commented 5 years ago

I've been trying (unsuccessfully) for the last hour or so to transform your dat/PeerRead/proc/arxiv-all.tf_record into numpy arrays that one can actually read.

(1) I don't think your README currently points to your array_from_dataset.py file, which I had to dig around for a while to find. (2) When I run python array_from_dataset.py from the src/PeerRead/dataset/ folder, I get errors like ModuleNotFoundError: No module named 'bert', which makes me think I'm running it from the wrong folder. Where should I be running this from, or what command should I use?

Would you mind adding some documentation or cleaning up the code to make it easier to transform .tf_record files into numpy arrays? Thanks!

dsridhar91 commented 5 years ago

Thanks for the feedback -- we'll update and improve the documentation to describe how to create arrays from tf_record objects. As for your question, you should run the script as a module from the src folder: python -m PeerRead.dataset.array_from_dataset
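
(Editor's note: for readers who just want to peek inside a .tf_record without running the full pipeline, here is a minimal, generic sketch -- not the repo's own code. It assumes TensorFlow 2.x with eager execution, whereas the repo targets TF 1.x, and it makes no assumption about the feature schema; it simply inspects whatever feature names the preprocessing wrote.)

```python
import numpy as np
import tensorflow as tf

# Path to the processed record shipped with the repo.
path = "dat/PeerRead/proc/arxiv-all.tf_record"

records = []
# TFRecordDataset yields serialized tf.train.Example protos (eager mode).
for raw in tf.data.TFRecordDataset(path):
    example = tf.train.Example.FromString(raw.numpy())
    parsed = {}
    # Each feature is a bytes_list, float_list, or int64_list; grab
    # whichever one is set and convert its values to a numpy array.
    for name, feature in example.features.feature.items():
        kind = feature.WhichOneof("kind")
        parsed[name] = np.array(list(getattr(feature, kind).value))
    records.append(parsed)

print(records[0].keys())  # see which feature names were actually stored
```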

vveitch commented 5 years ago

We've updated the readme to clarify this now.

Note that this script creates tsv files to use as input to our baselines; in particular, the output is already tokenized and cleaned. Depending on your use case (i.e., if you want to do different pre-processing), you might be better off downloading the original PeerRead data and adapting our data cleaning scripts.
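
(Editor's note: if you do go the tsv route, loading the output back into numpy is straightforward. A minimal sketch, assuming the script writes a tab-separated file with a header row -- the filename below is hypothetical, not the script's actual output name.)

```python
import pandas as pd

# Hypothetical filename; use whatever array_from_dataset.py actually writes.
df = pd.read_csv("arxiv-all.tsv", sep="\t")

print(df.columns.tolist())  # inspect the columns the script produced
arrays = {col: df[col].to_numpy() for col in df.columns}
```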

kakeith commented 5 years ago

Great! Thank you so much for the quick response time!