Closed: achyudh closed this issue 4 years ago
The dataset repo https://git.uwaterloo.ca/jimmylin/hedwig-data isn't ideal: it requires the user to extract the embeddings and process them with a Python script. Do we have any alternate ways to host ~5 GB of data that would make it easier for others to replicate our results out of the box?
I can check in data directly into that repo for you. I think ~5GB is fine... Point me to what you want checked in.
I just added the pre-trained BERT weights to that repo, but it's pretty slow. For instance, running git status takes a minute.
Does this use hgf? Why not do the same as here? https://huggingface.co/castorini
It does use hgf. I guess this is something that's in the pipeline (https://github.com/castorini/hedwig/issues/56), but I was looking for something more immediate.
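(For context, hosting the weights on the Hugging Face hub would collapse the setup to a single call. A minimal sketch using the transformers library; "bert-base-uncased" is a stand-in checkpoint here, since the thread doesn't name a specific castorini model:)

```python
# Sketch: pulling pretrained BERT weights from the Hugging Face hub.
# "bert-base-uncased" is a stand-in; a castorini-hosted checkpoint would be
# fetched the same way by substituting its hub name.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
```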
Now that you've checked it in, wget-ing from https://git.uwaterloo.ca/jimmylin/hedwig-data shouldn't be too bad...
Right now I am looking to eliminate these extra steps:
cd hedwig-data/embeddings/word2vec
gzip -d GoogleNews-vectors-negative300.bin.gz
python bin2txt.py GoogleNews-vectors-negative300.bin GoogleNews-vectors-negative300.txt
I was trying to get hedwig running and then realized that I need gensim to run bin2txt.py, a dependency that's not included in requirements.txt since hedwig itself doesn't use gensim. It would be better if we removed this step altogether.
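(For context, a bin2txt-style conversion presumably amounts to a gensim round-trip like the one below. This is a sketch of the idea, not the actual bin2txt.py script:)

```python
# Sketch: converting binary word2vec embeddings to text format with gensim.
# This approximates what a bin2txt-style script does; it is not the actual
# bin2txt.py shipped with hedwig-data.
from gensim.models import KeyedVectors

# Load the binary-format Google News vectors.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
# Write them back out in plain-text word2vec format.
vectors.save_word2vec_format(
    "GoogleNews-vectors-negative300.txt", binary=False
)
```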
Fixed in #62