castorini / hedwig

PyTorch deep learning models for document classification
Apache License 2.0

Alternate hosting for hedwig-data #61

Closed · achyudh closed this 4 years ago

achyudh commented 4 years ago

The dataset repo https://git.uwaterloo.ca/jimmylin/hedwig-data isn't ideal: it requires the user to extract the embeddings and process them with a Python script. Do we have any alternate way to host ~5 GB of data that would make it easier for others to replicate our results out of the box?

lintool commented 4 years ago

I can check the data directly into that repo for you. I think ~5 GB is fine... Point me to what you want checked in.

achyudh commented 4 years ago

I just added the pre-trained BERT weights to that repo, but it's now pretty slow. For instance, running git status takes a minute.

lintool commented 4 years ago

Does this use hgf (Hugging Face)? Why not do the same as here: https://huggingface.co/castorini
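
For reference, loading weights hosted on the Hugging Face hub is a one-liner in transformers. A minimal sketch, where castorini/example-bert stands in as a hypothetical model id:

from transformers import BertModel, BertTokenizer

# Hypothetical model id; any model pushed under https://huggingface.co/castorini
# would resolve the same way.
tokenizer = BertTokenizer.from_pretrained("castorini/example-bert")
model = BertModel.from_pretrained("castorini/example-bert")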

achyudh commented 4 years ago

It does use hgf. I guess this is something that's in the pipeline (https://github.com/castorini/hedwig/issues/56), but I was looking for something more immediate.

lintool commented 4 years ago

Now that you've checked it in, wget-ing from https://git.uwaterloo.ca/jimmylin/hedwig-data shouldn't be too bad...

achyudh commented 4 years ago

Right now I am looking to eliminate these extra steps:

cd hedwig-data/embeddings/word2vec
gzip -d GoogleNews-vectors-negative300.bin.gz
python bin2txt.py GoogleNews-vectors-negative300.bin GoogleNews-vectors-negative300.txt
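
For context, the conversion step presumably amounts to a round-trip through gensim. A minimal sketch of what bin2txt.py likely does (assuming it uses gensim's KeyedVectors; this is not the actual script):

import sys
from gensim.models import KeyedVectors

# Load the binary word2vec embeddings and re-save them as plain text.
bin_path, txt_path = sys.argv[1], sys.argv[2]
vectors = KeyedVectors.load_word2vec_format(bin_path, binary=True)
vectors.save_word2vec_format(txt_path, binary=False)
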
achyudh commented 4 years ago

I was trying to get hedwig running and then realized that I need gensim to run bin2txt.py, a dependency that's not included in requirements.txt since hedwig itself doesn't use gensim. It would be better if we removed this step altogether.
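
One way to remove the step would be for hedwig to read the word2vec binary format directly, with no gensim dependency. A minimal sketch with numpy, assuming the standard word2vec binary layout (an ASCII header with the vocab size and dimension, then each word followed by its float32 vector):

import numpy as np

def load_word2vec_bin(path):
    # Parse the standard word2vec binary format without gensim.
    with open(path, "rb") as f:
        vocab_size, dim = map(int, f.readline().split())
        words, vectors = [], np.empty((vocab_size, dim), dtype=np.float32)
        for i in range(vocab_size):
            # The word is everything up to the separating space.
            chars = []
            ch = f.read(1)
            while ch != b" ":
                if ch != b"\n":  # skip entry-separating newlines, if present
                    chars.append(ch)
                ch = f.read(1)
            words.append(b"".join(chars).decode("utf-8", errors="ignore"))
            # The vector is dim little-endian float32 values.
            vectors[i] = np.frombuffer(f.read(4 * dim), dtype=np.float32)
    return words, vectors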

achyudh commented 4 years ago

Fixed in #62