EmilHvitfeldt / textdata

Download, parse, store, and load text datasets instead of storing them in packages
https://emilhvitfeldt.github.io/textdata/

Add Stanford GloVe Embeddings Datasets #26

Closed jonthegeek closed 5 years ago

jonthegeek commented 5 years ago

I'd like to add the GloVe pre-trained word vectors, for use in https://github.com/tidymodels/textrecipes/issues/20

The datasets are available here: https://nlp.stanford.edu/projects/glove/

There are 4 downloads, which break down like this:

- Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased; 50d, 100d, 200d, & 300d vectors)
- Common Crawl (42B tokens, 1.9M vocab, uncased; 300d vectors)
- Common Crawl (840B tokens, 2.2M vocab, cased; 300d vectors)
- Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased; 25d, 50d, 100d, & 200d vectors)

The first one is all I directly need right now, but it seems worthwhile to work out a standard for all of them while I'm at it.

I don't want to make the functions too complicated to understand, but it feels like this should be one set of textdata functions (download_glove, process_glove, dataset_glove), with arguments for the specifics (something like dataset_glove({normal stuff plus}, token_set, dimensions)).
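To make the idea concrete, the single-entry-point interface could be sketched like this. Everything here is a hypothetical illustration, not textdata's actual API: the helper name glove_file_name() and the token_set/dimensions argument names are assumptions, though the file names and available dimensions follow the Stanford download naming (e.g. glove.6B.100d.txt).

```r
# Hypothetical sketch: map (token_set, dimensions) to the file name
# inside the corresponding Stanford zip, validating the combination.
glove_file_name <- function(token_set = "6B", dimensions = 100) {
  # Dimensions actually shipped for each download on the GloVe page.
  valid <- list(
    "6B"          = c(50, 100, 200, 300),
    "42B"         = 300,
    "840B"        = 300,
    "twitter.27B" = c(25, 50, 100, 200)
  )
  if (!token_set %in% names(valid)) {
    stop("Unknown token set: ", token_set)
  }
  if (!dimensions %in% valid[[token_set]]) {
    stop(token_set, " vectors are only available in dimensions: ",
         paste(valid[[token_set]], collapse = ", "))
  }
  paste0("glove.", token_set, ".", dimensions, "d.txt")
}

glove_file_name("6B", 100)          # "glove.6B.100d.txt"
glove_file_name("twitter.27B", 25)  # "glove.twitter.27B.25d.txt"
```

The dataset_ function would then only need to dispatch on this mapping rather than duplicating logic per download.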

Let me know what you think and I can knock this out (I'm doing it anyway for personal/work use, so formalizing it won't be a lot of extra work).

EmilHvitfeldt commented 5 years ago

This sounds good.

It looks like each download comes with everything zipped, so I would create 4 user-facing functions. Let's prefix them with embedding_, so we get embedding_glove6b(), embedding_glove42b(), etc.
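The four-wrapper pattern might look roughly like this. This is a sketch only: the shared helper glove_path() is a stub standing in for the real download/process/load steps so the delegation is runnable, and the wrapper signatures are assumptions rather than textdata's actual internals.

```r
# Stub standing in for downloading, unzipping, and parsing a GloVe
# file; here it just returns the target file name.
glove_path <- function(token_set, dimensions) {
  paste0("glove.", token_set, ".", dimensions, "d.txt")
}

# One user-facing function per Stanford download; only the 6B and
# Twitter 27B zips contain multiple dimensionalities.
embedding_glove6b   <- function(dimensions = 100) glove_path("6B", dimensions)
embedding_glove42b  <- function() glove_path("42B", 300)
embedding_glove840b <- function() glove_path("840B", 300)
embedding_glove27b  <- function(dimensions = 25) glove_path("twitter.27B", dimensions)

embedding_glove6b(dimensions = 300)  # "glove.6B.300d.txt"
```

Splitting by download keeps each function's arguments minimal: the single-dimension downloads need no dimensions argument at all.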

I did a little writeup of what should be done to make a new step work: https://emilhvitfeldt.github.io/textdata/articles/How-to-add-a-data-set.html

If you need examples of how this procedure works look at this commit https://github.com/EmilHvitfeldt/textdata/commit/7ce4e422f44d90d681860ad0841b385a990e9628.

Please feel free to ping me if you have any questions or problems.

jonthegeek commented 5 years ago

OK, that sounds good. The downloads will be separate, but I'll add a parameter to the dataset_ function to load the appropriate sub-dataset (for 6b and 27b). I should have a PR for this within the next couple of hours, depending on what other distractions come up.
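For the parse step, reading one of the unzipped GloVe text files into a data frame could be sketched as below. The function name parse_glove() and the token/d1..dN column names are assumptions for illustration; textdata's actual schema may differ. Each line of a GloVe file is a token followed by d space-separated numbers.

```r
# Hedged sketch: parse a GloVe .txt file into a data frame with one
# character column ("token") and `dimensions` numeric columns.
parse_glove <- function(path, dimensions) {
  utils::read.table(
    path,
    sep = " ",
    quote = "",          # tokens may contain quote characters
    comment.char = "",   # tokens may contain "#"
    col.names = c("token", paste0("d", seq_len(dimensions))),
    colClasses = c("character", rep("numeric", dimensions))
  )
}
```

Disabling quoting and comment handling matters here: GloVe vocabularies include tokens like `"` and `#`, which would otherwise silently corrupt the parse.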