This sounds good.
It looks like each download comes with everything zipped, so I would create 4 user-facing functions. Let's prefix them with embedding_, so we get embedding_glove6b(), embedding_glove42b(), etc.
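A minimal sketch of what that user-facing surface could look like. The embedding_ names beyond glove6b and glove42b, and the glove_download_and_read() helper, are assumptions for illustration, not a final design:

```r
# Sketch only: one wrapper per GloVe download, all sharing the
# embedding_ prefix. glove_download_and_read() is a hypothetical
# helper standing in for the download/unzip/read plumbing.
embedding_glove6b <- function(dir = NULL, dimensions = 100) {
  # the 6B zip ships 50d/100d/200d/300d files, so pick one here
  glove_download_and_read(dir, name = "glove.6B", dimensions = dimensions)
}

embedding_glove42b <- function(dir = NULL) {
  # the 42B download only comes in 300 dimensions
  glove_download_and_read(dir, name = "glove.42B.300d", dimensions = 300)
}
```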
I did a little writeup of what should be done to make a new step work: https://emilhvitfeldt.github.io/textdata/articles/How-to-add-a-data-set.html
If you need an example of how this procedure works, look at this commit: https://github.com/EmilHvitfeldt/textdata/commit/7ce4e422f44d90d681860ad0841b385a990e9628.
Please feel free to ping me if you have any questions or problems.
Ok, that sounds good. The downloads will be separate, but then I'll put a parameter in the dataset_ function to just load the appropriate sub-dataset (for 6b and 27b). I should have a PR for this within the next couple hours, depending on what other distractions come up.
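A sketch of how that parameter could select the sub-dataset inside an already-unzipped download. read_glove_file() is a hypothetical helper; the valid dimension sets come from the GloVe download page:

```r
# Sketch: map a download name + dimensions to the file to read.
# read_glove_file() is hypothetical; file names follow the
# glove.<name>.<d>d.txt convention used inside the zips.
read_glove_file <- function(dir, name = "glove.6B", dimensions = 100) {
  valid <- switch(name,
    "glove.6B"          = c(50, 100, 200, 300),
    "glove.twitter.27B" = c(25, 50, 100, 200),
    stop("only the 6B and 27B downloads contain multiple sub-datasets")
  )
  if (!dimensions %in% valid) {
    stop("`dimensions` must be one of: ", paste(valid, collapse = ", "))
  }
  file.path(dir, paste0(name, ".", dimensions, "d.txt"))
}
```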
I'd like to add the GloVe pre-trained word vectors, for use in https://github.com/tidymodels/textrecipes/issues/20
The datasets are available here: https://nlp.stanford.edu/projects/glove/
There are 4 downloads, which break down like this (as listed on that page):
- glove.6B: Wikipedia 2014 + Gigaword 5 (6B tokens; 50d, 100d, 200d, and 300d vectors)
- glove.42B.300d: Common Crawl (42B tokens; 300d vectors)
- glove.840B.300d: Common Crawl (840B tokens; 300d vectors)
- glove.twitter.27B: Twitter (27B tokens; 25d, 50d, 100d, and 200d vectors)
The first one is all I'm directly in need of right now, but it feels worthwhile to work out a standard for all of them while I'm at it.
I don't want to make the functions too complicated to understand, but it feels like maybe it should be one set of textdata functions (download_glove, process_glove, dataset_glove), with arguments for the specifics, something like dataset_glove({normal stuff plus}, token_set, dimensions); see the sketch below. Let me know what you think and I can knock this out (I'm doing it anyway for personal/work use, so formalizing it won't be a lot of extra work).
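For concreteness, a hedged sketch of that signature. The token_set values and the way download_glove()/process_glove() are wired together here are assumptions about how the pieces could fit, not a settled design:

```r
# Sketch of the single parameterized interface proposed above.
# `token_set` picks one of the four downloads, `dimensions` picks the
# sub-dataset within it; `dir` stands in for the "normal stuff".
dataset_glove <- function(dir = NULL,
                          token_set = c("6b", "42b", "840b", "twitter27b"),
                          dimensions = 100) {
  token_set <- match.arg(token_set)
  download_glove(dir, token_set)             # fetch + unzip if needed
  process_glove(dir, token_set, dimensions)  # read the chosen file
}
```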