Store and load pre-trained NLP models

m-milek commented 3 months ago

Figure out a way to conveniently store, load and share pre-trained NLP models.

Any file formats we can use?
Libraries to generate and load them?
Where to physically store the models? Separate GitHub repo?
How to make them easily shareable and downloadable for development purposes?

m-milek commented 2 months ago

So, how is it coming along?

Smixie commented 2 months ago

A friend recently told me about GitHub LFS. It would let us store large files (up to 5 GB, depending on the plan) in our repository. The configuration looks very easy.

Libraries: the articles I read mostly pointed to NLTK, spaCy, TextBlob, and Hugging Face Transformers, but I need to dig into them much more. What did you use during classes? That might be a good place to start.

Formats: I checked the file extensions on Hugging Face, and people mostly use h5, and sometimes json and bin.

What do you think about that?

m-milek commented 2 months ago

I think we used scikit-learn during classes. h5 is also something I came across as a first choice when doing some googling. Sounds good!

When it comes to file storage, I can see that GitHub LFS has pretty strict bandwidth limits (1 GiB) that we could easily exceed. I propose a solution:

Storage solution: Mega or Google Drive (whichever has a more convenient way of accessing public data through Python APIs)
Models stored under versioned directories.
Example: sentiment/v3/sentiment-v3.h5
The data has to be publicly accessible
A models.version file, in whatever format you want
A Python script to check for updates and to pull and push new models
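
To make the proposal a bit more concrete, here is a minimal sketch of the update-check part only. It assumes models.version is a JSON object mapping each model name to its latest integer version; that format, the function names, and the "sarcasm" model are all hypothetical, since nothing has been decided yet. The actual download from Mega or Google Drive is left out on purpose.

```python
import json
from pathlib import PurePosixPath


def model_path(name: str, version: int, ext: str = "h5") -> str:
    """Build the versioned storage path, e.g. sentiment/v3/sentiment-v3.h5."""
    return str(PurePosixPath(name) / f"v{version}" / f"{name}-v{version}.{ext}")


def parse_versions(text: str) -> dict[str, int]:
    """Parse the contents of a models.version file.

    Assumed format (not decided in this thread): JSON like {"sentiment": 3}.
    """
    data = json.loads(text)
    return {name: int(version) for name, version in data.items()}


def outdated_models(local: dict[str, int], remote: dict[str, int]) -> dict[str, str]:
    """Map each model that is missing or older locally to its remote path."""
    return {
        name: model_path(name, version)
        for name, version in remote.items()
        if local.get(name, 0) < version  # a missing model counts as version 0
    }


if __name__ == "__main__":
    # Hypothetical file contents; in practice these would be read from the
    # shared storage and from the local checkout respectively.
    remote = parse_versions('{"sentiment": 3, "sarcasm": 1}')
    local = parse_versions('{"sentiment": 2}')
    for name, path in outdated_models(local, remote).items():
        print(f"{name}: download {path}")
```

Keeping the comparison separate from the download code means the same check would work whichever storage backend we end up picking.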

What do you think?

RMoodsTeam / RMoods

Store and load pre-trained NLP models #69