flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Explanation of embedding_in_memory and embeddings sizes #720

Closed · mnishant2 closed this issue 5 years ago

mnishant2 commented 5 years ago

I am trying to train a sequence tagger using Flair embeddings. My dataset is comparatively large, and 32 GB of RAM was not enough to hold the data and embeddings, so I worked around it by setting use_cache=True and embeddings_in_memory=False. I checked the size of the cached embedding SQL database and it was huge (~140 GB) and still growing with every iteration. My questions: why does the cache grow with each iteration? Shouldn't the embeddings stay the same size and simply be updated after each iteration? Could you also explain how the caching works, along with the data loading? It seems to me that all the data is loaded at once. Finally, what happens to the embedding cache once training is finished; will it be saved with the model file?
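For reference, here is a minimal sketch of the setup described above, assuming the flair 0.4-era API from the time of this issue (FlairEmbeddings taking a use_cache flag and ModelTrainer.train() taking embeddings_in_memory). The corpus path, column format, and tag type are placeholders, and these parameter names may differ in later flair releases:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Placeholder corpus in CoNLL-style column format (not from the issue).
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# use_cache=True writes computed embeddings to an on-disk SQLite cache
# so they are not recomputed on later passes over the data.
embeddings = StackedEmbeddings([
    FlairEmbeddings("news-forward", use_cache=True),
    FlairEmbeddings("news-backward", use_cache=True),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
# embeddings_in_memory=False avoids keeping every embedding in RAM
# during training, at the cost of reading them back from the cache.
trainer.train(
    "resources/taggers/example-ner",
    max_epochs=10,
    embeddings_in_memory=False,
)
```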

alanakbik commented 5 years ago

Hello @mnishant2 - thanks for posting these questions. To answer each in turn:

mnishant2 commented 5 years ago

@alanakbik Thank you so much for the prompt explanation. You are right, the cache files are not growing after the first pass over the dataset; it was a bug on my side. Regarding point 2: if we use the on-the-fly approach of generating and discarding the embeddings at each epoch, would that not overload the RAM for a large dataset? Is there a caching mechanism for other embeddings, say ELMo, should we decide to fine-tune it?

alanakbik commented 5 years ago

You can combine use_cache=False (i.e. no caching) with embeddings_in_memory=False, which means embeddings are not kept in memory, so it will not overload the RAM. However, this comes at the cost that embeddings need to be generated on the fly at each epoch, meaning increased GPU cost. That is how we currently handle large datasets.
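As a rough sketch of that combination, reusing the placeholder setup from the sketch in the first comment above (again, these parameter names come from the flair version discussed in this issue and may have been renamed since):

```python
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.trainers import ModelTrainer

# No on-disk cache (use_cache=False) and no in-memory storage
# (embeddings_in_memory=False): embeddings are recomputed on the
# GPU at every epoch, keeping RAM and disk usage low.
embeddings = StackedEmbeddings([
    FlairEmbeddings("news-forward", use_cache=False),
    FlairEmbeddings("news-backward", use_cache=False),
])

# tagger and corpus as constructed in the earlier sketch, with the
# tagger built on top of these non-cached embeddings.
trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "resources/taggers/example-ner-low-memory",
    max_epochs=10,
    embeddings_in_memory=False,
)
```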

We have not yet implemented a caching mechanism for ELMo or BERT. I still think the caching for FlairEmbeddings could be improved, but once we have found a good solution there, we could extend this feature to the other embedding classes.

alanakbik commented 5 years ago

Closing for now, but feel free to reopen if you have more questions.