Closed: mnishant2 closed this issue 5 years ago
Hello @mnishant2 - thanks for posting these questions. To answer each in turn:
Data loading: The current version does indeed load the entire dataset into memory, which is impractical if you have a lot of data, as is common for text classification tasks. We are working on a solution based on the PyTorch DataLoader in this branch. I will push some changes soon that enable streaming data loading so the whole dataset is not read into memory. It already works locally, but I want to do more testing before merging into master.
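For illustration, streaming loading can be sketched as a small lazy dataset that indexes line offsets once and then reads examples from disk on demand (a minimal, self-contained sketch, not the code in that branch; the actual solution builds on the PyTorch DataLoader, and the class below is hypothetical):

```python
import os
import tempfile

class LazyLineDataset:
    """Index the byte offset of each line once, then read single
    examples on demand instead of loading the whole file into memory."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # seek directly to the requested example; only one line is read
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").rstrip("\n")

# usage: write a tiny corpus and fetch individual examples lazily
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(path, "w") as f:
    f.write("first sentence\nsecond sentence\nthird sentence\n")

dataset = LazyLineDataset(path)
print(len(dataset))  # 3
print(dataset[1])    # second sentence
```

A class with `__len__` and `__getitem__` like this can be wrapped directly by a PyTorch `DataLoader`, which then streams batches without ever materializing the full corpus.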
Large cache files: Unfortunately, if the dataset is very large, the cached file will grow very large as well. This is because each word gets a contextualized embedding, so each word in each context needs to be materialized. Worse, vectors are difficult to compress, so we do not currently know what we can do here. For very large datasets we therefore do not cache embeddings at all, but generate them on the fly at each epoch as needed and then discard them.
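A rough back-of-the-envelope estimate shows why such caches get so large (the corpus size and embedding dimensionality below are illustrative assumptions, not flair's actual numbers):

```python
# Every token in every sentence gets its own contextualized vector,
# so the cache scales with total tokens, not vocabulary size.
num_tokens = 50_000_000   # total tokens across the corpus (assumed)
embedding_dim = 2048      # e.g. stacked forward+backward embeddings (assumed)
bytes_per_float = 4       # float32

cache_bytes = num_tokens * embedding_dim * bytes_per_float
print(f"{cache_bytes / 1e9:.0f} GB")  # 410 GB
```

With a static word embedding you would only store one vector per vocabulary entry; with contextualized embeddings the same word stored in a thousand contexts costs a thousand vectors, which is why the cache dwarfs the original text.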
Growing cache files at each epoch: This should not happen and sounds like a bug. Once the embeddings are computed that should be it and it should just retrieve them from the files.
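The expected behavior can be sketched with a minimal idempotent cache (an illustrative sketch, not flair's actual cache schema): once a key has been written, later epochs only read it, so the cache stops growing after the first pass:

```python
import sqlite3

# In-memory stand-in for an on-disk embedding cache, keyed on the
# sentence text. The embedding itself is a dummy byte string here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, embedding BLOB)")

def get_or_compute(key):
    row = conn.execute(
        "SELECT embedding FROM cache WHERE key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]                # cache hit: nothing is written
    emb = bytes([len(key)] * 4)      # stand-in for a real embedding
    conn.execute("INSERT INTO cache VALUES (?, ?)", (key, emb))
    return emb

corpus = ["first sentence", "second sentence"]
for epoch in range(3):               # three passes over the data
    for sentence in corpus:
        get_or_compute(sentence)

count = conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
print(count)  # 2 -- one row per unique sentence, regardless of epochs
```

If the row count (and file size) keeps increasing across epochs, something other than a pure cache lookup is writing on every pass.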
@alanakbik Thank you so much for the prompt explanation. Yes, the cache files are not growing after the first pass over the dataset; it was a bug on my side. Regarding point 2: if we use the mentioned on-the-fly approach of generating and discarding the embeddings at each epoch, would that not overload the RAM for a large dataset? Is there a caching mechanism for other embeddings, say ELMo, should we decide to fine-tune it?
You can combine `use_cache=False` (i.e. not caching) and `embeddings_in_memory=False`, which means embeddings are not kept in memory, so it will not overload the RAM. However, this comes at the cost that embeddings need to be generated on-the-fly at each epoch, meaning increased GPU cost. That is how we currently do it for large datasets.
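Generically, that on-the-fly pattern looks like the sketch below (`embed_batch` is a hypothetical stand-in for a real embedding call such as FlairEmbeddings on the GPU; the point is that at most one batch of embeddings is ever held in memory):

```python
def embed_batch(sentences, dim=4):
    # hypothetical stand-in for running an embedding model on a batch
    return {s: [float(len(s))] * dim for s in sentences}

def train_epoch(corpus, batch_size=2):
    peak_held = 0
    for i in range(0, len(corpus), batch_size):
        batch = corpus[i:i + batch_size]
        embeddings = embed_batch(batch)           # generated on the fly
        peak_held = max(peak_held, len(embeddings))
        # ... forward/backward pass would use `embeddings` here ...
        del embeddings                            # discarded, not cached
    return peak_held

corpus = ["a", "bb", "ccc", "dddd", "eeeee"]
print(train_epoch(corpus))  # 2 -- at most one batch resident at a time
```

RAM usage is bounded by the batch size rather than the corpus size, at the cost of recomputing every embedding in every epoch.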
We have not yet implemented a caching mechanism for ELMo or BERT. I still think the caching for FlairEmbeddings could be improved, but once we have found a good solution there, we could expand this feature to the other embedding classes.
Closing for now, but feel free to reopen if you have more questions.
I am trying to train a sequence tagger using Flair embeddings. My dataset is comparatively big, and my 32 GB of RAM was not proving to be enough to store the data and embeddings, so I worked around it by setting `use_cache=True` and `embeddings_in_memory=False`. I checked the size of the cached embedding SQL file: it was huge (~140 GB) and increasing with every iteration. My question is: why does the size increase with each iteration? Should the embeddings not remain the same size and just be updated after each iteration? Could you explain how this works, as well as the data loading? It seems to me that all the data is loaded at once. Also, what happens to the embeddings once training is finished; will they be saved with the model file?