RasaHQ / rasa


MemoryError with tensorflow_embedding on ~73k dataset with 38 intents #1621

Closed · HarshKhadloya closed this issue 4 years ago

HarshKhadloya commented 5 years ago

As mentioned in the title, I am feeding in ~73k lines of training data classified into 38 intents, and my final model will eventually use ~200k lines of messages. But even at 73k lines, I get a MemoryError. This doesn't seem to be a RAM issue, as I don't see my RAM getting fully used while the training code runs. Any input would be valuable. Details below:

Rasa NLU version: 0.13.8
Operating system: Windows Server 2016

Training the model as:

python -m rasa_nlu.train -c nlu_config.yml --data rasa_classification_train_set.md -o models --fixed_model_name nlu_classify_75k_38ctgy --project current --verbose

Content of model configuration file:

language: "en"

pipeline: "tensorflow_embedding"

Output / Issue:

2019-01-14 08:40:41 INFO     rasa_nlu.training_data.loading  - Training data format of rasa_classification_train_set.md is md
2019-01-14 08:40:43 INFO     rasa_nlu.training_data.training_data  - Training data stats:
        - intent examples: 73962 (38 distinct intents)
** removing entity names **
        - entity examples: 0 (0 distinct entities)
        - found entities:

2019-01-14 08:40:46 INFO     rasa_nlu.model  - Starting to train component tokenizer_whitespace
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Finished training component.
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Starting to train component ner_crf
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Finished training component.
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Starting to train component ner_synonyms
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Finished training component.
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Starting to train component intent_featurizer_count_vectors
Traceback (most recent call last):
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\train.py", line 184, in <module>
    num_threads=cmdline_args.num_threads)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\train.py", line 154, in do_train
    interpreter = trainer.train(training_data, **kwargs)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\model.py", line 196, in train
    **context)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\featurizers\count_vectors_featurizer.py", line 214, in train
    X = self.vect.fit_transform(lem_exs).toarray()
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 947, in toarray
    out = self._process_toarray_args(order, out)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\sparse\base.py", line 1184, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

During this run, I don't see my RAM usage exceed 6 GB, even though I have 16 GB of RAM. Thanks for your help!

akelad commented 5 years ago

Thanks for raising this issue, @wochinge will get back to you about it soon.

wochinge commented 5 years ago

@HarshKhadloya Do you have 32-bit or 64-bit Python installed?

HarshKhadloya commented 5 years ago

> @HarshKhadloya Do you have 32-bit or 64-bit Python installed?

@wochinge It's 64-bit.

platform.architecture() : ('64bit', 'WindowsPE')

wochinge commented 5 years ago

Can you run is_64bits = sys.maxsize > 2**32 in your Python shell? The question is not whether Windows is 32-bit or 64-bit, but whether the Python binaries you have installed are 32-bit or 64-bit.

HarshKhadloya commented 5 years ago

@wochinge Well, the output is True for sys.maxsize > 2**32. Let me know if you need any further details.

wochinge commented 5 years ago

Thanks for the result of the command! :-) Can you please check how much memory the Python process is using when the MemoryError happens?
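
One way to check this from inside Python, as a sketch of my own rather than something from this thread (it assumes the psutil package is installed):

import os
import psutil

# Resident set size of the current Python process, in GiB.
process = psutil.Process(os.getpid())
print("RSS: {:.2f} GiB".format(process.memory_info().rss / 1024 ** 3))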

HarshKhadloya commented 5 years ago

Hi @wochinge, it's using only around 1.5 GB.

(screenshot of memory usage attached)

HarshKhadloya commented 5 years ago

Hi, is anyone able to understand what's happening here? Any updates would be really helpful. Thanks.

wochinge commented 5 years ago

@HarshKhadloya Do you have multiple Python versions installed? Since the Python process consumes exactly 2 GiB of memory when it crashes, I assume you are somehow using a 32-bit Python version. Are you using a virtualenv?

HarshKhadloya commented 5 years ago

@wochinge No, I don't have multiple versions installed; I have just one version, installed via Anaconda. Some additional information: since Rasa was not compatible with the latest Python version, 3.7 (I was encountering issues), I uninstalled Anaconda and installed a 3.5 version. I am doing this work on an EC2 instance. Let me know if further info is required.

(screenshot attached)

akelad commented 5 years ago

We've been experiencing some memory errors ourselves; it might just be that the array it's about to create would be too big to fit into memory. The point where it breaks is where a scipy sparse array is converted into a numpy array; the dense numpy array is much bigger than the scipy sparse array, which is probably the cause. We don't really have a quick fix for that right now, but we may merge one in the future, as we're working on optimising training for the tensorflow pipeline ourselves.
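
To make the size difference concrete, here is a rough sketch with assumed numbers (the vocabulary size is a guess, not a figure from this thread) of what the toarray() call in the traceback has to allocate:

import numpy as np
from scipy import sparse

n_examples = 73962   # intent examples from the training log above
vocab_size = 20000   # hypothetical vocabulary size, for illustration only

# The dense copy that np.zeros() in the traceback must allocate up front:
dense_bytes = n_examples * vocab_size * np.dtype(np.float64).itemsize
print("dense: {:.1f} GiB".format(dense_bytes / 1024 ** 3))  # ~11 GiB

# The sparse matrix stores only the non-zero counts; at ~20 tokens per
# message it stays in the tens of MiB.
X = sparse.random(n_examples, vocab_size, density=0.001, format="csr")
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print("sparse: {:.0f} MiB".format(sparse_bytes / 1024 ** 2))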

kenzydarioo commented 5 years ago

@akelad Is it possible to split the .md training data, train on the parts separately, and somehow append them into one model at the end? I'm experiencing the same thing using the tensorflow embedding config. Thanks in advance.

HarshKhadloya commented 5 years ago

@akelad I had the same hypothesis, given the point where it breaks. Thanks for your response, and I hope the Rasa team will fix this in the future, as working with the Rasa module has been very helpful. I ended up building an independent classification model (fine for my use case) and will use Rasa for entity extraction (better usability than plain CRF).

@kenzydarioo A workaround would be to manually split your data and build sequential models for the additional intents. The additional intents can be tagged as 'Others' in the previous model, since the MemoryError seems to be driven mainly by the number of intents. For example, I was able to create a model with 200k lines of training data and just 5 intents (though most of the data were duplicates). Let us know which approach works for you!

adirizka7 commented 5 years ago

@HarshKhadloya Are there any entities in your training data? Can you tell me how to build sequential models and then combine them into one model?

HarshKhadloya commented 5 years ago

@adirizka7 Yes, I also have entities in my training data. I was suggesting using the models sequentially, not combining them into one. As a crude example, let's say there are 5 intents: A, B, C, D & E. The first model predicts A, B & Others (where Others covers C, D & E). The second model predicts C, D & E, and it is run only on the messages the first model labels as 'Others' (see the sketch below). This can be handy when you have a large number of intents; for smaller numbers, like in this example, a single Rasa model should work.
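
A minimal sketch of that routing, using the rasa_nlu 0.13 Interpreter API from this thread; the model paths and the 'Others' intent name are hypothetical:

from rasa_nlu.model import Interpreter

# The first model predicts A, B and a catch-all 'Others';
# the second predicts C, D and E.
first = Interpreter.load("models/current/nlu_first")
second = Interpreter.load("models/current/nlu_second")

def classify(text):
    result = first.parse(text)
    if result["intent"]["name"] == "Others":
        # Only messages that land in the catch-all bucket hit the second model.
        result = second.parse(text)
    return result["intent"]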

kenzydarioo commented 5 years ago

@HarshKhadloya Thanks for the tips, but is there any other way to do this without creating multiple agents? Like combining separate models into one?

akelad commented 5 years ago

I'm going to leave this open, because it is an issue and we are looking into it

HarshKhadloya commented 5 years ago

Thanks @akelad

@kenzydarioo It really depends on your use case. I can share my thoughts if I know the background of what you are trying to achieve. In my case, as mentioned, I was able to create an independent classification model: a regular logistic regression / SVM on the document-term matrix of my dataset.

TatianaParshina commented 5 years ago

I have the same problem with MemoryError and can't train my model using tensorflow_embedding on a big training set. As a workaround, I train the model on only a small training set.

alvipranandha commented 5 years ago

Yes, I have the same issue with around 90k rows of data and 144 intents. How did you solve it while waiting for the Rasa team to fix this problem?

akelad commented 5 years ago

Allocate more memory to the machine... I'm afraid there's no workaround just yet; our fix is still a work in progress.

wibimaster commented 5 years ago

Same problem here with 8k intents and 1-4 common_examples each. 50 GiB of memory allocated, only 2 GiB used when it failed.

Using Docker on Ubuntu 18.04 (FROM python:3.6.8-slim-stretch)

rasa-config.yml:

language: "fr"

pipeline:
- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  intent_tokenization_flag: true
  intent_split_symbol: "+"

From the Python console:

>>> import sys
>>> is_64bits = sys.maxsize > 2**32
>>> print(is_64bits)
True

From docker stats:

CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
494fdf122804 rasanlu_python_prod_1 0.01% 2.182GiB / 50GiB 4.36% 13.6MB / 135kB 0B / 54.2MB 82

CPU peaked at 2365% (24 cores); the 50 GiB limit was never reached (no difference with 120 GiB).

Error from the logs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/train.py", line 174, in <module>
    num_threads=cmdline_args.num_threads)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/train.py", line 149, in do_train
    interpreter = trainer.train(training_data, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/model.py", line 190, in train
    **context)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/classifiers/embedding_intent_classifier.py", line 446, in train
    training_data, intent_dict)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/classifiers/embedding_intent_classifier.py", line 272, in _prepare_data_for_training
    all_Y = self._create_all_Y(X.shape[0])
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/classifiers/embedding_intent_classifier.py", line 256, in _create_all_Y
    all_Y = np.stack([self.encoded_all_intents for _ in range(size)])
  File "/usr/local/lib/python3.6/site-packages/numpy/core/shape_base.py", line 423, in stack
    return _nx.concatenate(expanded_arrays, axis=axis, out=out)
MemoryError

wibimaster commented 5 years ago

Okay, this seems to be a different problem, because it doesn't fail in the same method (concatenate() in my case, zeros() in the original report from @HarshKhadloya).

But it seems to be related to numpy in both cases...

RAM isn't full when it fails, so I have a question: does numpy evaluate the amount of memory that's going to be used BEFORE doing the concatenate?

If so, maybe the estimate comes out above 120 GB and it fails before actually using the memory, which could be why we don't see the RAM fill up.

The original OP saw the same thing; in his case, 6 GB used out of 16 GB (in my case, 2.2 GB out of 50 or 120 GB).

Any news / tips on that? Thanks!

akelad commented 5 years ago

Yeah, it's because we're using numpy arrays here, which at this point take up a huge amount of memory. Or are about to, as you said. The solution is to use sparse arrays, which we are doing in a separate branch that isn't quite ready to be merged yet. We will be merging it in the next few months, so you should have no more problems with memory at that point.
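
For what it's worth, numpy requests the whole contiguous block from the allocator up front and raises MemoryError immediately if the request cannot be satisfied, which is why monitored RAM never fills before the crash. A back-of-the-envelope estimate of the all_Y array from the second traceback, with numbers that are guesses rather than measurements:

n_examples = 16000   # rough guess: ~8k intents with 1-4 examples each
n_intents = 8000
encoding_dim = 100   # hypothetical length of each encoded intent vector

# np.stack() builds an (n_examples, n_intents, encoding_dim) float64 array
# in a single contiguous allocation.
bytes_needed = n_examples * n_intents * encoding_dim * 8
print("{:.0f} GiB".format(bytes_needed / 1024 ** 3))  # ~95 GiB, over the 50 GiB limit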

XiaofeiQian commented 5 years ago

I have the same issue. Could we get this fix into one of the upcoming minor releases?

@akelad Thanks!

akelad commented 5 years ago

Still no update on this, sorry!

sagardawda7 commented 5 years ago

Since the Vectorizer tries to create a vector with every word as a feature, it can lead to a MemoryError on a large corpus. You can restrict max_features. I modified my config file as follows and the issue was resolved:

language: en
pipeline: 
  - name: CountVectorsFeaturizer
    max_features: 1000
  - name: EmbeddingIntentClassifier
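
For reference, Rasa's CountVectorsFeaturizer wraps scikit-learn's CountVectorizer (visible in the first traceback above). A small standalone sketch, with a toy corpus of my own, of what the cap does:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["book a table for two", "book a flight to rome", "what is the weather"]

unbounded = CountVectorizer().fit(corpus)
capped = CountVectorizer(max_features=5).fit(corpus)

# Without a cap, every distinct token becomes a column; max_features keeps
# only the most frequent tokens, shrinking the feature matrix accordingly.
print(len(unbounded.vocabulary_))  # 11
print(len(capped.vocabulary_))     # 5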

alvipranandha commented 5 years ago

> Since the Vectorizer tries to create a vector with every word as a feature, it can lead to a MemoryError on a large corpus. You can restrict max_features. I modified my config file as follows and the issue was resolved:
>
> language: en
> pipeline:
>   - name: CountVectorsFeaturizer
>     max_features: 1000
>   - name: EmbeddingIntentClassifier

Well, I just followed your config for CountVectorsFeaturizer but still got a memory error.

sagardawda7 commented 5 years ago

> Since the Vectorizer tries to create a vector with every word as a feature, it can lead to a MemoryError on a large corpus. You can restrict max_features. I modified my config file as follows and the issue was resolved:
>
> language: en
> pipeline:
>   - name: CountVectorsFeaturizer
>     max_features: 1000
>   - name: EmbeddingIntentClassifier
>
> Well, I just followed your config for CountVectorsFeaturizer but still got a memory error.

@alvipranandha Can you make one small change mentioned below and check?

language: en
pipeline: 
  - name: CountVectorsFeaturizer
    max_features: 1000
  - name: EmbeddingIntentClassifier
    intent_tokenization_flag: true # Since you have multiple intents
    batch_size: [32, 64] # Default is [64, 256]. Larger batch sizes occupy more memory

Let me know how it goes

alvipranandha commented 5 years ago

> @alvipranandha Can you make one small change mentioned below and check?
>
> language: en
> pipeline:
>   - name: CountVectorsFeaturizer
>     max_features: 1000
>   - name: EmbeddingIntentClassifier
>     intent_tokenization_flag: true # Since you have multiple intents
>     batch_size: [32, 64] # Default is [64, 256]. Larger batch sizes occupy more memory
>
> Let me know how it goes

Thank you for your config. I can now run training without a memory error on around 95k rows, with around 144 intents and around 28 entities. But the results are still not good; I need to find the best hyperparameter tuning for my custom dataset.

sagardawda7 commented 5 years ago

That's awesome. You can try changing the max_features size and looking at the results. I have an 8-core system with 16 GB RAM, and my value for max_features was 5000.

Alternatively, you can develop a custom featurizer using a TF-IDF vectorizer and set max_features to whatever fits in your memory. TF-IDF may help boost the results.
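
A bare-bones sketch of that idea (a real Rasa custom component would also need to implement the Component interface; this only shows the vectorizer side, with a cap chosen as an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

# max_features bounds the vocabulary just as in CountVectorizer, and the
# output stays a scipy sparse matrix unless it is explicitly densified.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(["book a flight", "cancel my order"])
print(X.shape, X.nnz)  # (2, vocabulary size), number of stored non-zeros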

alvipranandha commented 5 years ago

Well, thank you for your suggestion @sagardawda7. I have a quad-core system with 16 GB RAM and am still searching for the best config for our dataset. How do I find a value of max_features that avoids a memory error? Do I check values one by one, or is there another method?

Arvind-Kumar-Agrawal commented 5 years ago

@akelad As mentioned above, the team is trying to replace the dense numpy arrays with something else. Do we have any update? What is the name of the branch where I can find the fix? I would like to see the solution even if it is yet to be merged. Please let me know.

akelad commented 4 years ago

@Ghostvv could you update everyone on the latest status of this?

Ghostvv commented 4 years ago

We have a branch where we're using sparse matrices instead of dense numpy arrays, but it is implemented for the new architecture that we're working on. @tabergma, could you please link the branch here?

tabergma commented 4 years ago

We have two branches:

  1. The following branch uses sparse matrices in the CountVectorsFeaturizer. The features are then used in the EmbeddingIntentClassifier. However, the code is not cleaned up. https://github.com/RasaHQ/rasa/tree/entity-recognition
  2. We are currently cleaning up the above branch and moving everything to https://github.com/RasaHQ/rasa/tree/combined-entity-intent-model. (It might take another 1-2 weeks until all the functionality of the first branch is on this branch.)

suryavamsi1563 commented 4 years ago

Hi guys, I am facing the same memory issue. Is the new branch ready? Can I use the branch https://github.com/RasaHQ/rasa/tree/combined-entity-intent-model, or should I use the latest Rasa git version?

tabergma commented 4 years ago

@suryavamsi1563 The branch https://github.com/RasaHQ/rasa/tree/combined-entity-intent-model is not ready yet. We ran into some issues along the way. You should be able to use it at the beginning of next week.

igormis commented 4 years ago

I have training data with the following characteristics:

- intent examples: 11263 (2 distinct intents)
    - Found intents: 'general', 'irrelevant'
    - Number of response examples: 0 (0 distinct response)
    - entity examples: 9407 (22 distinct entities)
    - found entities: '', 'company', 'amount_price_target', 'analyst', 'financial_topic', 'financial_instrument', 'period', 'person', 'price_movement', 'hashtag', 'publication', 'ticker', 'amount', 'percent', 'number', 'media_type', 'location', 'rating_agency', 'event', 'exchange', 'product', 'sector'

When I run the command rasa test nlu --config pretrained_embeddings_spacy.yml supervised_embeddings.yml --nlu CF_model/config_en.json --runs 3 --percentages 0 25 50 70 90, I get a memory error. Any ideas how to solve this?

Maadhav commented 4 years ago

> I have training data with the following characteristics:
>
> - intent examples: 11263 (2 distinct intents)
>     - Found intents: 'general', 'irrelevant'
>     - Number of response examples: 0 (0 distinct response)
>     - entity examples: 9407 (22 distinct entities)
>     - found entities: '', 'company', 'amount_price_target', 'analyst', 'financial_topic', 'financial_instrument', 'period', 'person', 'price_movement', 'hashtag', 'publication', 'ticker', 'amount', 'percent', 'number', 'media_type', 'location', 'rating_agency', 'event', 'exchange', 'product', 'sector'
>
> When I run the command rasa test nlu --config pretrained_embeddings_spacy.yml supervised_embeddings.yml --nlu CF_model/config_en.json --runs 3 --percentages 0 25 50 70 90, I get a memory error. Any ideas how to solve this?

Were you able to solve it?

igormis commented 4 years ago

No, unfortunately, I did not. Any suggestions?

tabergma commented 4 years ago

@igormis Did you use the latest Rasa version? I assume --config pretrained_embeddings_spacy.yml corresponds to pipeline: "pretrained_embeddings_spacy"? Does the model train if you do just a single training run with rasa train nlu? When exactly does the memory error occur: when loading the data, during training, or during evaluation?

Maadhav commented 4 years ago

> No, unfortunately, I did not. Any suggestions?

I was able to solve my problem. Actually, I was running both Rasa X and rasa run simultaneously, which was creating the memory problem. So I closed Rasa X, and then rasa run started like a charm.

suryavamsi1563 commented 4 years ago

Hi @tabergma, is the sparse-arrays branch https://github.com/RasaHQ/rasa/tree/entity-recognition ready for use? I am facing a memory error and desperately need a workaround or a solution.

tabergma commented 4 years ago

@suryavamsi1563 We merged the sparse features into master and released them with Rasa 1.6.0. So, just use the latest Rasa version. Let me know if you are still running into issues.

akelad commented 4 years ago

I am going to close this, as it should be fixed with 1.6. If there are still issues, please create a new issue.