axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more
MIT License

Adding documents after training leads to a different model.nlp file than adding all documents at once and then training #192

Closed mmayla closed 1 year ago

mmayla commented 5 years ago

Describe the bug: Can we add documents after training and then train again, and so on? For example, I have a system that keeps receiving new intents, so I add the new intents periodically and retrain the NLP manager.

This is critical for me because I have thousands of intents stored in a database, and the number keeps growing. The first training takes hours, so restarting the training from scratch is not practical; I want to just add the new intents and call the train function again. I tried it this way and it is definitely faster.

The problem is that the output file is different when I spread the intents over several training batches. Is this a problem?

To Reproduce: Steps to reproduce the behaviour.

Case 1:

  1. Add X intents (nlpManager.addDocument)
  2. Train (nlpManager.train)
  3. Add Y intents
  4. Train again
  5. Save the file model.nlp

Case 2:

  1. Add X + Y intents (nlpManager.addDocument)
  2. Train (nlpManager.train)
  3. Save the file model.nlp

Expected behaviour: The model.nlp files from case 1 and case 2 should have the same size (same output), but the model.nlp from case 1 is always smaller.
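For anyone trying to reproduce this, the two cases above could be scripted roughly as follows. This is only a sketch: `trainInBatches` and the `{ utterance, intent }` document shape are hypothetical names introduced here, and the manager is passed in through a factory so the helper itself does not depend on any particular node-nlp version. It assumes the manager exposes `addDocument(locale, utterance, intent)`, an async `train()`, and `save(path)`, and that all documents are English.

```javascript
// Hypothetical helper (not part of node-nlp): adds each batch of documents,
// trains after every batch, then saves the resulting model file.
// `createManager` must return an object with addDocument(locale, utterance, intent),
// an async train(), and save(path). These are assumed here, check your node-nlp version.
async function trainInBatches(createManager, batches, outFile) {
  const manager = createManager();
  for (const batch of batches) {
    for (const { utterance, intent } of batch) {
      manager.addDocument('en', utterance, intent); // assumption: English documents
    }
    await manager.train(); // one train call per batch
  }
  manager.save(outFile);
  return manager;
}

// With node-nlp installed, the two cases could be driven like this
// (x and y are arrays of { utterance, intent } objects):
//   const { NlpManager } = require('node-nlp');
//   const make = () => new NlpManager({ languages: ['en'] });
//   await trainInBatches(make, [x, y], 'case1.nlp');          // Case 1: X, train, Y, train
//   await trainInBatches(make, [[...x, ...y]], 'case2.nlp');  // Case 2: X + Y, train once
```

Comparing `case1.nlp` and `case2.nlp` afterwards should show whether the difference described above appears in a given version.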

Additional context: node-nlp v3.0.3

JoshuaeKaiser commented 5 years ago

You need to create a new nlpManager every time you run a training cycle and replace your old model with the new one.

aponomy commented 5 years ago

Hello everyone!

@MMayla: Have you found a solution to this? @jesus-seijas-sp: Thanks for all your great work on this package! Is it possible to use incremental training as suggested in this issue?

mmayla commented 5 years ago

@aponomy the only solution I found is to not add documents after training. If I have new documents, I create a new nlpManager and train it, as @JoshuaeKaiser suggested.

aponomy commented 5 years ago

@MMayla thanks, sorry for me not getting it straight away - so you end up with two models then? And can you then merge these models somehow, in order to have a single response to a single utterance?

mmayla commented 5 years ago

@aponomy no actually it's a lot simpler, I train all my data again (old + new) and have a new model that I use to replace the old model. :smiley:

The downside of this solution is the slow training time. In my situation new intents get added every minute, and I wanted training to be as close to real time as possible. But the current version does not support incremental training, and I don't know whether it can be supported, so for now I have set up a scheduler that fetches the data and retrains the model every hour.

I also had another problem with training: process blocking. Training blocks any other code from running until it finishes, which was not acceptable in my situation. So I set up a child process worker that trains a new model in the background (in another process); when it finishes, it reports back and the old model is replaced with the new one. (There are other ways to overcome this, but this was the cleanest and fastest in our situation, although it was hell to do at first :cry:)

Sorry for the very long comment. I just remember how I struggled with this at first and how I wished someone would point me to solutions :grin:
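A rough sketch of that pattern, using only Node's built-in `child_process` module. The names `trainInChild`, `ModelHolder`, and the worker script path are hypothetical and introduced here for illustration; the worker is assumed to receive the training documents as a message and to send back one message containing the trained model (or a reference to it) when it is done.

```javascript
const { fork } = require('child_process');

// Runs a worker script (hypothetical, supplied by you) in a separate process,
// sends it the training documents, and resolves with the worker's first reply.
function trainInChild(workerPath, documents) {
  return new Promise((resolve, reject) => {
    const child = fork(workerPath);
    let replied = false;
    child.once('message', (result) => {
      replied = true;
      resolve(result);
      child.kill(); // the worker is no longer needed once it has replied
    });
    child.once('error', reject);
    child.once('exit', (code) => {
      if (!replied) reject(new Error(`worker exited with code ${code} before replying`));
    });
    child.send(documents);
  });
}

// Keeps serving the current model until a replacement has finished training.
class ModelHolder {
  constructor(initialModel) {
    this.current = initialModel;
    this.pending = null;
  }

  // trainFn: () => Promise<model>. A retrain request that arrives while
  // another one is still running reuses the run already in progress.
  retrain(trainFn) {
    if (!this.pending) {
      this.pending = trainFn()
        .then((model) => { this.current = model; }) // swap only on success
        .finally(() => { this.pending = null; });
    }
    return this.pending;
  }
}
```

An hourly schedule like the one mentioned earlier could then call something like `holder.retrain(() => trainInChild('./train-worker.js', docs))` from a `setInterval`, where `./train-worker.js` is a placeholder path. If training fails, `current` keeps pointing at the old model.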

aponomy commented 5 years ago

@MMayla Thanks for putting your effort into answering, I appreciate it! So there's really no easy way to have incremental training then. To train everything after every change can be a showstopper for me.

@jesus-seijas-sp if you have any idea how that could be done it would be fantastic.

@MMayla Sounds like a nice solution you got there :)

jesus-seijas-sp commented 5 years ago

Hello, if you take a look here: https://github.com/axa-group/nlp.js-app/blob/master/server/trainers/nlpjs-trainer.js you will see that what we do is to discard the previous model and train a new one inside a child process.

The reason not to do incremental training: you have one perceptron per intent, and each perceptron must see all the examples (utterances). For some of them it tries to return 1 (the utterance belongs to the perceptron's intent) and for others 0 (it does not). When you train, the weights and biases are calculated. If you then add a new utterance to an intent, this affects all the perceptrons. The problem is that the exit condition is based on the squared error across all the perceptrons, which means that when each perceptron is trained again it will not learn the new utterance correctly. The weights will not be the same as if you had trained from scratch. So it is a mathematical issue: if we want a perfect model, everything should be retrained; otherwise the existing weights and biases can make the model fail on the new utterances.
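The point that the weights will not be the same can be seen even with a toy example. The sketch below is a single textbook perceptron with a step activation and a "stop when every sample is classified correctly" exit condition. It is not the nlp.js classifier, and the feature vectors are made up. It only shows that continuing from already-trained weights and training from scratch on the same final data can end at different weights.

```javascript
// Toy single perceptron (step activation). NOT the nlp.js implementation.
// samples: [{ x: number[], y: 0 | 1 }]; start: weights and bias to begin from.
function trainPerceptron(samples, start = { w: [0, 0], b: 0 }, maxEpochs = 100) {
  const w = [...start.w];
  let b = start.b;
  for (let epoch = 0; epoch < maxEpochs; epoch += 1) {
    let errors = 0;
    for (const { x, y } of samples) {
      const sum = x.reduce((acc, xi, i) => acc + xi * w[i], b);
      const predicted = sum > 0 ? 1 : 0;
      const delta = y - predicted;
      if (delta !== 0) {
        errors += 1;
        x.forEach((xi, i) => { w[i] += delta * xi; }); // perceptron update rule
        b += delta;
      }
    }
    if (errors === 0) break; // exit condition: every sample classified correctly
  }
  return { w, b };
}

// Made-up data: y = 1 means "this utterance belongs to the intent".
const firstBatch = [{ x: [0, 1], y: 0 }, { x: [1, 0], y: 1 }];
const laterBatch = [{ x: [0, 2], y: 0 }];
const all = [...firstBatch, ...laterBatch];

// Case 1: train on the first batch, add the later batch, train again.
const incremental = trainPerceptron(all, trainPerceptron(firstBatch));
// Case 2: train once on everything.
const fromScratch = trainPerceptron(all);

console.log(incremental); // { w: [ 1, -1 ], b: 0 }
console.log(fromScratch); // { w: [ 1, -2 ], b: 0 }
```

In this tiny case both results still classify all three samples correctly; the weights simply differ because the training history differs. With many perceptrons and an exit condition based on the combined error, as described above, the difference can also affect accuracy on the new utterances.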

jesus-seijas-sp commented 5 years ago

Also, thanks to the performance improvements in the latest versions, training usual models takes seconds. Of course, for big models this is not possible, but take into account that the maximum number of intents in all the other tools is 500 or 2000. I try hard to improve the times while keeping the same accuracy and quality. Time can be improved further, but only at the cost of accuracy.

aponomy commented 5 years ago

Thank you @jesus-seijas-sp, wonderful to hear your explanation. I'm learning. I think you've done great work improving the training times. I'm happy to continue exploring this library!

aigloss commented 1 year ago

Closing due to inactivity