facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

code changes for incremental training support #1327

Closed SergeiAlonichau closed 4 months ago

SergeiAlonichau commented 1 year ago

Committer: SergeiAlonichau sergeialonichau@gmail.com

On branch IncrementalTraining
Changes to be committed:
  modified: src/args.cc
  modified: src/args.h
  modified: src/densematrix.cc
  modified: src/dictionary.h
  modified: src/fasttext.cc
  modified: src/loss.h
  modified: src/model.cc
  modified: src/model.h
  modified: src/vector.cc
  modified: src/vector.h

What is this for

This PR adds two new parameters: -nepoch, the index of the current epoch, and -inputModel, the checkpoint model to resume from. When -nepoch N is specified, the tool exits after that epoch and saves checkpoint files under the given output prefix. When -nepoch 0 is specified, no checkpoint is loaded. For large data that does not fit into memory, you should shuffle it and split it into equal parts (each as large as fits into memory) for the best performance.

This allows for:

  1. training and evaluation after each epoch
  2. training on data split into parts when the full set does not fit into memory at once
  3. fine-tuning an already trained model
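For the out-of-memory case, the shuffle-and-split step can be done with standard tools before training. A minimal sketch, assuming GNU coreutils (shuf, split) and a hypothetical training file all_data.txt; the part names here are illustrative, not the ones used in the examples below:

```shell
# Shuffle the full training file, then split it into two
# roughly equal parts; -n l/2 splits by lines, so no line
# is ever cut in half between parts.
shuf all_data.txt > all_data.shuf.txt
split -n l/2 -d --additional-suffix=.txt all_data.shuf.txt td_part
# Produces td_part00.txt and td_part01.txt
wc -l td_part*.txt
```

Each part can then be fed to a separate -nepoch step, as in examples 3 and 4 below.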

Usage examples:

  1. Regular training in one shot with all the data:
    
    ./fasttext.exe supervised -input in_sample_td_1p.txt -output modelx -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -epoch 10
    ./fasttext test modelx.bin in_sample_td_1p.txt 1

Read 96M words
Number of words: 234072
Number of labels: 2
start training...
Progress: 100.0% words/sec/thread: 11638890 lr: 0.000000 loss: 0.204641 ETA: 0h 0m

N 4002234
P@1 0.994
R@1 0.994
Number of examples: 4002234


2. Training one epoch after another with checkpoints on the same data:

./fasttext.exe supervised -input in_sample_td_1p.txt -output model0 -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -epoch 10 -nepoch 0 -inputModel empty.bin
./fasttext test model0.bin in_sample_td_1p.txt 1
for e in 1 2 3 4 5 6 7 8 9 ; do
  p=$(awk "BEGIN { print $e - 1 }")
  echo ./fasttext.exe supervised -input in_sample_td_1p.txt -output model$e -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model$p.bin -epoch 10 -nepoch $e
  ./fasttext.exe supervised -input in_sample_td_1p.txt -output model$e -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model$p.bin -epoch 10 -nepoch $e
  echo ./fasttext test model$e.bin in_sample_td_1p.txt 1
  ./fasttext test model$e.bin in_sample_td_1p.txt 1
done
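Side note: the awk calls in the loop only perform integer arithmetic to derive the previous epoch index, so POSIX shell arithmetic expansion is an equivalent, dependency-free alternative:

```shell
# $((...)) does integer arithmetic directly in the shell,
# so no external awk process is needed for index math.
e=5
p=$((e - 1))
echo "$p"   # prints 4
```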

...

Read 96M words
Number of words: 234072
Number of labels: 2
Update args
Load dict from trained model
Read 96M words
Load dict from training data
Read 96M words
Number of words: 234072
Number of labels: 2
start training...
Progress: 100.0% words/sec/thread: 108804462 lr: 0.000000 loss: 0.208056 ETA: 0h 0m
./fasttext test model8.bin in_sample_td_1p.txt 1
N 4002234
P@1 0.991
R@1 0.991
Number of examples: 4002234
./fasttext.exe supervised -input in_sample_td_1p.txt -output model9 -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model8.bin -epoch 10 -nepoch 9
Read 96M words
Number of words: 234072
Number of labels: 2
Update args
Load dict from trained model
Read 96M words
Load dict from training data
Read 96M words
Number of words: 234072
Number of labels: 2
start training...
Progress: 100.0% words/sec/thread: 119974496 lr: 0.000000 loss: 0.188905 ETA: 0h 0m
./fasttext test model9.bin in_sample_td_1p.txt 1
N 4002234
P@1 0.993
R@1 0.993
Number of examples: 4002234


3. Test training one epoch after another on two different parts of the training data:

$ wc -l td*txt
 2001138 td_part1.txt
 2001096 td_part2.txt
 4002234 total

./fasttext.exe supervised -input td_part2.txt -output model0 -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -epoch 2 -nepoch 0
./fasttext test model0.bin in_sample_td_1p.txt 1
./fasttext.exe supervised -input td_part1.txt -output model1 -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model0.bin -epoch 2 -nepoch 1
./fasttext test model1.bin in_sample_td_1p.txt 1

N 4002234
P@1 0.805
R@1 0.805
Number of examples: 4002234


Compare it to 1 epoch end-to-end without a split:

./fasttext.exe supervised -input in_sample_td_1p.txt -output modely -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -epoch 1
./fasttext test modely.bin in_sample_td_1p.txt 1
N 4002234
P@1 0.805
R@1 0.805
Number of examples: 4002234


4. Train with 2 parts of data for 10 epochs (equivalent to examples 1 and 2, but the data are split into two random, equal-size parts):

./fasttext.exe supervised -input td_part2.txt -output model0 -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -epoch 20 -nepoch 0
./fasttext test model0.bin in_sample_td_1p.txt 1
./fasttext.exe supervised -input td_part1.txt -output model1 -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model0.bin -epoch 20 -nepoch 1
./fasttext test model1.bin in_sample_td_1p.txt 1

for e in $(seq 2 2 19) ; do
  p=$(awk "BEGIN { print $e - 1 }")
  n=$(awk "BEGIN { print $e + 1 }")
  echo ./fasttext.exe supervised -input td_part2.txt -output model$e -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model$p.bin -epoch 20 -nepoch $e
  ./fasttext.exe supervised -input td_part2.txt -output model$e -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model$p.bin -epoch 20 -nepoch $e
  echo ./fasttext.exe supervised -input td_part1.txt -output model$n -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model$e.bin -epoch 20 -nepoch $n
  ./fasttext.exe supervised -input td_part1.txt -output model$n -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -inputModel model$e.bin -epoch 20 -nepoch $n
  ./fasttext test model$n.bin in_sample_td_1p.txt 1
done
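When chaining epoch steps like this, it can be worth failing fast if the checkpoint from the previous step was never written (for example, because an earlier invocation crashed). A small guard function, a sketch assuming the model$p.bin naming used in these examples:

```shell
# Return nonzero if the checkpoint file is absent, so a
# calling loop can abort instead of training from a stale
# or missing model.
check_checkpoint() {
  [ -f "$1" ] || { echo "missing checkpoint: $1" >&2; return 1; }
}
```

Inside the loop above, a hypothetical call would look like: check_checkpoint model$p.bin || break.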

...
Read 48M words
Number of words: 228529
Number of labels: 2
Update args
Load dict from trained model
Read 48M words
Load dict from training data
Read 48M words
Number of words: 228529
Number of labels: 2
start training...
Progress: 100.0% words/sec/thread: 207331200 lr: 0.000000 loss: 0.194417 ETA: 0h 0m
N 4002234
P@1 0.993
R@1 0.993
Number of examples: 4002234


5. Test OVA loss:

./fasttext.exe supervised -input td_part2.txt -output model0 -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -loss ova -epoch 2 -nepoch 0
./fasttext test model0.bin in_sample_td_1p.txt 1
./fasttext.exe supervised -input td_part1.txt -output model1 -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -loss ova -inputModel model0.bin -epoch 2 -nepoch 1
./fasttext test model1.bin in_sample_td_1p.txt 1


Compare it to 1 epoch end-to-end without a split:

./fasttext.exe supervised -input in_sample_td_1p.txt -output modely -dim 2 -wordNgrams 6 -bucket 80000000 -thread 10 -verbose 1 -loss ova -epoch 1
./fasttext test modely.bin in_sample_td_1p.txt 1

...ETA: 0h 0m 0s
N 4002234
P@1 0.808
R@1 0.808
Read 48M words
Number of words: 228529
Number of labels: 2
Update args
Load dict from trained model
Read 48M words
Load dict from training data
Read 48M words
Number of words: 228529
Number of labels: 2
Progress: 100.0% words/sec/thread: 2326473 lr: 0.000000 avg.loss: 0.855847 ETA: 0h 0m 0s
N 4002234
P@1 0.821
R@1 0.821

Read 96M words
Number of words: 234072
Number of labels: 2
Progress: 100.0% words/sec/thread: 1138778 lr: 0.000000 avg.loss: 0.854935 ETA: 0h 0m 0s

N 4002234
P@1 0.821
R@1 0.821

facebook-github-bot commented 1 year ago

Hi @SergeiAlonichau!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

SergeiAlonichau commented 1 year ago

I have signed the CLA.

facebook-github-bot commented 1 year ago

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

loretoparisi commented 1 year ago

+1 to merge this PR!

minhhieuuuuu88 commented 1 year ago

+1