flashlight / flashlight

A C++ standalone library for machine learning
https://fl.readthedocs.io/en/latest/
MIT License

Unable to train using fl_asr_train with fork option using the AM file (am_500ms_future_context_dev_other.bin) #456

Open vchagari opened 3 years ago

vchagari commented 3 years ago

Bug Description

fl_asr_train crashes with a core dump when run with the fork option on the Librispeech AM file. Details are below.

Error: E0204 16:25:50.015505 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment). Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE. If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

Details:

./fl_asr_train fork /data/set3/am_500ms_future_context_dev_other.bin --flagsfile=/data/set3/for_training_fork_am_500ms_future_context.cfg --minloglevel=0 --rundir=/data/set3/02_04_2021 --rndv_filepath=""

I0204 16:25:15.639739 32766 Train.cpp:54] Parsing command line flags
I0204 16:25:15.639756 32766 Train.cpp:57] Reading flags from file /data/set3/for_training_fork_am_500ms_future_context.cfg
W0204 16:25:15.639839 32766 Helpers.cpp:91] Did not find scalefactor, using the flag's value.
I0204 16:25:15.639843 32766 Helpers.cpp:97] Using initial scale factor 1
Initialized NCCL 2.8.3 successfully!
I0204 16:25:15.898775 32766 Train.cpp:197] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --ipl_maxisz=1.7976931348623157e+308; --ipl_maxtsz=9223372036854775807; --ipl_minisz=0; --ipl_mintsz=0; --ipl_relabel_epoch=10000000; --ipl_relabel_ratio=1; --ipl_seed_model_wer=-1; --ipl_use_existing_pl=false; --unsup_datadir=; --unsup_train=; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=/data/set3/am_500ms_future_context.arch; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batching_max_duration=0; --batching_strategy=none; --batchsize=8; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/data/set3; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --everstoredb=false; --features_type=mfsc; --fftcachesize=1; --filterbanks=80; --fl_amp_max_scale_factor=32000; --fl_amp_scale_factor=4096; --fl_amp_scale_factor_update_interval=2000; --fl_amp_use_mixed_precision=false; --fl_benchmark_mode=true; --fl_log_level=; --fl_log_mem_ops_interval=0; --fl_optim_mode=; --fl_vlog_level=0; --flagsfile=/data/set3/for_training_fork_am_500ms_future_context.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --highfreqfilterbank=-1; --inputfeeding=false; --isbeamdump=false; --iter=100000000; --itersave=true; --labelsmooth=0; --leftWindowSize=50; --lexicon=/data/set3/decoder-unigram-10000-nbest10-02-04-2021.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --lmweight_high=4; --lmweight_low=0; --lmweight_step=0.20000000000000001; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lowfreqfilterbank=0; --lr=0.01; --lr_decay=10000; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxload=-1; --maxrate=10; --maxsil=50; --maxword=-1; --melfloor=1; --mfcccoeffs=13; --minrate=3; --minsil=0; --momentum=0.80000000000000004; --netoptim=sgd; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --pctteacherforcing=100; --pcttraineval=1; --pretrainWindow=0; --replabel=0; --reportiters=1000; --rightWindowSize=50; --rndv_filepath=; --rundir=/data/set3/02_04_2021; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=0; --sfx_config=; --sfx_start_update=2147483647; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --test=; --tokens=/data/set3/librispeech-train-all-unigram-10000.tokens; --train=lists/train.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --usememcache=false; --uselexicon=true; --usewordpiece=true; --valid=lists/dev.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0204 16:25:15.899092 32766 Train.cpp:198] Experiment path: /data/set3/02_04_2021
I0204 16:25:15.899096 32766 Train.cpp:199] Experiment runidx: 1
I0204 16:25:15.901998 32766 Train.cpp:272] Number of classes (network): 9998
I0204 16:25:16.854326 32766 Train.cpp:279] Number of words: 204170
E0204 16:25:18.193625 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment). Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE. If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:19.343036 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment). Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE. If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:21.500684 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment). Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE. If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:25.676436 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment). Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE. If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:33.849670 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment). Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE. If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:50.015505 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment). Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE. If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

terminate called after throwing an instance of 'cereal::Exception'
what(): Trying to load an unregistered polymorphic type (w2l::SpecAugment). Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE. If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.
Aborted at 1612484750 (unix time) try "date -d @1612484750" if you are using GNU date
PC: @ 0x7f6805cecfb7 gsignal
SIGABRT (@0x3ed00007ffe) received by PID 32766 (TID 0x7f684fb12000) from PID 32766; stack trace:
@ 0x7f6848fda980 (unknown)
@ 0x7f6805cecfb7 gsignal
@ 0x7f6805cee921 abort
@ 0x7f6806910957 (unknown)
@ 0x7f6806916ae6 (unknown)
@ 0x7f6806916b21 std::terminate()
@ 0x7f6806916da9 __cxa_rethrow
@ 0x55949666d179 main
@ 0x7f6805ccfbf7 __libc_start_main
@ 0x5594966fa79a _start
Aborted (core dumped)

Platform and Hardware

Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu version - 18.04 LTS
Python version: Python 3.6.9
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): N/A
CUDA/cuDNN version: 10.1/7.6.4.38
GPU model and memory: NVIDIA-SMI 460.27.04 Driver Version: 460.27.04

tlikhomanenko commented 3 years ago

This model was trained with the old codebase, which is why it cannot currently be loaded by the new codebase.
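As a rough illustration of why the load fails: polymorphic deserializers such as cereal look up a factory by the type name stored in the archive, so a file that records `w2l::SpecAugment` cannot be loaded by a binary that registered the class only under a different name. The sketch below is a toy registry built on the standard library only, not cereal's actual implementation. `Registry`, `makeNewRegistry`, `canLoad`, and the name `fl::SpecAugment` are hypothetical stand-ins; the only name taken from the log is `w2l::SpecAugment`.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Base class standing in for a serializable network module.
struct Module {
  virtual ~Module() = default;
};

// Hypothetical stand-in for the class as it exists in the new codebase.
struct SpecAugment : Module {};

// Toy registry: maps the type name stored in an archive to a factory.
class Registry {
 public:
  using Factory = std::function<std::unique_ptr<Module>()>;

  void add(const std::string& typeName, Factory factory) {
    factories_[typeName] = std::move(factory);
  }

  // Throws if no factory was registered under the archived name.
  std::unique_ptr<Module> create(const std::string& typeName) const {
    auto it = factories_.find(typeName);
    if (it == factories_.end()) {
      throw std::runtime_error(
          "Trying to load an unregistered polymorphic type (" + typeName + ")");
    }
    return it->second();
  }

 private:
  std::map<std::string, Factory> factories_;
};

// Registry as a new binary might build it: the class is registered only
// under its new name (assumed here to be "fl::SpecAugment").
Registry makeNewRegistry() {
  Registry registry;
  registry.add("fl::SpecAugment",
               [] { return std::make_unique<SpecAugment>(); });
  return registry;
}

// Returns true if the registry can construct the type named in the archive.
bool canLoad(const Registry& registry, const std::string& archivedName) {
  try {
    auto module = registry.create(archivedName);
    assert(module != nullptr);  // a registered factory must return an object
    return true;
  } catch (const std::runtime_error& e) {
    std::cerr << e.what() << '\n';
    return false;
  }
}
```

With this toy registry, `canLoad(makeNewRegistry(), "w2l::SpecAugment")` returns false and prints a message similar to the one in the log, while `canLoad(makeNewRegistry(), "fl::SpecAugment")` returns true. Making an old file loadable therefore means either registering the old name again or rewriting the file with the new names, which is the kind of conversion requested later in this thread.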

Solutions:

cc @vineelpratap @avidov

vchagari commented 3 years ago

Thank you @tlikhomanenko. @vineelpratap, @avidov: Could you please help me convert the model to the new format?

tlikhomanenko commented 3 years ago

Model conversion will be added in https://github.com/facebookresearch/flashlight/pull/524