browsermt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

browsermt/marian-dev regression-test-failures #17

Open jerinphilip opened 3 years ago

jerinphilip commented 3 years ago

Status

Logs

**Logs** http://vali.inf.ed.ac.uk/jenkins/job/browsermt-marian-regression-tests/7/console

```
Failed:
- tests/scorer/scores/test_scores_cpu.sh
- tests/decoder/intgemm/test_intgemm_16bit.sh
- tests/decoder/intgemm/test_intgemm_16bit_sse2.sh
- tests/decoder/intgemm/test_intgemm_8bit.sh
- tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh
- tests/models/wnmt18/test_student_small_aan_intgemm16.sh
Logs:
- /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/scorer/scores/test_scores_cpu.sh.log
- /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_16bit.sh.log
- /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_16bit_sse2.sh.log
- /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_8bit.sh.log
- /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh.log
- /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/models/wnmt18/test_student_small_aan_intgemm16.sh.log
```

This issue will be updated as I figure out what exactly is failing.

Available Machines, vector instructions

```
ansible -m shell -a "grep -o -e 'avx[^ ]*' -e 'sse[^ ]*' -e ssse3 /proc/cpuinfo | sort | uniq | tr '\n' ' '" gpu --limit '!fulla'
dagr | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
elli | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
baldur | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
bil | CHANGED | rc=0 >> avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3
buri | CHANGED | rc=0 >> sse sse2 sse4_1 sse4_2 ssse3
hodor | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
frigg | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
hretha | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
gna | CHANGED | rc=0 >> avx sse sse2 sse4_1 sse4_2 ssse3
lofn | CHANGED | rc=0 >> avx sse sse2 sse4_1 sse4_2 ssse3
mani | CHANGED | rc=0 >> avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3
mimir | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
meili | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
rindr | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
sigyn | CHANGED | rc=0 >> avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3
startiger | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
vor | CHANGED | rc=0 >> avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3
snotra | CHANGED | rc=0 >> avx sse sse2 sse4_1 sse4_2 ssse3
thrud | CHANGED | rc=0 >> avx sse sse2 sse4_1 sse4_2 ssse3
zisa | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
```
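To see at a glance which machines share a vector-instruction profile, the ansible output can be grouped by flag set. The following is only a rough sketch: the function names are made up for illustration, the sample embeds four hosts copied from the listing, and it assumes one `host | CHANGED | rc=0 >> flags` record per line.

```python
from collections import defaultdict

# Four records copied from the ansible listing (host | status | rc >> flags).
ANSIBLE_OUTPUT = """\
bil | CHANGED | rc=0 >> avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3
buri | CHANGED | rc=0 >> sse sse2 sse4_1 sse4_2 ssse3
hodor | CHANGED | rc=0 >> avx avx2 sse sse2 sse4_1 sse4_2 ssse3
lofn | CHANGED | rc=0 >> avx sse sse2 sse4_1 sse4_2 ssse3
"""


def parse_hosts(text):
    """Map each host name to the set of vector-instruction flags it reports."""
    hosts = {}
    for line in text.splitlines():
        if ">>" not in line:
            continue  # skip anything that is not a host record
        head, flags = line.split(">>", 1)
        host = head.split("|", 1)[0].strip()
        hosts[host] = set(flags.split())
    return hosts


def group_by_flags(hosts):
    """Group hosts that expose exactly the same flag set."""
    groups = defaultdict(list)
    for host, flags in sorted(hosts.items()):
        groups[" ".join(sorted(flags))].append(host)
    return dict(groups)


if __name__ == "__main__":
    for flags, names in group_by_flags(parse_hosts(ANSIBLE_OUTPUT)).items():
        print(f"{', '.join(names)}: {flags}")
```

With the full listing pasted in, each group corresponds to one class of machine on which a separate set of references might be needed.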

jerinphilip commented 3 years ago

```
[2021-02-02 11:37:57] Error: Required option 'use-legacy-batching' has not been set
[2021-02-02 11:37:57] Error: Aborted from T marian::Options::get(const char*) const [with T = bool] in /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-10.2/src/common/options.h:134

[CALL STACK]
[0x6ffd1e] bool marian::Options:: get (char const*) const + 0x26e
[0xa0c380] marian::cpu::Backend:: configureDevice (std::shared_ptr) + 0xa0
[0x7094f0] marian::Rescore:: Rescore (std::shared_ptr) + 0x740
[0x70b0c9] std::shared_ptr> marian:: New ,std::shared_ptr&>(std::shared_ptr&) + 0x59
[0x67c602] main + 0x52
[0x7fd40cbb1840] __libc_start_main + 0xf0
[0x6aea29] _start + 0x29

test_scores_cpu.sh: line 18: 17405 Aborted (core dumped) $MRT_MARIAN/marian-scorer -c $MRT_MODELS/wmt16_systems/marian.en-de.scorer.yml --cpu-threads 2 -t $(pwd)/scores_cpu.src.in $(pwd)/scores_cpu.trg.in > scores_cpu.out
```

  1. tests/scorer/scores/test_scores_cpu.sh.log

@XapaJIaMnu (on slack): It used to be the case that there were two ways to do CBLAS_SGEMM with MKL for the attention layer: through a call to CBLAS_SGEMM_BATCHED, or through a for loop with multiple CBLAS_SGEMM calls. Since this project will use DNNL, the only available code path is the multiple CBLAS_SGEMM calls. During one of the merges with master, this option got added and removed by upstream, so I assume that's where it got messed up.
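To make the failure mode concrete, here is a toy Python model of the situation, not Marian's actual C++: a required-option lookup that aborts when the key was never registered, and a batched product that can run either as an explicit per-pair loop (the "legacy" path) or as a single batched call. The class, the function names and the default value are assumptions for illustration only.

```python
_MISSING = object()


class Options:
    """Toy option store: get() without a default fails loudly on unknown keys."""

    def __init__(self, **values):
        self._values = dict(values)

    def get(self, key, default=_MISSING):
        if key in self._values:
            return self._values[key]
        if default is not _MISSING:
            return default
        raise KeyError(f"Required option '{key}' has not been set")


def matmul(a, b):
    """One matrix product on nested lists (stands in for a single SGEMM call)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batched_matmul(batch_a, batch_b, legacy):
    """Multiply matrices pairwise, one pair per batch entry."""
    if legacy:
        # Legacy path: an explicit loop, one GEMM call per pair.
        out = []
        for a, b in zip(batch_a, batch_b):
            out.append(matmul(a, b))
        return out
    # Non-legacy path: in a real BLAS this would be one batched library call.
    return list(map(matmul, batch_a, batch_b))


if __name__ == "__main__":
    opts = Options()  # the option was never registered, as in the failing build
    legacy = opts.get("use-legacy-batching", default=True)
    print(batched_matmul([[[1, 2], [3, 4]]], [[[5, 6], [7, 8]]], legacy))
```

Both paths compute the same products; only the lookup without a default reproduces the "has not been set" abort.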

jerinphilip commented 3 years ago
  1. /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_16bit.sh.log
+ python3 /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/sacrebleu/sacrebleu.py newstest2018.ref
+ tee intgemm_16bit.out.bleu
BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.2.12 = 30.5 66.3/38.6/24.4/15.8 (BP = 0.968 ratio = 0.968 hyp_len = 2748 ref_len = 2838)
+ cat intgemm_16bit.avx.expected.bleu
BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.2.12 = 30.5 66.3/38.6/24.5/15.8 (BP = 0.967 ratio = 0.967 hyp_len = 2745 ref_len = 2838)
+ /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/diff.sh intgemm_16bit.out intgemm_16bit.avx.expected
Command: /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/diff.sh /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/intgemm_16bit.out /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/intgemm_16bit.avx.expected
14c14
< Ago Leis, head of the Central Criminal Police Service, said the arrests were preceded by a probe into a year-and-a-half year-and-a-half investigation.
---
> Ago Leis, head of the Central Criminal Police Service, said the arrests were preceded by a year-and-a-half probe.
28c28
< For example, the latest court rulings, eight defendants separated from the so-called Dikayev Criminal Association criminal case who were ordered to pay BGN 80,000 for the proceeds of criminal damage, or the judgment of nine individuals, in 2006 that Igor Aleynikov established a criminal association aimed at the illegal trade in cigarettes and the committing of crimes related to human trafficking in East Virginia and the South in Estonia.
---
> For example, the latest court rulings, eight defendants separated from the so-called Dikayev Criminal Association criminal case, who were ordered to pay BGN 80,000 for the proceeds of criminal damage, or the judgment of nine individuals, in 2006 that Igor Aleynikov established a criminal association aimed at the illegal trade in cigarettes and the committing of crimes related to human trafficking in East Virginia and the South in Estonia.

Why is this failing?

jerinphilip commented 3 years ago
  1. /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_16bit_sse2.sh.log
+ /var/lib/jenkins/workspace/browsermt-marian-regression-tests/marian-dev/build/marian-conv -f /var/lib/jenkins/workspace/browsermt-marian-regression-tests/models/student-eten/model.npz -t intgemm_16bit_sse2.avx.bin --gemm-type intgemm16sse2
[2021-02-02 11:54:06] Error: Unknown gemm-type: intgemm16sse2
[2021-02-02 11:54:06] Error: Aborted from int main(int, char**) in /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-10.2/src/command/marian_conv.cpp:54

[CALL STACK]
[0x57b0b2]          main                                               + 0x1762
[0x7f8d8446a840]    __libc_start_main                                  + 0xf0
[0x59e8f9]          _start                                             + 0x29

test_intgemm_16bit_sse2.sh: line 37: 27191 Aborted                 (core dumped) $MRT_MARIAN/marian-conv -f $MRT_MODELS/student-eten/model.npz -t $prefix.$suffix.bin --gemm-type intgemm16sse2

This is a named-parameter failure: `marian-conv` does not recognize the gemm-type `intgemm16sse2`.

jerinphilip commented 3 years ago
  1. /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_8bit.sh.log
+ python3 /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/sacrebleu/sacrebleu.py newstest2018.ref
+ tee intgemm_8bit.out.bleu
BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.2.12 = 29.6 65.4/38.0/23.8/14.9 (BP = 0.966 ratio = 0.966 hyp_len = 2742 ref_len = 2838)
+ cat intgemm_8bit.avx.expected.bleu
BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.2.12 = 29.8 65.5/38.1/24.1/15.0 (BP = 0.968 ratio = 0.969 hyp_len = 2749 ref_len = 2838)
+ /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/diff.sh intgemm_8bit.out intgemm_8bit.avx.expected
Command: /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/diff.sh /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/intgemm_8bit.out /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/intgemm_8bit.avx.expected

Outputs are very different. 98 lines differ. Probably some gemm switch/feature to be enabled as a fix?
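A quick way to get a count like the one above is a positional comparison of the output against the expected file. This is a minimal sketch that assumes one translated sentence per line; it is not what `diff.sh` does internally, and the function name is hypothetical.

```python
from itertools import zip_longest


def differing_lines(output_lines, expected_lines):
    """Return the 1-based line numbers at which the two files disagree.

    Lines are compared position by position; a missing line on either
    side counts as a difference.
    """
    return [
        lineno
        for lineno, (out, exp) in enumerate(
            zip_longest(output_lines, expected_lines), start=1
        )
        if out != exp
    ]


if __name__ == "__main__":
    output = ["a year-and-a-half year-and-a-half investigation.", "same line"]
    expected = ["a year-and-a-half probe.", "same line"]
    print(len(differing_lines(output, expected)), "line(s) differ")
```

For real files, pass `open(path).read().splitlines()` for each side.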

jerinphilip commented 3 years ago
  1. /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh.log
+ /var/lib/jenkins/workspace/browsermt-marian-regression-tests/marian-dev/build/marian-conv -f /var/lib/jenkins/workspace/browsermt-marian-regression-tests/models/student-eten/model.npz -t intgemm_8bit_ssse3.avx.bin --gemm-type intgemm8ssse3
[2021-02-02 11:54:15] Error: Unknown gemm-type: intgemm8ssse3
[2021-02-02 11:54:15] Error: Aborted from int main(int, char**) in /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-10.2/src/command/marian_conv.cpp:54

[CALL STACK]
[0x57b0b2]          main                                               + 0x1762
[0x7f8e1417c840]    __libc_start_main                                  + 0xf0
[0x59e8f9]          _start                                             + 0x29

test_intgemm_8bit_ssse3.sh: line 37: 27310 Aborted                 (core dumped) $MRT_MARIAN/marian-conv -f $MRT_MODELS/student-eten/model.npz -t $prefix.$suffix.bin --gemm-type intgemm8ssse3

Another parameter failure: `intgemm8ssse3` is also an unknown gemm-type.

jerinphilip commented 3 years ago
  1. /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/models/wnmt18/test_student_small_aan_intgemm16.sh.log
+ cat optimize_aan_16.out
+ perl -pe 's/@@ //g'
+ /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/moses-scripts/scripts/recaser/detruecase.perl
+ /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/extract-bleu.sh
+ /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/moses-scripts/scripts/generic/multi-bleu.perl newstest2014.ref
It is in-advisable to publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.
+ /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/diff-nums.py optimize_aan_16.bleu optimize_aan.bleu.expected -p 0.6 -o optimize_aan_16.bleu.diff
Command: /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tools/diff-nums.py /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/models/wnmt18/optimize_aan_16.bleu /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/models/wnmt18/optimize_aan.bleu.expected -p 0.6 -o /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/models/wnmt18/optimize_aan_16.bleu.diff
Line 1: 25.09 != 25.78
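The failing check here looks like a tolerance comparison. The sketch below assumes that `-p 0.6` means an absolute tolerance of 0.6 BLEU; the real `diff-nums.py` is not shown in this issue, so the function is only a hypothetical stand-in.

```python
def compare_numbers(actual, expected, tolerance):
    """Return one message per position whose values differ by more than tolerance."""
    problems = []
    for lineno, (a, e) in enumerate(zip(actual, expected), start=1):
        if abs(a - e) > tolerance:
            problems.append(f"Line {lineno}: {a} != {e}")
    return problems


if __name__ == "__main__":
    # The BLEU values from the log: the gap of 0.69 exceeds the 0.6 tolerance.
    for message in compare_numbers([25.09], [25.78], tolerance=0.6):
        print(message)
```

Under that assumption the test fails by a narrow margin: the scores are 0.69 apart against an allowed 0.6.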

XapaJIaMnu commented 3 years ago

Regression tests will be incompatible with upstream: they use a toned-down feature level of intgemm (they don't pass the output layer through intgemm, we do). As such, you can't get the same numbers as the upstream tests, even if you match the architecture.

Some upstream gemm configurations are not available here. We use an architecture-agnostic binary format, whereas upstream has both architecture-dependent and architecture-agnostic formats.

jerinphilip commented 3 years ago

@kpu told me to compile what's happening; that is being done in this issue. What is the recommended fix so we can get rid of the build failure on all browsermt/* updates while keeping them separate?

We can afford to keep separate regression tests if that's what it takes. I'm fairly certain I lack the context needed to get to the bottom of these test failures.

XapaJIaMnu commented 3 years ago

Sooo basically, you need to rerun the test sets on the different machines (sse, avx2, avx512, avx512vnni), create gold standard references for those and then replace the old reference with those
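One possible shape for per-architecture references is to derive the expected-file name from the CPU flags of the machine running the test. The sketch below is an assumption, not the test suite's current logic: the level names follow the list in the comment above, the flag names are the usual `/proc/cpuinfo` spellings, and the `<prefix>.<level>.expected` pattern is modelled on `intgemm_16bit.avx.expected` from the logs.

```python
# Ordered from most to least capable; the first level whose flags are all
# present wins. Level names and required flags are illustrative assumptions.
LEVELS = [
    ("avx512vnni", {"avx512_vnni"}),
    ("avx512", {"avx512f"}),
    ("avx2", {"avx2"}),
    ("sse", {"ssse3"}),
]


def isa_level(flags):
    """Return the highest level supported by the given CPU flags, or None."""
    available = set(flags)
    for name, required in LEVELS:
        if required <= available:
            return name
    return None


def expected_file(prefix, flags):
    """Build the name of the reference file to compare against (hypothetical pattern)."""
    level = isa_level(flags)
    if level is None:
        raise ValueError("no supported vector instruction set found")
    return f"{prefix}.{level}.expected"


if __name__ == "__main__":
    lofn = "avx sse sse2 sse4_1 sse4_2 ssse3".split()
    print(expected_file("intgemm_8bit", lofn))
```

Note that in this sketch a machine with AVX but no AVX2, such as lofn, falls back to the `sse` reference.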

jerinphilip commented 3 years ago

> Sooo basically, you need to rerun the test sets on the different machines (sse, avx2, avx512, avx512vnni), create gold standard references for those and then replace the old reference with those

That sounds easy for the tests where the output differs from the expected reference; I can do that while setting up the bergamot-translator tests.

What of the remaining command/argument failures? (1, 3, and 5)
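The split between argument failures and reference mismatches can be made mechanical by matching each log against the error messages seen in this issue. A rough triage sketch follows; the category labels are made up, and the patterns cover only the messages quoted in this thread.

```python
import re

# (label, pattern) pairs, checked in order; patterns are taken from the logs above.
PATTERNS = [
    ("missing-option", re.compile(r"Required option '([^']+)' has not been set")),
    ("unknown-gemm-type", re.compile(r"Unknown gemm-type: (\S+)")),
    ("numeric-diff", re.compile(r"^Line \d+: \S+ != \S+", re.M)),
    ("text-diff", re.compile(r"^\d+(?:,\d+)?[acd]\d+(?:,\d+)?$", re.M)),
]


def classify(log_text):
    """Return (label, detail) for the first matching pattern, else ('unknown', None)."""
    for label, pattern in PATTERNS:
        match = pattern.search(log_text)
        if match:
            detail = match.group(1) if match.groups() else None
            return label, detail
    return "unknown", None


if __name__ == "__main__":
    log = "[2021-02-02 11:54:06] Error: Unknown gemm-type: intgemm16sse2"
    print(classify(log))
```

The first two categories point at command-line or option problems; the last two point at references that would need regenerating.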

XapaJIaMnu commented 3 years ago

Legacy batching needs to be merged and fixed. Can you try the branch that I have proposed? The nonexistent intgemm options can be removed.

jerinphilip commented 3 years ago

@XapaJIaMnu I tested the change, it's working. Didn't have to change tests, so --use-legacy-batching is default on?

XapaJIaMnu commented 3 years ago

Technically, the results of legacy and non-legacy batching should be exactly the same. Since we are using DNNL, we only have the legacy code path available.

jerinphilip commented 2 years ago

Current status on lofn:

Skipped:
  - tests/decoder/align-ensemble/test_align_ensemble.sh
  - tests/decoder/align-ensemble/test_align_ensemble_beam_1.sh
  - tests/decoder/intgemm/test_intgemm_16bit_avx2.sh
  - tests/decoder/intgemm/test_intgemm_8bit_avx2.sh
  - tests/decoder/shortlist/test_shortlist_server.sh
  - tests/examples/iris/test_iris.sh
  - tests/examples/mnist/test_mnist_ffnn.sh
  - tests/interface/input-tsv/test_tsv_server.sh
  - tests/interface/input-tsv/test_tsv_server_dual_source.sh
  - tests/models/wngt19/test_model_base_fbgemm_packed16.sh
  - tests/models/wngt19/test_model_base_fbgemm_packed8.sh
  - tests/server/test_ende.sh
  - tests/server/test_ende_align.sh
  - tests/server/test_ende_batch32.sh
  - tests/server/test_ende_cpu.sh
  - tests/server/test_ende_with_empty_lines.sh
  - tests/training/features/exp-smoothing/test_expsmooth_sync.sh
  - tests/training/multi-gpu/test_async_sgd_runs.sh
  - tests/training/multi-gpu/test_sync_sgd.sh
  - tests/training/restoring/exp-smoothing/test_expsmooth_sync.sh
  - tests/training/restoring/multi-gpu/test_adam_sync.sh
  - tests/training/restoring/multi-gpu/test_async.sh
  - tests/training/restoring/multi-gpu/test_sync.sh
  - tests/training/restoring/optimizer/test_adam_params_async.sh
  - tests/training/restoring/optimizer/test_adam_params_sync.sh
Failed:
  - tests/decoder/align/test_align.sh
  - tests/decoder/align/test_align_beam_1.sh
  - tests/decoder/align/test_align_beam_1_batched.sh
  - tests/decoder/align/test_align_cpu.sh
  - tests/decoder/align/test_align_nbest.sh
  - tests/decoder/align/test_align_threshold.sh
  - tests/decoder/align/test_soft_align.sh
  - tests/decoder/align/test_soft_align_nbest.sh
  - tests/decoder/intgemm/test_intgemm_16bit.sh
  - tests/decoder/intgemm/test_intgemm_16bit_sse2.sh
  - tests/decoder/intgemm/test_intgemm_8bit.sh
  - tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh
  - tests/decoder/wmt16/test_ende.sh
  - tests/decoder/wmt16/test_ende_cpu.sh
  - tests/decoder/wmt16/test_ende_logs.sh
  - tests/decoder/wmt16/test_nbest.sh
  - tests/decoder/word-scores/test_word_scores.sh
  - tests/decoder/word-scores/test_word_scores_batch.sh
  - tests/decoder/word-scores/test_word_scores_ensemble.sh
  - tests/decoder/word-scores/test_word_scores_nbest.sh
  - tests/decoder/word-scores/test_word_scores_nbest_with_align.sh
  - tests/decoder/word-scores/test_word_scores_normalized.sh
  - tests/examples/unit-tests/test_unit_tests.sh
  - tests/interface/config/test_dump_config_with_relative_paths.sh
  - tests/interface/config/test_relative_paths.sh
  - tests/interface/config/test_relative_paths_apply_only_to_config_files.sh
  - tests/interface/config/test_relative_paths_are_not_applied_to_cmd_options.sh
  - tests/interface/config/test_relative_paths_for_each_config_file.sh
  - tests/interface/config/test_relative_paths_for_input_in_config_file.sh
  - tests/interface/envvars/test_interpolate_envvars.sh
  - tests/interface/input/test_empty_file.sh
  - tests/interface/version/test_no_version_from_old_models.sh
  - tests/models/wmt16-ende/test_translation_b6n.sh
  - tests/models/wmt16-ende/test_translation_b6n_batch32.sh
  - tests/models/wmt16-ende/test_translation_b6n_batch64.sh
  - tests/models/wnmt18/test_student_small_aan_intgemm16.sh
  - tests/scorer/align/test_scorer_align.sh
  - tests/scorer/align/test_scorer_align_batch_1.sh
  - tests/scorer/align/test_scorer_align_nbest.sh
  - tests/scorer/align/test_scorer_soft_align.sh
  - tests/scorer/nbest/test_compare_parallel_and_nbest.sh
  - tests/scorer/nbest/test_custom_feature_name.sh
  - tests/scorer/nbest/test_score_nbest_list.sh
  - tests/scorer/scores/test_compare_with_decoder_scores.sh
  - tests/scorer/scores/test_scores.sh
  - tests/scorer/scores/test_scores_cpu.sh
  - tests/scorer/scores/test_scores_normalized.sh
  - tests/scorer/scores/test_summary.sh
  - tests/scorer/scores/test_summary_perplexity.sh
  - tests/scorer/scores/test_word_scores.sh
  - tests/scorer/scores/test_word_scores_mini_batch_1.sh
  - tests/scorer/scores/test_word_scores_nbest.sh
  - tests/scorer/scores/test_word_scores_normalized.sh
  - tests/training/features/guided-alignment/test_guided_alignment_rnn.sh
  - tests/training/features/guided-alignment/test_guided_alignment_transformer.sh
  - tests/training/features/guided-alignment/test_guided_alignment_transformer_sync.sh
  - tests/training/restarting/test_restarting_finished.sh
---------------------
Ran 82 tests in 00:01:0.497s, 0 passed, 25 skipped, 57 failed

Some of these failures appear to be due to changes in the model archives, where files have gone missing.
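To see where the failures cluster, the run summary can be split into its sections and counted per test suite. This is a small sketch with hypothetical helper names; the sample embeds only a few entries from the listing, and it assumes section headers end in `:` and entries start with `- `.

```python
from collections import Counter

# A few entries copied from the summary above.
SUMMARY = """\
Skipped:
  - tests/server/test_ende.sh
Failed:
  - tests/decoder/align/test_align.sh
  - tests/scorer/scores/test_scores.sh
  - tests/scorer/nbest/test_score_nbest_list.sh
"""


def parse_summary(text):
    """Split a run summary into {'Skipped': [...], 'Failed': [...]} lists of paths."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(":") and not line.startswith("-"):
            current = sections.setdefault(line[:-1], [])
        elif line.startswith("- ") and current is not None:
            current.append(line[2:])
    return sections


def count_by_suite(paths):
    """Count tests per suite directory, e.g. 'tests/scorer/...' -> 'scorer'."""
    return Counter(path.split("/")[1] for path in paths)


if __name__ == "__main__":
    failed = parse_summary(SUMMARY)["Failed"]
    for suite, count in count_by_suite(failed).most_common():
        print(f"{suite}: {count}")
```

Suites that fail wholesale are the likelier candidates for missing model files, as opposed to individual reference mismatches.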