Reproduce e2e NER experiments

thomasnguyen92 commented 9 months ago

I attempted to replicate a Named Entity Recognition (NER) experiment but encountered several issues during the process.

Firstly, when executing the command python slue_toolkit/prepare/prepare_voxpopuli_nel.py create_manifest to generate manifest files, I noticed that the dev.tsv, fine-tune.tsv, and test.tsv files were merely symbolic links. They were unusable for running the end-to-end NER model. To resolve this, I had to manually copy dev.tsv and fine-tune.tsv from slue-toolkit/manifest/slue-voxpopuli into the e2e_ner directory.
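
For anyone hitting the same thing, the manual copy described above can be scripted. This is only a minimal sketch, assuming the real files sit in manifest/slue-voxpopuli and the links in its e2e_ner subdirectory; the helper name replace_links_with_copies is hypothetical and not part of slue-toolkit.

```python
import shutil
from pathlib import Path


def replace_links_with_copies(src_dir, dst_dir, names=("dev.tsv", "fine-tune.tsv")):
    """Replace symlinks (or missing files) in dst_dir with real copies from src_dir.

    Hypothetical helper, not part of slue-toolkit. Returns the names that were copied.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    copied = []
    for name in names:
        src, dst = src_dir / name, dst_dir / name
        if not src.is_file():
            continue  # no source file to copy from; leave dst untouched
        if dst.is_symlink():
            dst.unlink()  # removes the link itself, never its target
        elif dst.exists():
            continue  # already a regular file, nothing to do
        dst_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        copied.append(name)
    return copied
```

From the repository root it would be called as replace_links_with_copies("manifest/slue-voxpopuli", "manifest/slue-voxpopuli/e2e_ner"). The link is removed before copying so that the copy never writes through the link into its target.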

Additionally, I faced a problem while performing evaluations with the command bash baselines/ner/e2e_scripts/eval-ner.sh w2v2-base test combined nolm. It appears that the processed test files are missing. Could you provide guidance on how to properly prepare these files for evaluation?

ankitapasad commented 8 months ago

Hi @thomasnguyen92

Thank you for your interest in our work!

The test data is not public yet. We'll update the repo when we make it public (soon). Until then, if you'd like the test set evaluated, you can follow the instructions here.

Can you point out the specific step/script that gave you trouble because of the symbolic links?

maherr13 commented 7 months ago

I can provide the steps, as I faced the same problem.

After following the documented steps through python slue_toolkit/prepare/prepare_voxpopuli_nel.py create_manifest, running the command

bash baselines/ner/e2e_scripts/ft-w2v2-base.sh manifest/slue-voxpopuli/e2e_ner save/e2e_ner/w2v2-base

produces the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 107, in run
    return run_job(
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 129, in run_job
    ret.return_value = task_function(task_cfg)
  File "/content/slue-toolkit/baselines/ner/e2e_scripts/fairseq/fairseq_cli/hydra_train.py", line 27, in hydra_main
    _hydra_main(cfg)
  File "/content/slue-toolkit/baselines/ner/e2e_scripts/fairseq/fairseq_cli/hydra_train.py", line 56, in _hydra_main
    distributed_utils.call_main(cfg, pre_main, **kwargs)
  File "/content/slue-toolkit/baselines/ner/e2e_scripts/fairseq/fairseq/distributed/utils.py", line 404, in call_main
    main(cfg, **kwargs)
  File "/content/slue-toolkit/baselines/ner/e2e_scripts/fairseq/fairseq_cli/train.py", line 134, in main
    task.load_dataset(valid_sub_split, combine=False, epoch=1)
  File "/content/slue-toolkit/baselines/ner/e2e_scripts/fairseq/fairseq/tasks/audio_finetuning.py", line 140, in load_dataset
    super().load_dataset(split, task_cfg, **kwargs)
  File "/content/slue-toolkit/baselines/ner/e2e_scripts/fairseq/fairseq/tasks/audio_pretraining.py", line 153, in load_dataset
    self.datasets[split] = FileAudioDataset(
  File "/content/slue-toolkit/baselines/ner/e2e_scripts/fairseq/fairseq/data/audio/raw_audio_dataset.py", line 269, in __init__
    with open(manifest_path, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/content/manifest/slue-voxpopuli/e2e_ner/dev.tsv'

This happens even though the file exists as a symbolic link.
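
A FileNotFoundError on a path that shows up in a directory listing is what a dangling symbolic link would produce, since open() follows the link to its target. One way to check is sketched below; describe_link is a hypothetical helper name, and whether the link is actually dangling in this setup is an assumption, not something confirmed here.

```python
import os


def describe_link(path):
    """Return (is_symlink, raw_target, target_exists) for path.

    Hypothetical diagnostic helper. os.path.exists follows symlinks, so it
    returns False when the path is a link whose target is missing.
    """
    is_link = os.path.islink(path)
    target = os.readlink(path) if is_link else None
    return is_link, target, os.path.exists(path)
```

Calling describe_link("manifest/slue-voxpopuli/e2e_ner/dev.tsv") from the directory where the training script is launched would show where the link points. A relative target is resolved against the directory that contains the link, not the current working directory, so a result such as (True, some relative path, False) means the target is not where the link expects it.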

asappresearch / slue-toolkit

Reproduce e2e NER experiments #40