How to get dimsum_train_dev.jsonl

Yusuke196 commented 4 months ago

Hi, I'm trying to train the bi-encoder by following the README:

PYTHONPATH=. python scripts/training/train.py \
  --max_epochs 15 \
  --batch_size 16 \
  --accumulate_grad_batches 2 \
  --gpus 1 \
  --swa true \
  --gradient_clip_val 1.0 \
  --lr 0.00001 \
  --run_name replicate-top \
  --encoder bert-base-uncased \
  --enable_checkpointing true \
  --mwe_processing true \
  --train_data_suffix fixed.annotated.autoneg

But I hit the following error:

Traceback (most recent call last):
  File "/project/yusuke-i/ghq/github.com/Mindful/MWEasWSD/scripts/training/train.py", line 245, in <module>
    main()
  File "/project/yusuke-i/ghq/github.com/Mindful/MWEasWSD/scripts/training/train.py", line 158, in main
    model.setup_for_train_eval(datasets.manager, mwe_eval=mwe_pipelines)
  File "/project/yusuke-i/ghq/github.com/Mindful/MWEasWSD/resolve/model/pl_module.py", line 49, in setup_for_train_eval
    self.mwe_eval_pipelines = [
  File "/project/yusuke-i/ghq/github.com/Mindful/MWEasWSD/resolve/model/pl_module.py", line 50, in <listcomp>
    (eval_data.value, eval_data.get_evaluator(model=self))
  File "/project/yusuke-i/ghq/github.com/Mindful/MWEasWSD/resolve/training/mwe_eval.py", line 461, in get_evaluator
    data = list(data)
  File "/project/yusuke-i/ghq/github.com/Mindful/MWEasWSD/resolve/training/data.py", line 63, in read_training_sentences
    line_count = fast_linecount(str_path)
  File "/project/yusuke-i/ghq/github.com/Mindful/MWEasWSD/resolve/training/data.py", line 33, in fast_linecount
    with open(filename, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/project/yusuke-i/ghq/github.com/Mindful/MWEasWSD/data/dimsum_train_dev.jsonl'

I think I ran all the preprocessing steps, but dimsum_train_dev.jsonl was not generated. How can I get it?
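
A quick pre-flight check can confirm which expected file is missing before a long training run starts. This is only a sketch: the `data` directory and the file name are taken from the traceback above, and `missing_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_files(data_dir, names):
    """Return the entries of `names` that do not exist as files under `data_dir`."""
    base = Path(data_dir)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    # Directory and file name come from the traceback; run from the repository root.
    for name in missing_files("data", ["dimsum_train_dev.jsonl"]):
        print(f"missing: {name}")
```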

Yusuke196 commented 4 months ago

As a temporary workaround, I'm setting:

    parser.add_argument(
        '--mwe_eval_pipelines',
        type=MWEEvalData,
        nargs='+',
        default=[MWEEvalData.DIMSUM_TEST, MWEEvalData.CUPT_TEST],
    )

to make the code run.

This may suffice for my purpose. Thanks.
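
Because the argument uses `type=MWEEvalData` with `nargs='+'`, argparse converts each command-line string by calling the enum on it (a lookup by value), so the same selection can probably be passed as a flag instead of editing the default. The sketch below demonstrates the mechanism with a stand-in enum; the string values are assumptions and may not match the repository's actual ones.

```python
import argparse
from enum import Enum


class MWEEvalData(Enum):
    # Stand-in for the repository's enum; these string values are hypothetical.
    DIMSUM_TRAIN_DEV = "dimsum_train_dev"
    DIMSUM_TEST = "dimsum_test"
    CUPT_TEST = "cupt_test"


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mwe_eval_pipelines",
        type=MWEEvalData,  # argparse calls MWEEvalData(<string>), i.e. lookup by value
        nargs="+",
        default=[MWEEvalData.DIMSUM_TEST, MWEEvalData.CUPT_TEST],
    )
    return parser


if __name__ == "__main__":
    # Select pipelines explicitly, skipping the one whose data file is missing.
    args = build_parser().parse_args(
        ["--mwe_eval_pipelines", "dimsum_test", "cupt_test"]
    )
    print(args.mwe_eval_pipelines)
```

An unknown string makes the enum lookup raise `ValueError`, which argparse reports as an invalid argument value.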

Mindful commented 4 months ago

First of all, you're right that this is missing from the README. I apologize for that; I seem to have overlooked it entirely.

For context, this data is only used as dev data. You could run the training without it and it would work fine; you would just be missing some metrics. I looked back through my code to work out how I generated this file, but I don't immediately see how. I might be able to figure it out with more time, but I found a copy of the processed DiMSUM data I used on an old machine, so I think the easiest solution is for me to just share that.

dimsum_proc.zip
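
Once downloaded, the archive can be unpacked into the data directory. This is a minimal sketch under stated assumptions: the layout inside the zip is not described here, so the files may need to be moved afterwards so that `dimsum_train_dev.jsonl` sits directly under `data/`, the location the traceback expects. `extract_archive` is a hypothetical helper.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir` and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    archive_path = Path("dimsum_proc.zip")  # assumed download location
    if archive_path.is_file():
        # Print the extracted names to check where the .jsonl files ended up.
        print(extract_archive(archive_path, "data"))
    else:
        print(f"{archive_path} not found")
```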
