bigscience-workshop / evaluation

Code and Data for Evaluation WG

Add WMT dataset #58

Closed jaketae closed 3 years ago

jaketae commented 3 years ago

This PR supersedes #42 by creating a WMTTask class that inherits from AutoTask. All functionality from the previous PR has been carried over. For convenience, the PR description has been copied and pasted from the original thread.


This PR implements the following:

Estimated runtime is ~1 minute on GPU and ~10 minutes on CPU.
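For reference, here is a rough sketch of what a task subclass along these lines might look like. This is illustrative only: the AutoTask import path, the method names, and the dataset split are assumptions, not the PR's actual code.

```python
# Hypothetical sketch only; the real AutoTask interface in this repo may differ.
from datasets import load_dataset

from evaluation.tasks.auto_task import AutoTask  # assumed import path


class WMTTask(AutoTask):
    """Benchmarks a causal LM on WMT19 kk-en (the config seen in the logs below)."""

    @staticmethod
    def get_display_name() -> str:
        return "wmt"

    def evaluate(self) -> None:
        # Same dataset/config that appears in the "Reusing dataset wmt19" log line.
        dataset = load_dataset("wmt19", "kk-en", split="validation")
        ...  # tokenization, generation, and metric computation go here
```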

jaketae commented 3 years ago

While I've checked that the script works as intended, the tokenizer complains that it has been forked by multiprocessing, which is presumably coming from the DataLoader and its num_workers argument.

(venv) jaketae:evaluation (wmt2) $ python3 -m evaluation.eval  --model_name_or_path=gpt2  --eval_tasks wmt --output_dir outputs
08/19/2021 21:09:49 - evaluation - INFO - Beginning evaluation on device cpu
08/19/2021 21:09:49 - evaluation - INFO - Loading model...
08/19/2021 21:10:01 - evaluation - INFO - Benchmarking wmt...
Reusing dataset wmt19 (/Users/jaketae/.cache/huggingface/datasets/wmt19/kk-en/1.0.0/fae232cf0c13b62b26731bafce0810bc652fb5799189790bed836db0cee28056)
  0%|                                                                                                     | 0/14 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)

The script continues past this warning, but it is worth noting. Also, as somewhat expected, the warning does not appear if I do not pass a num_workers argument to the DataLoader.
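For anyone hitting the same warning, the suppression route the message itself suggests looks roughly like the sketch below. This is not code from this PR; the key point is that the environment variable should be set before transformers/tokenizers is imported, which is the commonly recommended ordering.

```python
# Sketch of the suppression workaround, not part of this PR.
import os

# Must be set before the tokenizers library first uses parallelism,
# so the safest place is before importing transformers at all.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer  # imported after the env var on purpose

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer("hello world")["input_ids"])
```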

jaketae commented 3 years ago

Also, as somewhat expected, the warning does not appear if I do not pass a num_workers argument to the DataLoader.

I am going with this solution for now, but I did leave a TODO note for posterity. I can see this becoming a problem, since the tokenizer + DataLoader combo will probably be used in many parts of the repo.
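Concretely, the workaround amounts to leaving the loader single-process. A minimal sketch with a placeholder dataset (not the PR's actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(32))  # placeholder dataset

# Leaving num_workers at its default of 0 keeps data loading in the main
# process, so the tokenizer is never used across a fork and the warning
# does not appear.
loader = DataLoader(dataset, batch_size=8)

for (batch,) in loader:
    pass  # main-process iteration; no worker processes are forked
```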

tianjianjiang commented 3 years ago

While I've checked that the script works as intended, the tokenizer complains that it has been forked by multiprocessing, which is presumably coming from the DataLoader and its num_workers argument.

Just a side note that you may already be aware of this. Besides the dedicated issues in the Hugging Face repos, there's an interesting comment on an AllenNLP issue: https://github.com/allenai/allennlp/issues/5128#issuecomment-830136469

(I totally forgot to comment on this until I ran into it again today.)

jaketae commented 3 years ago

Hey @tianjianjiang, thanks for the heads up. What I'm getting out of that thread is:

  1. The warning is fine and expected.
  2. To suppress the warning, an environment variable has to be set.

AFAIK, there is no safe and guaranteed way of setting an environment variable from within a Python script, which means the only way to suppress the warning is for the user to set the variable in their shell before running the evaluation. Is this the correct interpretation, or are there other alternatives?

tianjianjiang commented 3 years ago

Hey @jaketae

Apologies for the late reply.

  1. The warning is fine and expected.

As long as there's no deadlock, then I think so. One minor consideration: when doing hyperparameter tuning, I prefer runs to be fully deterministic for a fair comparison, so no multiprocessing/threading happens in my experiments at all.

  2. To suppress the warning, an environment variable has to be set.

I guess it may also be possible to reexamine when we call the tokenizer.

AFAIK, there is no safe and guaranteed way of setting an environment variable from within a Python script, which means the only way to suppress the warning is for the user to set the variable in their shell before running the evaluation. Is this the correct interpretation, or are there other alternatives?

I think so. Even if we try to use os.environ, in theory it still isn't bullet-proof.

I suppose the only true alternative is ensuring a preferable call order, if at all possible (see the sketch below). Sometimes it isn't possible, and yet it might still be fine.
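For example (a hypothetical sketch, not code from this repo), one such ordering is to tokenize the whole dataset up front in the main process, so forked workers only ever slice pre-computed tensors and never touch the tokenizer:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

texts = ["a toy example", "another toy example"]  # placeholder data

# Tokenize everything up front, in the main process...
encodings = tokenizer(texts, padding=True, return_tensors="pt")
dataset = TensorDataset(encodings["input_ids"], encodings["attention_mask"])

# ...so the workers forked here never use the tokenizer, avoiding the
# fork warning even with num_workers > 0.
loader = DataLoader(dataset, batch_size=2, num_workers=2)
```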