While I've checked that the script works as intended, the tokenizer complains that it has been forked by multiprocessing, which presumably comes from the `DataLoader` and its `num_workers` argument.
```
(venv) jaketae:evaluation (wmt2) $ python3 -m evaluation.eval --model_name_or_path=gpt2 --eval_tasks wmt --output_dir outputs
08/19/2021 21:09:49 - evaluation - INFO - Beginning evaluation on device cpu
08/19/2021 21:09:49 - evaluation - INFO - Loading model...
08/19/2021 21:10:01 - evaluation - INFO - Benchmarking wmt...
Reusing dataset wmt19 (/Users/jaketae/.cache/huggingface/datasets/wmt19/kk-en/1.0.0/fae232cf0c13b62b26731bafce0810bc652fb5799189790bed836db0cee28056)
  0%|          | 0/14 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
```
The script continues despite this warning, but it is worth noting. Also, as somewhat expected, the warning does not appear if I don't pass a `num_workers` argument to the `DataLoader`.
I am going with this solution for now, but I did leave a TODO note for posterity. I can see this becoming a problem, since the tokenizer + `DataLoader` combo will probably be used in many parts of the repo.
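For reference, here's a minimal sketch of the two workarounds implied by the warning text itself; the `texts` list and `collate` helper are illustrative stand-ins, not the eval script's actual code:

```python
import os

# Per the warning's own suggestion: set the variable before `tokenizers` is
# first used, so no Rust-side parallelism has started by the time the
# DataLoader workers fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token


def collate(batch):
    return tokenizer(batch, padding=True, return_tensors="pt")


if __name__ == "__main__":
    texts = ["Hello world.", "Another example sentence."]  # stand-in data
    # The other observed workaround: num_workers=0 keeps everything in one
    # process, which is why the warning disappears without the env var.
    loader = DataLoader(texts, batch_size=2, num_workers=2, collate_fn=collate)
    for batch in loader:
        print(batch["input_ids"].shape)
```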
Just a side note that you may already be aware of: besides the dedicated issues in the HuggingFace repos, there's an interesting comment on an AllenNLP issue: https://github.com/allenai/allennlp/issues/5128#issuecomment-830136469

(I totally forgot to comment on this until I ran into it again today.)
Hey @tianjianjiang, thanks for the heads up. What I'm getting out of that thread is:

- The warning is fine and expected.
- To suppress the warning, an environment variable has to be set.

AFAIK, there is no safe and guaranteed way of setting an environment variable within a Python script, which means the only way to suppress the warning is for the user to set this variable in the terminal process before they run the evaluation. Is this the correct interpretation, or are there other alternatives?
Hey @jaketae, apologies for the late reply.
> The warning is fine and expected.
As long as there's no deadlock, I think so. One minor consideration: when doing hyperparameter tuning, I prefer runs to be fully deterministic for a fair comparison, so that no multiprocessing/threading happens in my experiments.
> To suppress the warning, an environment variable has to be set.
I guess it may also be possible to reexamine when we call the tokenizer.
> AFAIK, there is no safe and guaranteed way of setting an environment variable within a Python script, which means the only way to suppress the warning is for the user to set this variable in the terminal process before they run the evaluation. Is this the correct interpretation, or are there other alternatives?
I think so. Even if we try to use `os.environ`, in theory it still isn't bullet-proof. I suppose the only true alternative is ensuring a preferable call order if at all possible; sometimes it isn't possible, and yet it might be fine.
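To make the call-order idea concrete, here's a hedged sketch (the `LazyTokenizingDataset` name is hypothetical, not from this repo): the variable is set at the very top of the entry point, and per-sample encoding is deferred into the `Dataset` so it runs only inside the forked workers:

```python
import os

# Only effective if nothing in the process has encoded with a fast tokenizer
# yet; if some earlier import already did, this assignment comes too late,
# which is why os.environ alone isn't bullet-proof.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer


class LazyTokenizingDataset(Dataset):
    """Defers encoding to __getitem__, i.e. into the worker processes, so the
    parent only loads the tokenizer (which, as far as I know, does not by
    itself trigger the Rust-side parallelism) before forking."""

    def __init__(self, texts, tokenizer):
        self.texts = texts
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.tokenizer(self.texts[idx], truncation=True)["input_ids"]


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    dataset = LazyTokenizingDataset(["hello", "world"], tokenizer)
    # collate_fn=list just keeps the raw per-sample lists for the demo
    loader = DataLoader(dataset, batch_size=2, num_workers=2, collate_fn=list)
    for batch in loader:
        print(batch)
```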
This PR supersedes #42 by creating a `WMTTask` class that inherits from `AutoTask`. All functionality from the previous PR has been carried over. For convenience, the PR description has been copied from the original thread.

This PR implements the following:
- `stride` tokens given as context

Estimated runtime: ~1 minute on GPU; ~10 minutes on CPU.
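For context on the `stride` bullet, here's a sketch of the standard sliding-window evaluation loop (essentially the Hugging Face perplexity recipe), where consecutive windows overlap so earlier tokens serve as context but are not re-scored. Whether the PR uses `stride` in exactly this way is my assumption, so treat the loop as illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Stand-in text; the real eval would use the WMT data.
text = " ".join(["The quick brown fox jumps over the lazy dog."] * 200)
encodings = tokenizer(text, return_tensors="pt")

max_length = model.config.n_positions  # 1024 for GPT-2
stride = 512                           # step between consecutive windows
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end           # score only tokens not already scored
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100    # overlapping tokens act as context only

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss  # mean NLL on targets
    nlls.append(loss * trg_len)        # back to an (approximate) sum

    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```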