Did you try to add --compute_dtype fp16?
I just tried:
eole: error: unrecognized arguments: --compute_dtype fp16
Full command: eole predict --src wmt17_en_de/test.src.bpe --model_path wmt17_en_de/bigwmt17/step_50000 --beam_size 5 --batch_size 4096 --batch_type tokens --output wmt17_en_de/pred.trg.bpe --gpu 0 --compute_dtype fp16
Taking an example like this one: https://github.com/eole-nlp/eole/blob/main/recipes/llama2/llama-inference.yaml, can you try to run inference using a yaml file?
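For reference, a minimal inference config along those lines might look roughly like the untested sketch below; the key names are assumed to mirror the CLI flags used above and may differ between eole versions.

    # sketch only -- key names assumed from the CLI flags, not verified against the recipe
    model_path: wmt17_en_de/bigwmt17/step_50000
    src: wmt17_en_de/test.src.bpe
    output: wmt17_en_de/pred.trg.bpe
    gpu_ranks: [0]
    compute_dtype: fp16
    beam_size: 5
    batch_size: 4096
    batch_type: tokens

You would then point eole predict at that file (the llama2 recipe's README shows the exact invocation; the config flag spelling is not reproduced here to avoid guessing).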
Dear Vincent
Thank you for pointing me to the yaml file. I got it to work using: --precision fp16 -gpu_ranks 0
eole predict --src wmt17_en_de/test.src.bpe --model_path wmt17_en_de/bigwmt17/step_50000 --beam_size 5 --batch_size 4096 --batch_type tokens --output wmt17_en_de/pred.trg.bpe --gpu 0 --precision fp16 -gpu_ranks 0
Maybe it would be good to add it to the recipe, in case I am not the only one encountering it.
Love this project btw and I am happy that it is working now :)
Oh, that means you are not on the latest main branch.
see this: https://github.com/eole-nlp/eole/pull/54
Dear Vince
Yes, you were right. I pulled the docker image and it was not on the latest branch. I tried to re-pull the image, but it was still not on the latest branch, so I logged into the docker container's bash and cloned the eole repo again. Now everything works fine.
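For anyone else hitting this, the steps inside the container were roughly as follows (the container name is a placeholder, and the editable install line is an assumption about how the repo is set up):

    # attach to the running container (replace <container_name> with yours)
    docker exec -it <container_name> bash
    # clone the current main branch and install it over the packaged version
    git clone https://github.com/eole-nlp/eole.git
    cd eole
    pip install -e .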
Best, Kai
Indeed, the docker image had not been updated in quite a while. @Kai-Piontek you can check out the 0.0.2 images that were just built if needed!
Thank you Francois for fixing #48; I can now train the model successfully.
But when I want to run inference with eole predict --src wmt17_en_de/test.src.bpe --model_path wmt17_en_de/bigwmt17/step_50000 --beam_size 5 --batch_size 4096 --batch_type tokens --output wmt17_en_de/pred.trg.bpe --gpu 0
I get the error below. What might be wrong?
Best, Kai
[2024-09-08 18:23:26,170 INFO] Loading checkpoint from wmt17_en_de/bigwmt17/step_50000
[2024-09-08 18:23:26,893 INFO] Building model...
[2024-09-08 18:23:26,894 WARNING] You have a CUDA device, should run with -gpu_ranks
[2024-09-08 18:23:26,894 WARNING] You have a CUDA device, should run with -gpu_ranks
[2024-09-08 18:23:26,894 WARNING] You have a CUDA device, should run with -gpu_ranks
[2024-09-08 18:23:26,894 WARNING] You have a CUDA device, should run with -gpu_ranks
[2024-09-08 18:23:26,894 WARNING] You have a CUDA device, should run with -gpu_ranks
[2024-09-08 18:23:26,894 WARNING] You have a CUDA device, should run with -gpu_ranks
[2024-09-08 18:23:26,901 WARNING] You have a CUDA device, should run with -gpu_ranks
[2024-09-08 18:23:26,925 INFO] Loading data into the model
[2024-09-08 18:23:27,136 INFO] Transforms applied: []
Traceback (most recent call last):
  File "/usr/local/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('EOLE', 'console_scripts', 'eole')())
  File "/eole/eole/bin/main.py", line 39, in main
    bincls.run(args)
  File "/eole/eole/bin/run/predict.py", line 42, in run
    predict(config)
  File "/eole/eole/bin/run/predict.py", line 18, in predict
    _, _, _ = engine.infer_file()
  File "/eole/eole/inference_engine.py", line 37, in infer_file
    scores, estims, preds = self._predict(infer_iter)
  File "/eole/eole/inference_engine.py", line 163, in _predict
    scores, estims, preds = self.predictor._predict(
  File "/eole/eole/predict/inference.py", line 454, in _predict
    batch_data = self.predict_batch(batch, attn_debug)
  File "/eole/eole/predict/translator.py", line 121, in predict_batch
    return self._translate_batch_with_strategy(batch, decode_strategy)
  File "/eole/eole/predict/translator.py", line 194, in _translate_batch_with_strategy
    log_probs, attn = self._decode_and_generate(
  File "/eole/eole/predict/inference.py", line 664, in _decode_and_generate
    dec_out, dec_attn = self.model.decoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/eole/eole/decoders/transformer_decoder.py", line 200, in forward
    emb, attn, attn_align = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/eole/eole/decoders/transformer_base.py", line 76, in forward
    layer_out, attns = self._forward(*args, **kwargs)
  File "/eole/eole/decoders/transformer_decoder.py", line 95, in _forward
    self_attn, _ = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/eole/eole/modules/multi_headed_attn.py", line 690, in forward
    return super()._forward2(
  File "/eole/eole/modules/multi_headed_attn.py", line 461, in _forward2
    attn_output = self.flash_attn_func(
  File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 831, in flash_attn_func
    return FlashAttnFunc.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 511, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only support fp16 and bf16 data type
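For context, flash-attn's kernels only accept fp16/bf16 tensors, which is why forcing a half-precision compute dtype at inference time makes the error go away. A minimal sketch of the behaviour (assuming flash-attn is installed and a CUDA device is available; the tensor shapes are arbitrary illustration values):

    import torch
    from flash_attn import flash_attn_func

    # q/k/v have shape (batch, seqlen, nheads, headdim); default dtype is fp32
    q = torch.randn(1, 16, 8, 64, device="cuda")

    try:
        flash_attn_func(q, q, q)  # fp32 inputs are rejected
    except RuntimeError as e:
        print(e)  # "FlashAttention only support fp16 and bf16 data type"

    out = flash_attn_func(q.half(), q.half(), q.half())  # fp16 inputs work
    print(out.dtype)  # torch.float16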