Closed akafen closed 3 years ago
Hi @akafen, thanks for pointing out this issue. I'm currently fixing the bugs in model training and I hope I can give you updated code soon.
@akafen Can you specify the infrastructure specs you are trying to run the setup on, such as the GPUs and the memory?
@NomadXD Of course. I am using one GPU to run the code:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 19%   35C    P0    60W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
But I don't think the error is about the infrastructure specs. The error is that "max sentences" is not an argument of fairseq-train. From the error message:
train_models.py: error: unrecognized arguments: --max-sentences 64
"Max sentences" is an unrecognized argument.
Yes, it's due to a problem with the fairseq version. I'm trying to find a solution to that right now :)
@louismartin I am using fairseq==0.10.2, so should I remove the --max-sentences argument and change it to batch_size?
Yes, that is the solution I just pushed. I also made the training script simpler if you want to train a single model; maybe that can help as well. Tell me if that works on your end and we can close the issue.
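For anyone hitting the same error, here is a minimal sketch of the rename discussed above. It is not the code from the repository: the helper names `rename_max_sentences` and `to_cli_args` and the `train_kwargs` dictionary are hypothetical, and it only assumes that training options are kept in a dictionary before being turned into command-line flags.

```python
def rename_max_sentences(train_kwargs):
    """Return a copy of train_kwargs with 'max_sentences' renamed to 'batch_size'.

    Hypothetical helper for illustration; not part of fairseq or this repo.
    """
    kwargs = dict(train_kwargs)
    if "max_sentences" in kwargs:
        # If batch_size is already set explicitly, keep it and drop the old key.
        kwargs.setdefault("batch_size", kwargs["max_sentences"])
        del kwargs["max_sentences"]
    return kwargs


def to_cli_args(kwargs):
    """Turn {'batch_size': 64} into ['--batch-size', '64'] (hypothetical helper)."""
    args = []
    for key, value in kwargs.items():
        args.extend([f"--{key.replace('_', '-')}", str(value)])
    return args


if __name__ == "__main__":
    old_kwargs = {"max_sentences": 64}
    print(to_cli_args(rename_max_sentences(old_kwargs)))  # ['--batch-size', '64']
```

The original dictionary is left untouched, so the same options can still be passed unchanged to a setup that expects the old argument name.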
@louismartin Any plans on porting this to pure PyTorch? If it were ported to PyTorch, it might be more accessible to people who don't know fairseq.
Hi @Atharva-Phatak,
Thanks for the message. fairseq uses PyTorch. There is no plan to use something else.
@akafen Did it solve your issue?
I'm closing the issue but feel free to open a new issue if you have further questions or problems.
I changed cluster "local" to "debug" in scripts/train_model.py and ran the command "python3 scripts/train_models.py", but it failed. The error:
The code:
When cluster is "local", training fails too.