cantabile-kwok closed this issue 1 year ago
By the way, I'm using a server with the Slurm job scheduler. If I submit a job to run the training program on a remote node, it gives me this error message:
Traceback (most recent call last):
  File "/mnt/lustre/sjtu/home/ywg12/.conda/envs/valle/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/lustre/sjtu/home/ywg12/.conda/envs/valle/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/train.py", line 130, in <module>
    main()
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/train.py", line 121, in main
    trainer.train(
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/utils/trainer.py", line 143, in train
    command = _non_blocking_input()
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/utils/trainer.py", line 91, in _non_blocking_input
    selector = _get_stdin_selector()
  File "/mnt/lustre/sjtu/home/ywg12/remote/code/vall-e/vall_e/utils/trainer.py", line 82, in _get_stdin_selector
    selector.register(fileobj=sys.stdin, events=selectors.EVENT_READ)
  File "/mnt/lustre/sjtu/home/ywg12/.conda/envs/valle/lib/python3.10/selectors.py", line 360, in register
    self._selector.register(key.fd, poller_events)
PermissionError: [Errno 1] Operation not permitted
I suppose this is because the non-blocking stdin handling is not permitted on a remote node. This feature is fancy, but how can I turn it off?
Alright, I guess the initial problem, where the program gets stuck, is simply because the model had already reached 1000 steps, which is the maximum. But I still have the problem with non-blocking input. Looking forward to any help!
You should type 'quit' if you want to exit the process. Maybe you can check the config for your maximum step!
Update: The initial problem was that the training process had already reached its maximum step. I then deleted the non-blocking input code so that it can run on remote servers.
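A less invasive alternative to deleting the non-blocking input code might be to guard the stdin registration. On Linux, epoll rejects file descriptors that do not support polling (regular files, for example) with EPERM, and under a batch scheduler stdin is often not a terminal, which is presumably what happens here. The sketch below is only an illustration under that assumption: the function names get_stdin_selector and non_blocking_input are hypothetical stand-ins modeled on the names in the traceback, not the actual code in vall_e/utils/trainer.py.

```python
import selectors
import sys


def get_stdin_selector():
    """Return a selector watching stdin, or None if stdin cannot be polled.

    Hypothetical stand-in for the trainer's _get_stdin_selector.
    """
    selector = selectors.DefaultSelector()
    try:
        selector.register(fileobj=sys.stdin, events=selectors.EVENT_READ)
    except (OSError, ValueError):
        # PermissionError (a subclass of OSError) is what the traceback shows;
        # ValueError covers a missing or invalid stdin file object.
        selector.close()
        return None
    return selector


def non_blocking_input():
    """Return one line from stdin if it is ready right now, else ''."""
    selector = get_stdin_selector()
    if selector is None:
        # stdin cannot be polled (e.g. in a batch job): behave as "no command".
        return ""
    try:
        ready = selector.select(timeout=0)  # timeout=0 means poll, never block
        if not ready:
            return ""
        return sys.stdin.readline().strip()
    finally:
        selector.close()  # closes the selector only, not stdin itself
```

With a guard like this, an interactive run would still accept commands such as 'quit', while a job whose stdin cannot be polled would simply never see a command instead of crashing.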
Have you figured out a solution for this?
I made some modifications to the code. Specifically, I deleted everything related to the non-blocking stdin. I remember that only one file needs to be changed. @MajoRoth
Hi, and thanks for the great work! I have finished all the preliminary steps and use
python -m vall_e.train yaml=config/test/ar.yml
to train. It outputs something like this:
Then it somehow got stuck there forever. It stayed stuck no matter what I pressed. If I press Ctrl-C, the program just quits with no error message. This is strange, as I never know where the program halts or how long it will leave me waiting.