cmb-chula / pylon

Official implementation of Pyramid Localization Network (PYLON).
Apache License 2.0

Stuck on 'Queue file' while training #1

Closed. jimz7 closed this issue 8 months ago

jimz7 commented 2 years ago

Hello, I really appreciate the code release and the amazing results! I followed the instructions and ran python train_nih_run.py. The first few times, training produced results and my /log directory was updated. After that, whenever I run the training command, it gets stuck at the 'Queue file' step: the training process never seems to start and the /log directory is never updated. How can I solve this problem? I really appreciate the help!

phizaz commented 2 years ago

The mlkit.queue.<n> files are created as part of parallelization. Each process tries to acquire one "lock file". If it can lock a file, it proceeds. If it cannot, it waits for a lock file to be released. Usually, a lock file is released when its owner process ends. However, if the owner process did not stop properly, its lock file is never released, so later runs keep waiting at the 'Queue file' step.
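For readers unfamiliar with the pattern, here is a minimal sketch of how existence-based lock files can leave a stale lock behind. It is not mlkit's actual implementation. The function names, the slot count, and the assumption that a lock is "held" simply because the file exists are all hypothetical, chosen only to illustrate the behaviour described above.

```python
import os


def try_acquire(lock_path):
    """Try to create the lock file atomically; return True on success."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for debugging
    return True


def release(lock_path):
    """Release the lock by deleting the file (skipped if the owner crashes)."""
    os.remove(lock_path)


def acquire_any(directory, n_slots, prefix="mlkit.queue."):
    """Return the path of the first free slot, or None if all are taken.

    The prefix mirrors the file names mentioned in this thread; the slot
    scheme itself is an assumption made for illustration.
    """
    for i in range(n_slots):
        path = os.path.join(directory, f"{prefix}{i}")
        if try_acquire(path):
            return path
    return None
```

In this sketch, a process that is killed before calling release leaves its file in place, so every later call to acquire_any sees that slot as taken. Once all slots are stale, new processes wait forever.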

To solve this, manually remove the mlkit.queue.<n> files from your home directory (~).
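The cleanup can also be scripted. The following is a small sketch, assuming the stale files match the pattern mlkit.queue.* directly inside the home directory, as described above. The function name and the dry_run parameter are hypothetical and not part of mlkit.

```python
import glob
import os


def remove_stale_queue_files(directory=None, dry_run=True):
    """List (and optionally delete) mlkit.queue.* files in a directory.

    Defaults to the home directory. With dry_run=True nothing is deleted,
    so the matches can be reviewed first.
    """
    if directory is None:
        directory = os.path.expanduser("~")
    # Escape the directory so special characters in it are not globbed.
    pattern = os.path.join(glob.escape(directory), "mlkit.queue.*")
    paths = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Call remove_stale_queue_files() first to see what would be removed, then remove_stale_queue_files(dry_run=False) to delete. Run this only when no training process is active, since otherwise a lock that is still legitimately held would be removed as well.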

jimz7 commented 2 years ago

Thanks!