Open coldmanck opened 2 years ago
You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs"; "Your Environment".
@ppwwyyxx I sincerely hope that you and the developers can have a look at my issue. Thank you :)
Follow-up: my issue was solved by implementing my own file_lock() function, which modifies this line by increasing the timeout to 60*60*48, i.e. 48 hours, since my dataset takes around 24 hours to process. I think maybe we shouldn't set a timeout at all (or, for example, set timeout=float('inf')).
But I'm still wondering whether this is the right way to do it.
You're right about the issue relating to the timeout. We didn't anticipate that dataset generation could take more than an hour.
It sounds like a good solution is to make the timeout of file_lock an argument, and call it with a larger timeout value (or an infinite one, if possible) from detectron2.
Thanks for the reply. Yes, I do think the timeout=3600 value should be configurable.
If no one is working on it, can I make a PR? @ppwwyyxx @coldmanck
Main Issues
I've modified plain_train_net.py to run inference on my larger-scale dataset, which contains around 2.6 million images, with pre-trained object detection models. When I run with a smaller amount of data (say, 1,000 images), the code works well. However, when I ran with my full dataset it encountered an error: portalocker.exceptions.LockException: [Errno 11] Resource temporarily unavailable. The full error message is as follows. This happened during the step Trying to convert 'my_dataset' to COCO format ...
Basically, I do not understand what this error message means, especially for my (larger-scale) dataset. I found that the error happened around 60-70 minutes after the above prompt Trying to convert 'my_dataset' to COCO format ... was logged on my screen. I did some research and found that it might be caused by the timeout=3600 set for portalocker.Lock() in Facebook's iopath. So I suspect it means one of my workers held the lock on the json file (the .json.lock file) for more than one hour, and another timed out waiting for it. If that's the case, isn't this timeout an unreasonable choice? How should I fix my issue? Thank you very much!
P.s. I found my error is almost the same as this faq; however, I don't think that solution helps me, as I run the command from scratch, i.e., with no other experiments running and without a leftover json.lock file.
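To make the failure mode concrete, here is a toy model of the behaviour described above (a hypothetical illustration, not portalocker's code): the waiting worker polls for the lock and, once the deadline passes, gives up with an exception, just as the second worker in the COCO-conversion step does after 3600 seconds.

```python
import os
import tempfile
import time


def acquire_lock(lock_path, timeout):
    """Toy model of lock acquisition with a timeout: poll for an
    exclusive lock file and raise once `timeout` seconds have elapsed
    (portalocker raises LockException in the analogous situation)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"gave up waiting for {lock_path}")
            time.sleep(0.05)


# Simulate one worker that is still converting the dataset (it holds the
# lock) while a second worker waits with a short timeout and fails.
lock_path = os.path.join(tempfile.mkdtemp(), "my_dataset.json.lock")
holder = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)  # worker 1
try:
    acquire_lock(lock_path, timeout=0.2)  # worker 2
    timed_out = False
except TimeoutError:
    timed_out = True
os.close(holder)
os.unlink(lock_path)
print("timed out:", timed_out)  # prints "timed out: True"
```

In the real run, "worker 1" is the process converting 2.6 million annotations to COCO format, which takes well over the one-hour default, so every other worker hits the timeout.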
Instructions To Reproduce the Issue and Full Logs
What I did was only 1) write a custom function get_my_dataset_dicts() in plain_train_net.py and 2) run it with 8 V100 GPUs to collect the inference results (bounding boxes & scores).
Your Environment