Closed ryalberti closed 1 year ago
Thanks @ryalberti for reporting the issue. A pull request has been created to solve it.
Resolved in #242
The issue with cats-dogs evaluate is fixed (thank you!); however, I am still seeing the FPN "TypeError: cannot pickle 'dict_keys' object" and "EOFError: Ran out of input" errors for both training and evaluation. Attached are my outputs. They are the same as above, since only cats-dogs was affected. The FPN version did not get far enough through train.py to hit the assert error. I repeated the same process in my fresh install of the environment and got the same results.
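For anyone hitting the same trace: the "cannot pickle 'dict_keys' object" error can be reproduced with a minimal stdlib snippet (the dictionary below is hypothetical; the actual unpicklable object lives somewhere in the dataset/loader code), and converting the view to a list is the usual workaround:

```python
import pickle

d = {"cat": 0, "dog": 1}

# dict_keys is a view object and cannot be pickled, which matters when
# DataLoader worker processes receive their state via pickle.
try:
    pickle.dumps(d.keys())
except TypeError as e:
    print(e)  # cannot pickle 'dict_keys' object

# Converting the view to a plain list makes it picklable.
restored = pickle.loads(pickle.dumps(list(d.keys())))
print(restored)  # ['cat', 'dog']
```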
@ryalberti, are you on the "develop" branch? The problem seems to be related to not having a GPU in the system, but it's not clear under what circumstances it's triggered.
Could you try "develop", please? The problem may persist, but there are a number of fixes, and any new changes will be based on "develop".
Sorry, that screenshot was for the incorrect repo. I have been using develop, and I am not using a GPU.
Please try calling the script with "--workers 0".
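As background on why `--workers 0` can sidestep pickle errors: with zero workers the dataset is used directly in the main process, while with one or more workers (under the "spawn" start method used on Windows) the dataset is pickled and sent to each worker process. A rough sketch of just that pickling step (the `ToyDataset` and helper names here are illustrative, not the project's code):

```python
import pickle

def load_in_main_process(dataset):
    # workers=0: the dataset is used directly; picklability never matters.
    return [dataset[i] for i in range(len(dataset))]

def load_with_workers(dataset):
    # workers>0 with the "spawn" start method: the dataset is pickled and
    # sent to each worker, so unpicklable attributes raise TypeError here.
    pickle.dumps(dataset)
    return [dataset[i] for i in range(len(dataset))]

class ToyDataset:
    def __init__(self):
        # A dict_keys view as an attribute makes the instance unpicklable.
        self.classes = {"cat": 0, "dog": 1}.keys()
    def __len__(self):
        return 2
    def __getitem__(self, i):
        return i

ds = ToyDataset()
print(load_in_main_process(ds))  # works: [0, 1]
try:
    load_with_workers(ds)
except TypeError as e:
    print(e)  # cannot pickle 'dict_keys' object
```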
Could you also check your python version, please? Your environment might be broken somehow.
I entered the --workers command around 30 min ago and I am still stuck at this point. I have also pressed 'enter' occasionally as I know it can sometimes stall if it isn't pressed.
Additionally, I am using Python 3.8.11 in the 3 environments I have been testing this in.
It takes a long time without a GPU. I don't know whether it's practical, but please let it run for a while. Do you see at least one core being used at 100% in the Windows Task Manager/process monitor?
Yep! The utilization is hovering around 95-100% with the occasional drop to 85%. I have no problem running it all day. In past trainings on this computer, I've let the training process run for 2 weeks with minimal issues.
We think we found the issue that required --workers 0 on Windows and macOS. I don't think it will make training any faster without a GPU, but you could try https://github.com/MaximIntegratedAI/ai8x-training/pull/243. For reference, see the training log trained/ai87-pascalvoc-fpndetector-qat8.log in ai8x-synthesis. Only after epoch 2 will there be an mAP better than 0, so it might take a long time to see whether it's working as intended.
I've run the training over the weekend with the update and --workers 0. Attached are my logs so far. Do these results match the expected?
Yes, this looks good. It's learning, and the losses and mAP are reasonable for these early epochs. Compare to trained/ai87-pascalvoc-fpndetector-qat8.log in the ai8x-synthesis repo. The only observation is that each block of 50 shows "Time 72" or even higher for you, whereas it takes only 0.20... with a CUDA GPU (>350x faster).
Thanks so much! I'm glad to have these results figured out.
I am trying to do a test training run of the feature pyramid network model and I am receiving assert errors, one from each of the associated training and evaluation scripts. Screenshots of the outputs are below. The extra line below "Training epoch: ...." is print(train_loader), since I initially assumed the issue was there.
For evaluate_catsdogs, I receive the same assert error; however, I do not have any errors when training.
I initially did this in my preexisting Anaconda training environment (Python 3.8.11) after updating with the new requirements.txt. When I received this error, I made a fresh install of the repo in a new directory with a new Anaconda environment, also using Python 3.8.11. I installed requirements.txt and made sure to run the install for distiller as well. I am working on native Windows 10 without CUDA enabled. I am also using the newest update of the MaximSDK toolchain (updated as of today, 7/7/23), and I am working with the rest of the msdk from the develop branch. I see that other people have posted about using FPN, so I am unsure where this could be happening, since I have tried a clean install.
Thank you!