analogdevicesinc / ai8x-training

Model Training for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

FPN Example Training and Evaluating Errors #239

Closed: ryalberti closed 1 year ago

ryalberti commented 1 year ago

I am trying to do a test training run of the feature pyramid network model and I am receiving assert errors from both the associated training and evaluation scripts. Screenshots of the outputs are below. The extra line below "Training epoch: ...." is print(train_loader), since I initially assumed the issue was there. [screenshots attached]

For evaluate_catsdogs, I receive the same assert error; however, I do not have any errors when training. [screenshots attached]

I initially did this in my preexisting Anaconda train environment (training) using Python 3.8.11 after updating with the new requirements.txt. When I received this error, I made a fresh install of the repo in a new directory with a new Anaconda environment, again using Python 3.8.11. I installed from requirements.txt and made sure to run the install for distiller as well. I am working on native Windows 10 without CUDA enabled. I am also using the newest update of the MaximSDK toolchain (updated as of today, 7/7/23), and I am working with the rest of the msdk from the Develop repo. I see that other people have posted about using FPN, so I am unsure where this error could be coming from, since I have tried a clean install.

Thank you!

ermanok commented 1 year ago

Thanks @ryalberti for reporting the issue. A pull request has been created to solve it.

rotx-eva commented 1 year ago

Resolved in #242

ryalberti commented 1 year ago

The issue with the cats-dogs evaluation is fixed (thank you!); however, for FPN I am still receiving the "TypeError: cannot pickle 'dict_keys' object" and "EOFError: Ran out of input" errors for both training and evaluation. Attached are my outputs; they are the same as above, since only catsdogs was affected. The FPN run did not get far enough through train.py to hit the assert error. I repeated the same process in my fresh install of the environment and received the same results.

training_issue_07.24.23.txt
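
Background note (not from the thread itself): Python's pickle module cannot serialize dict view objects such as dict_keys, and on Windows, PyTorch DataLoader worker processes are started via 'spawn', which pickles the dataset object. A minimal sketch reproducing the TypeError, assuming that is the mechanism behind the error above:

```python
import pickle

d = {"cat": 0, "dog": 1}

try:
    # dict view objects cannot be pickled; this raises
    # TypeError: cannot pickle 'dict_keys' object
    pickle.dumps(d.keys())
except TypeError as exc:
    print(exc)

# Converting the view to a list (or tuple) makes it picklable again
print(pickle.loads(pickle.dumps(list(d.keys()))))
```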

rotx-eva commented 1 year ago

@ryalberti, are you on the "develop" branch? The problem seems to be related to not having a GPU in the system, but it's not clear under what circumstances it's triggered.
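
Background note: a quick, generic way to confirm whether PyTorch sees a GPU in a given environment (not something requested in the thread, just a standard check):

```python
import torch

# False on a CPU-only Windows install such as the one described above
print(torch.cuda.is_available())
# Number of visible CUDA devices (0 without a GPU or a CUDA build of PyTorch)
print(torch.cuda.device_count())
```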

rotx-eva commented 1 year ago

Could you try "develop" please? The problem may persist but there are a number of fixes and any new changes will be based on "develop".

ryalberti commented 1 year ago

Sorry, that screenshot was for the incorrect repo. I have been using develop, and I am not using a GPU. [screenshot attached]

rotx-eva commented 1 year ago

Please try calling the script with "--workers 0"
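
Background note: --workers presumably maps onto the num_workers argument of PyTorch's DataLoader (an assumption about the script's internals, but a common convention). With num_workers=0, batches are loaded in the main process, so no worker processes are spawned and the dataset never has to be pickled. A minimal sketch with a stand-in dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in the real run it is the detection dataset that fails to pickle
data = TensorDataset(torch.randn(8, 3, 32, 32), torch.zeros(8, dtype=torch.long))

# num_workers=0 loads batches in the main process: nothing is spawned or pickled
loader = DataLoader(data, batch_size=4, num_workers=0)

for images, labels in loader:
    print(images.shape, labels.shape)
```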

ermanok commented 1 year ago

Could you also check your python version, please? Your environment might be broken somehow.
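
Background note: a quick way to verify which interpreter a given conda environment is actually using (a generic check, assuming Python 3.8 is the target as in this thread):

```python
import sys

# Should report 3.8.11 in the environments described above
print(sys.version)
assert sys.version_info[:2] == (3, 8), "unexpected Python version"
```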

ryalberti commented 1 year ago

I entered the --workers 0 option around 30 minutes ago and I am still stuck at this point. I have also pressed 'Enter' occasionally, as I know it can sometimes stall if it isn't pressed. [screenshot attached]

Additionally, I am using Python 3.8.11 in all three environments I have been testing this in. [screenshot attached]

rotx-eva commented 1 year ago

I entered the --workers 0 option around 30 minutes ago and I am still stuck at this point. I have also pressed 'Enter' occasionally, as I know it can sometimes stall if it isn't pressed.

It takes a long time without a GPU. I don't know whether it's practical, but please let it run for a while. Do you see at least one core being used at 100% in the Windows Task Manager/process monitor?

ryalberti commented 1 year ago

Yep! The utilization is hovering around 95-100%, with the occasional drop to 85%. I have no problem running it all day; in past trainings on this computer, I've let the training process run for two weeks with minimal issues. [screenshot attached]

rotx-eva commented 1 year ago

We think we found the issue that required --workers 0 on Windows and macOS. I don't think it will make training any faster without a GPU, but you could try https://github.com/MaximIntegratedAI/ai8x-training/pull/243. For reference, see the training log in ai8x-synthesis, trained/ai87-pascalvoc-fpndetector-qat8.log. Only after epoch 2 will there be an mAP better than 0, so it might take a long time to see whether it's working as intended.
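
Background note: since the mAP stays at 0 until after epoch 2, one way to keep an eye on progress is to pull the mAP lines out of the run's log file. A hypothetical helper, assuming only that the relevant log lines contain the string 'mAP' (the exact log format is not verified here):

```python
from pathlib import Path

def map_lines(log_path: str):
    """Yield every line of a training log that mentions mAP."""
    for line in Path(log_path).read_text(errors="ignore").splitlines():
        if "mAP" in line:
            yield line

# Example, using the log file name attached later in this thread:
# for line in map_lines("2023.07.25-141436.log"):
#     print(line)
```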

ryalberti commented 1 year ago

I've run the training over the weekend with the update and --workers 0. Attached are my logs so far. Do these results match what is expected?

2023.07.25-141436.log

rotx-eva commented 1 year ago

Yes, this looks good. It's learning, and the losses and mAP are reasonable for these early epochs. Compare to trained/ai87-pascalvoc-fpndetector-qat8.log in the ai8x-synthesis repo. The only observation is that each block of 50 shows "Time 72" or even higher for you, where it takes only 0.20... with a CUDA GPU (more than 350x faster).

ryalberti commented 1 year ago

Thanks so much! I'm glad to have these results figured out.