Closed lauritowal closed 1 year ago
TODO: Check out the following https://www.geeksforgeeks.org/broken-pipe-error-in-python/
Answer from BASALT Support, 5 October, 21:11:
Miffyli — Today at 8:59 PM
Huuh... That would mean some of the processes just died outright. Have you checked your memory usage and if it approaches using all memory?

walt — Today at 9:02 PM
Memory for CPUs and GPU was actually fine. It does not approach using all memory... Around half was used

Miffyli — Today at 9:03 PM
Hm... Was 2000 batches the limit? The closing code might be a bit derp, where the main thread just quits at the 2000 batches and others crash. I thought I had more cleaner closing for that tho

walt — Today at 9:04 PM
yes. 2000 was the limit: MAX_BATCHES = 2000 if USING_FULL_DATASET else int(1e9)

Miffyli — Today at 9:05 PM
ah yup that probably is the reason. The run finished fine, it just did not clean up the threads as cleanly as I thought it would 😅

walt — Today at 9:06 PM
mm but the training stops though after a few times when that appears

Miffyli — Today at 9:07 PM
Yeap, but that is expected, as it did reach 2000 batches. But yes, if those processes crash like that, the main code will probably hang, trying to join the processes and just halt there 😅. You could add a timeout to the join operation as a quick fix, and if timeout passes, catch the exception with try-except block. Yes, not a real fix and I should look into, but I can not promise when I have time

walt — Today at 9:09 PM
Alright, thanks a lot!!
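A minimal sketch of the suggested quick fix, assuming the data loader workers are `multiprocessing.Process` objects (the issue does not show the actual closing code, so this is a guess). Note that in the standard library `Process.join(timeout)` does not raise when the timeout passes; it simply returns, so the sketch checks `is_alive()` afterwards instead of using try-except. The helper name `join_with_timeout` is hypothetical.

```python
def join_with_timeout(processes, timeout=5.0):
    """Join each worker for at most `timeout` seconds and terminate stragglers.

    `processes` is assumed to be an iterable of multiprocessing.Process-like
    objects (anything with join, is_alive, terminate and a name attribute).
    Returns the names of the workers that had to be terminated.
    """
    terminated = []
    for process in processes:
        # join() returns after `timeout` seconds even if the worker is stuck.
        process.join(timeout)
        if process.is_alive():
            # The worker did not exit in time: stop it so the main code
            # does not hang forever, then reap it.
            process.terminate()
            process.join()
            terminated.append(process.name)
    return terminated
```

Calling `join_with_timeout(workers, timeout=10.0)` in place of a plain loop over `worker.join()` would let the main code continue after the last batch. As noted in the chat, this only works around the unclean shutdown and does not fix it.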
We have set MAX_BATCHES to 1e9 and now save the model every 1000 batches. The error no longer appears.
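The workaround can be sketched roughly as follows. This is an illustration only: `train`, `train_step` and `save_fn` are hypothetical placeholders, not the names used in the actual training script.

```python
def train(batches, train_step, save_fn, save_every=1000):
    """Run train_step on every batch and checkpoint every `save_every` batches.

    `train_step(batch)` and `save_fn(batch_count)` are placeholders for the
    real training and model-saving code. Returns the batch counts at which a
    checkpoint was written.
    """
    saved_at = []
    for batch_count, batch in enumerate(batches, start=1):
        train_step(batch)
        if batch_count % save_every == 0:
            # Periodic checkpoint, so an unclean shutdown loses at most
            # `save_every` batches of progress.
            save_fn(batch_count)
            saved_at.append(batch_count)
    return saved_at
```

With periodic saving, a usable model exists even though the run no longer stops at a fixed batch limit.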
I get the following error when training on the full dataset (which is stored on wombat)