Closed torronen closed 2 years ago
@LittleLittleCloud Would this behavior be fixed (removed) from the data frame PR?
@beccamc No, it won't, as that PR only changes how the row number is counted; it doesn't change how the tmp dataset is created.
@torronen Currently we don't check whether the cancellation token has been triggered in the data processing step where the tmp dataset is created, since in most cases creating the tmp files finishes within seconds. We will see if we can add a cancellation checkpoint to that step.
Meanwhile, thanks for the reminder about the tmp files left over after training. Currently ModelBuilder never removes a tmp file once it has been created, in any case, and that seems to be a mistake. We should at least remove the tmp dataset and models, as they are no longer used once training is completed.
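The cancellation checkpoint discussed above could look roughly like the following. This is a language-agnostic sketch in Python, not ModelBuilder's actual code; the polling interval, the `cancel_event` flag, and the use of `InterruptedError` are illustrative choices.

```python
import os
import tempfile
import threading

def write_tmp_dataset(rows, cancel_event: threading.Event, check_every: int = 1000):
    """Stream rows to a tmp file, polling a cancellation flag periodically.

    Returns the tmp file path, or None (after deleting the partial file)
    if cancellation was requested mid-write.
    """
    fd, path = tempfile.mkstemp(suffix=".tsv")
    try:
        with os.fdopen(fd, "w") as f:
            for i, row in enumerate(rows):
                # Cancellation checkpoint: cheap enough to run every N rows,
                # so a multi-hour read can stop long before the end of the file.
                if i % check_every == 0 and cancel_event.is_set():
                    raise InterruptedError("cancelled while writing tmp dataset")
                f.write("\t".join(map(str, row)) + "\n")
        return path
    except InterruptedError:
        os.remove(path)  # don't leave a partial tmp file behind
        return None
```

The key design point is that the same code path that notices cancellation also deletes the partial file, so disk space is reclaimed even when the read is abandoned early.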
Oh, my mistake, I didn't register that these were %temp% files.
@LittleLittleCloud Interesting, so maybe I am using much bigger datasets than is typical. The first step (everything before the first model is evaluated) almost never completes for me in seconds; it takes at least several minutes and up to hours. Maybe I should consider ways to skip some of the examples while keeping a representative sample of the original phenomena.
@LittleLittleCloud This only applies to large models, I think, but cancellation can also take a long time during training. In this case: a 50 GB dataset, 119 models explored. Then I clicked "Cancel", and it has now taken over 8 hours to complete. My guess is that it is waiting for the last experiment to finish.
The reason it is taking so long is that it is heavily paging:
Anyway, if it still completes one more experiment after "Cancel" is clicked, should that also be taken into account in the Top 5 models?
Right now I think it is not being taken into account, because by that point the Top 5 models have already been printed in the models-explored output:
Even better would be if it could cancel the current experiment immediately, so users don't need to wait.
But this is just fine-tuning for large datasets, and not too big of an issue. The original issue is bigger, because filling the HDD can cause other problems on the system.
Actually, it is a bit annoying on a dev machine. I now have an approximately 650 MB CSV dataset on my dev machine and have been waiting 6 hours for it to complete so I can restart. Binary classification seems to have this issue more often than multiclass, for whatever reason.
@torronen The time might not be spent waiting for the last trial to complete. From the output log, most trials take much less than an hour, so it shouldn't be the last trial. Besides, the training summary (top 5 models) has been presented, which should mean the training part is complete.
I wonder whether the time is spent saving/loading the last model after training is completed? Could you check how long it takes to reload and save the best model? Anyway, thanks for the feedback; we are looking into it.
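One simple way to measure that reload/save cost is to wrap each step in a timer. This is a generic Python sketch; `load_model`, `save_model`, and the file name are placeholders for whatever routines apply, not real ModelBuilder APIs.

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return fn's result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Hypothetical usage with whatever load/save routines apply:
# model = timed("reload best model", load_model, "bestModel.zip")
# timed("re-save best model", save_model, model, "bestModel.zip")
```

If the reload or re-save step accounts for most of the post-cancel wall-clock time, that would support the hypothesis above.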
@LittleLittleCloud OK, I will check the load and save times. You are probably right; on other occasions the cancellation has been quick, even when an iteration had been running for a long time. My dev machine in particular has low RAM, so that could be part of the issue. However, I believe that on one occasion I saw a new training result appear in the output log after the training summary, but I cannot confirm it.
@LittleLittleCloud Here is one case where, according to the Output window, the last experiment finished after the code was generated. See the last line in the screenshot. This seems to be rare, maybe 1 in 20 or less. It could also be something else that merely affects the order of writing to the output. It would not have changed the top results in this case.
For some reason, the log file only has this content:
2021-08-11 17:56:33.7927 INFO Set log file path to C:\Users\inno1\AppData\Local\Temp\MLVSTools\logs\4c1319f4-29dd-43b5-9a39-da472d1b4cff.txt (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
I have run multiple experiments since this one, so the other logs might no longer be available. Sorry, I have not yet checked the load and re-save time, as I got confused about which model it was. I'll update as soon as I encounter it again.
@LittleLittleCloud Last week, a team member observed that the .zip file is generated before the Output window shows the final training results. During this time, the Train tab showed "Cancelling...". I will open a new ticket once I have better information. For now, I don't have other details, and the team member's observation may need to be confirmed on another occasion. So FYI, in case someone else notices similar issues.
@torronen Thanks for the FYI. Did your team member find the .zip file in Solution Explorer, or in the %temp% folder?
Probably the Git Changes window, but we do not have adequate notes on this. I will update if I notice it again.
@JakeRadMSFT FYI
No longer an issue with FLAML AutoML
Describe the bug The Cancel button does not interrupt reading of the dataset; the dataset is read to the end. With low disk space, this may consume all free space and trigger an exception. Temp files are left on disk.
To Reproduce
Expected behavior Pressing the Cancel button should interrupt reading of the dataset. Temp files should be removed when exceptions are thrown.
Additional context May especially affect VMs with low disk space.
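The expected cleanup behavior could be sketched as follows. This is a Python illustration of the general "delete the partial file on any failure" pattern, not ModelBuilder code; the function name and CSV format are hypothetical.

```python
import os
import tempfile

def materialize_dataset(rows):
    """Write rows to a tmp file; delete the partial file if anything fails.

    On success, returns the tmp file path. On any error (e.g. a disk-full
    OSError, or a cancellation raised by the row iterator), the partially
    written file is removed before the exception propagates.
    """
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w") as f:
            for row in rows:
                f.write(",".join(map(str, row)) + "\n")
    except BaseException:
        os.remove(path)  # reclaim disk space before re-raising
        raise
    return path
```

Catching `BaseException` (rather than `Exception`) ensures the cleanup also runs for interrupts such as `KeyboardInterrupt`, which mirrors the cancel-button scenario in this issue.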