dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.

Cancellation of training during dataset reading does not interrupt creation of temp files #1645

Closed: torronen closed this issue 2 years ago

torronen commented 3 years ago

Describe the bug The Cancel button does not interrupt reading of the dataset; the dataset is read to the end. If disk space is low, this can consume all free space and hit an exception. Temp files are left on disk.

To Reproduce

  1. Create dataset bigger than free disk space.
  2. Create new training task.
  3. Click Train.
  4. Click cancel.
  5. Observe that the disk fills up and an "Exception thrown in writing" exception is thrown.
  6. Observe that temp files are left on the disk.

Expected behavior Pressing the Cancel button should interrupt reading of the dataset. Temp files should be removed when an exception is thrown.

Additional context This may especially affect VMs with low disk space.

beccamc commented 3 years ago

@LittleLittleCloud Would this behavior be fixed (removed) by the data frame PR?

LittleLittleCloud commented 3 years ago

@beccamc No, it won't. That PR only changes how the row number is counted; it doesn't change how the tmp dataset is created.

@torronen Currently we don't check whether the cancellation token has been triggered in the data processing step where the tmp dataset is created, since in most cases creating the tmp files finishes within seconds. We will see if we can add a cancellation checkpoint to that data processing step.
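
For illustration, a minimal sketch of what such a checkpoint could look like, assuming the tmp dataset is written by a chunked copy loop; `TempDatasetWriter` and `CreateTempDataset` are hypothetical names, not Model Builder's actual code:

```csharp
using System.IO;
using System.Threading;

internal static class TempDatasetWriter
{
    // Hypothetical sketch: copy the source dataset to a temp file in chunks,
    // checking the cancellation token between chunks so a Cancel click
    // interrupts the copy instead of reading the dataset to the end.
    public static string CreateTempDataset(string sourcePath, CancellationToken ct)
    {
        string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        var buffer = new byte[1024 * 1024]; // copy in 1 MB chunks

        try
        {
            using var source = File.OpenRead(sourcePath);
            using var target = File.Create(tempPath);

            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Cancellation checkpoint: throws OperationCanceledException
                // as soon as Cancel is pressed.
                ct.ThrowIfCancellationRequested();
                target.Write(buffer, 0, read);
            }
            return tempPath;
        }
        catch
        {
            // Don't leave a partial temp file behind on cancellation
            // or on a write failure (e.g. disk full).
            if (File.Exists(tempPath)) File.Delete(tempPath);
            throw;
        }
    }
}
```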

Meanwhile, thanks for the reminder about the tmp files remaining after training. At the moment Model Builder doesn't remove any tmp file once it has been created, in any case, and that looks like a mistake. We should at least remove the tmp dataset and models, as they are no longer used once training is completed.
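
A rough sketch of that cleanup, assuming a helper like the one above; again the names are hypothetical and the training call is elided:

```csharp
using System.IO;
using System.Threading;

internal static class TrainingSession
{
    // Hypothetical sketch: make sure the temp dataset is deleted once
    // training completes, fails, or is cancelled, so it does not linger.
    public static void Run(string sourcePath, CancellationToken ct)
    {
        string tempDataset = TempDatasetWriter.CreateTempDataset(sourcePath, ct);
        try
        {
            // ... train models against tempDataset here ...
        }
        finally
        {
            TryDelete(tempDataset);
        }
    }

    private static void TryDelete(string path)
    {
        try
        {
            if (File.Exists(path)) File.Delete(path);
        }
        catch (IOException)
        {
            // Best effort: leave locked files for a later cleanup pass.
        }
    }
}
```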

beccamc commented 3 years ago

Oh, my mistake, I didn't realize this was about %temp% files.

torronen commented 3 years ago

@LittleLittleCloud Interesting, so maybe I am using much bigger datasets than are generally used. For me, the first step (everything before the first model is evaluated) is almost never completed in seconds; it takes at least a few minutes and up to hours. Maybe I should consider ways to skip some of the examples while keeping a representative sample of the original phenomena.

torronen commented 3 years ago

@LittleLittleCloud This only applies to large models, I think, but cancellation can also take a long time during training. In this case it was a 50 GB dataset with 119 models explored. Then I clicked "Cancel", and it has now taken over 8 hours to complete. My guess is that it is waiting for the last experiment to complete.

The reason it is taking so long is that it is heavily paging: image

Anyway, if it still completes one more experiment after clicking "Cancel", should that also be taken into account in the Top 5 models?

Right now I think it is not taken into account, because the Top 5 models table has already been printed in the models explored output: image

Even better would be if it could cancel the current experiment immediately, so users don't need to wait.

But this is just fine-tuning for large datasets, and not too big of an issue. The original issue is bigger because filling HDD can cause other problems on the system.
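
Conceptually, the difference described above is whether the cancellation token is only checked between trials or also passed into the running trial. A rough sketch of the two behaviours (the trial delegate here is invented for illustration, not ML.NET's actual AutoML interface):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

internal static class ExperimentLoop
{
    // Rough sketch of the two cancellation behaviours discussed above.
    // Each "trial" is a stand-in for one AutoML experiment iteration.
    public static void Run(IEnumerable<Func<CancellationToken, double>> trials,
                           CancellationToken ct)
    {
        foreach (var trial in trials)
        {
            // Checking only between trials means a click on Cancel still
            // waits for the current trial to finish.
            if (ct.IsCancellationRequested) break;

            // Passing the token into the trial as well would let a
            // long-running trial stop early, so users don't wait for hours.
            double metric = trial(ct);
            Console.WriteLine($"Trial finished with metric {metric}");
        }
    }
}
```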

torronen commented 3 years ago

Actually, it is a bit annoying on a dev machine. I now have an approximately 650 MB CSV dataset on my dev machine and have been waiting 6 hours for it to complete so I can restart. Binary classification seems to have this issue more than multiclass, for whatever reason.

LittleLittleCloud commented 3 years ago

@torronen The time might not be spent waiting for the last trial to complete. From the output log most of the trials take much less than 1 hour, so the last trial should as well. Besides, the training summary (top 5 models) has already been presented, which should mean the training part is completed.

I wonder whether the time is instead spent saving/loading the last model after training is completed? Could you check how much time it takes to reload and save the best model? Anyway, thanks for the feedback; we are looking into it.
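
In case it helps with that check, here is a minimal timing sketch using ML.NET's model load/save APIs (assuming the Microsoft.ML package; the file names are placeholders for the model file Model Builder produced):

```csharp
using System;
using System.Diagnostics;
using Microsoft.ML;

internal static class ModelIoTiming
{
    // Measure how long loading and re-saving a trained model takes.
    public static void Main()
    {
        var mlContext = new MLContext();
        var sw = Stopwatch.StartNew();

        // "model.zip" is a placeholder for the model file on disk.
        ITransformer model = mlContext.Model.Load("model.zip", out var inputSchema);
        Console.WriteLine($"Load took {sw.Elapsed}");

        sw.Restart();
        mlContext.Model.Save(model, inputSchema, "model-resaved.zip");
        Console.WriteLine($"Save took {sw.Elapsed}");
    }
}
```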

torronen commented 3 years ago

@LittleLittleCloud OK, I will check the load and save times. You are probably right; on other occasions cancellation has been quick, even when an iteration had been running for a long time. My dev machine in particular has low RAM, so that could be part of the issue. However, I believe on one occasion I did see a new training result appear in the output log after the training summary, but I cannot confirm it.

torronen commented 3 years ago

@LittleLittleCloud Here is one case where the last experiment finished after the code was generated, according to the output window. See the last line in the screenshot. This seems to be rare, maybe 1 in 20 or less? It could also be something else that just affects the order of writing to the output. It would not have changed the top results in this case.

image

For some reason, the log file only has this content: 2021-08-11 17:56:33.7927 INFO Set log file path to C:\Users\inno1\AppData\Local\Temp\MLVSTools\logs\4c1319f4-29dd-43b5-9a39-da472d1b4cff.txt (Microsoft.ML.ModelBuilder.Utils.Logger.Info)

I have run multiple experiments after this one, so the other logs might not be available anymore. Sorry, I have not yet checked the load and re-save time, as I got confused about which model it was. I'll update as soon as I encounter it again.

torronen commented 3 years ago

@LittleLittleCloud Last week, a team member of ours observed that the .zip file was generated before the Output window showed the final training results. During this time the Train tab showed "Cancelling...". I will open a new ticket once I get better information. For now I don't have other details, and the team member's observation may need to be confirmed on another occasion. So this is just an FYI in case someone else notices similar issues.

LittleLittleCloud commented 3 years ago

@torronen Thanks for the FYI. Did your team member find the .zip file in Solution Explorer or in the %temp% folder?

torronen commented 3 years ago

Probably in the Git Changes window, but we do not have adequate notes about this. I will update if I notice it again.

beccamc commented 3 years ago

@JakeRadMSFT FYI

luisquintanilla commented 2 years ago

No longer an issue with FLAML AutoML