Open justinormont opened 5 years ago
Hi, I would like to take this issue if possible.
Sounds great.
This will be a change to the master branch as the AutoML API code has been merged into master. The CLI's CodeGen (which creates the C# project) still lives in the AutoML feature branch.
Perhaps @XiaroanZhang can create the CodeGen part, or walk you through what's needed. @daholste may also be able to offer advice. I'll be out for the next week, but feel free to submit a PR and have folks approve/merge it in.
Thanks so much.
Hi @justinormont, looks like this issue is still open - do you mind if I work on it?
Hi @gokart23 and @zHaytam, if you guys are still interested, go for it!
Has anybody tried to run this locally? For example, something like https://github.com/rambler-digital-solutions/criteo-1tb-benchmark
Related: https://github.com/dotnet/machinelearning/issues/590
The related issue will allow FieldAwareFactorizationMachine to also be added to the multi-class AutoML task. Currently FieldAwareFactorizationMachine is only available for binary classification because it does not implement the interfaces needed for OVA (one-vs-all / one-vs-rest).
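For readers unfamiliar with OVA, here is a minimal sketch of the idea in plain Python: train one binary model per class ("this class vs. everything else") and predict the class whose model gives the highest score. This is not ML.NET code; `CentroidBinaryScorer` and `OneVsAll` are hypothetical names made up for illustration, and the toy centroid scorer merely stands in for whatever binary trainer gets wrapped.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Vector = Sequence[float]


def _mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def _sqdist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidBinaryScorer:
    """Toy binary learner (hypothetical stand-in for any binary trainer).

    A higher score means the point is closer to the positive centroid
    than to the negative centroid.
    """

    def fit(self, X: List[Vector], y: List[bool]) -> "CentroidBinaryScorer":
        self.pos_ = _mean([x for x, label in zip(X, y) if label])
        self.neg_ = _mean([x for x, label in zip(X, y) if not label])
        return self

    def score(self, x: Vector) -> float:
        return _sqdist(x, self.neg_) - _sqdist(x, self.pos_)


class OneVsAll:
    """Wraps a binary learner so it can handle multi-class labels."""

    def __init__(self, make_binary: Callable[[], CentroidBinaryScorer]):
        self.make_binary = make_binary

    def fit(self, X: List[Vector], y: List[Hashable]) -> "OneVsAll":
        self.models_: Dict[Hashable, CentroidBinaryScorer] = {}
        for cls in set(y):
            # Relabel: the current class is positive, all others negative.
            binary_labels = [label == cls for label in y]
            self.models_[cls] = self.make_binary().fit(X, binary_labels)
        return self

    def predict(self, x: Vector) -> Hashable:
        # Pick the class whose binary model is most confident.
        return max(self.models_, key=lambda cls: self.models_[cls].score(x))


X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
y = ["a", "a", "b", "b", "c", "c"]
ova = OneVsAll(CentroidBinaryScorer).fit(X, y)
```

The wrapper only needs each binary model to expose a comparable score, which is roughly the kind of contract the missing interfaces would have to provide.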
@Sandy4321: No one has run that specific set of benchmarks, which sweep over the training dataset size and record {AUC, log-loss, training time, max memory, CPU load}. I would be very interested in seeing the results if someone would like to run it against various ML.NET trainers.
I have run the Criteo 1TB dataset extensively within ML.NET. Beyond the single trainers, it also runs successfully in the AutoML code, as all steps in the process are resilient to datasets well beyond memory size. Streamable trainers will succeed on small machines, and other memory-bound trainers should also succeed on very large-memory machines.
Disabling caching and non-streamable trainers will help small machines succeed for the full Criteo 1TB dataset. Subsampling to reduce the training dataset size, as the benchmark shows, is the other main route to reduce memory usage.
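As a rough illustration of the subsampling route, here is a minimal sketch in plain Python that keeps a random fraction of rows while streaming, so memory use stays constant regardless of dataset size. It is not how ML.NET or the benchmark implements subsampling; the function name `subsample_rows` and its parameters are hypothetical.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def subsample_rows(rows: Iterable[T], keep_fraction: float, seed: int = 0) -> Iterator[T]:
    """Yield each row independently with probability `keep_fraction`.

    Rows are consumed one at a time, so the full dataset is never
    held in memory. The seed makes the sample reproducible.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < keep_fraction:
            yield row


# Example: keep roughly 10% of a stream of 1000 rows.
sample = list(subsample_rows(range(1000), keep_fraction=0.1, seed=42))
```

In practice `rows` would be a lazy reader over the data files (for example, iterating over an open file line by line) rather than an in-memory range.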
FieldAwareFactorizationMachine is a good fit for large datasets like the Criteo 1TB dataset.
Currently FieldAwareFactorizationMachine is not swept over in AutoML.
Task:
Should be easy to just replicate an existing trainer like SDCA: https://github.com/dotnet/machinelearning/blob/d518b587b06ac3896a48646622b0f2169a230855/src/Microsoft.ML.AutoML/TrainerExtensions/BinaryTrainerExtensions.cs#L150-L169