dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Add FieldAwareFactorizationMachine to AutoML #3985

Open justinormont opened 5 years ago

justinormont commented 5 years ago

FieldAwareFactorizationMachine is well suited to large datasets such as the Criteo 1TB dataset.

Currently FieldAwareFactorizationMachine is not swept over in AutoML.

Task:

It should be straightforward to replicate an existing trainer extension such as SDCA: https://github.com/dotnet/machinelearning/blob/d518b587b06ac3896a48646622b0f2169a230855/src/Microsoft.ML.AutoML/TrainerExtensions/BinaryTrainerExtensions.cs#L150-L169
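
For reference, here is a rough sketch of what the new extension class could look like if it mirrors the SDCA pattern at the linked lines. The `ITrainerExtension` member signatures, the `TrainerExtensionUtil`/`TrainerExtensionCatalog` helpers, and the new `SweepableParams.BuildFieldAwareFactorizationMachineParams()` builder are assumptions drawn from that pattern, not a verified implementation:

```csharp
using System.Collections.Generic;
using Microsoft.ML.Trainers;

namespace Microsoft.ML.AutoML
{
    // Sketch only: mirrors the shape of SdcaLogisticRegressionBinaryExtension in
    // TrainerExtensions/BinaryTrainerExtensions.cs. Member signatures and helper
    // names are assumed from that pattern.
    internal class FieldAwareFactorizationMachineBinaryExtension : ITrainerExtension
    {
        public IEnumerable<SweepableParam> GetHyperparamSweepRanges()
        {
            // A new builder would need to be added to SweepableParams, sweeping
            // e.g. LatentDimension, LearningRate, NumberOfIterations,
            // LambdaLinear, and LambdaLatent.
            return SweepableParams.BuildFieldAwareFactorizationMachineParams();
        }

        public IEstimator<ITransformer> CreateInstance(MLContext mlContext,
            IEnumerable<SweepableParam> sweepParams, ColumnInformation columnInfo)
        {
            var options = TrainerExtensionUtil.CreateOptions<FieldAwareFactorizationMachineTrainer.Options>(
                sweepParams, columnInfo.LabelColumnName);
            return mlContext.BinaryClassification.Trainers.FieldAwareFactorizationMachine(options);
        }

        public PipelineNode CreatePipelineNode(IEnumerable<SweepableParam> sweepParams,
            ColumnInformation columnInfo)
        {
            return TrainerExtensionUtil.BuildPipelineNode(
                TrainerExtensionCatalog.GetTrainerName(this), sweepParams,
                columnInfo.LabelColumnName);
        }
    }
}
```

The new extension would presumably also need to be registered alongside the other binary trainers (e.g. in the `TrainerName` enum and the `TrainerExtensionCatalog` mapping), following the same pattern as SDCA.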

zHaytam commented 5 years ago

Hi, I would like to take this issue if possible.

justinormont commented 5 years ago

Sounds great.

This will be a change to the master branch as the AutoML API code has been merged into master. The CLI's CodeGen (which creates the C# project) still lives in the AutoML feature branch.

Perhaps @XiaroanZhang can create the CodeGen part, or walk you through what's needed. @daholste may also be able to offer advice. I'll be out for the next week, but feel free to submit a PR and have folks approve/merge it in.

Thanks so much.

gokart23 commented 4 years ago

Hi @justinormont, looks like this issue is still open - do you mind if I work on it?

mstfbl commented 4 years ago

Hi @gokart23 and @zHaytam , if you guys are still interested, go for it!

Sandy4321 commented 4 years ago

Has anybody tried running this locally? For example: https://github.com/rambler-digital-solutions/criteo-1tb-benchmark

justinormont commented 3 years ago

Related: https://github.com/dotnet/machinelearning/issues/590

The related issue would allow FieldAwareFactorizationMachine to also be added to the multiclass AutoML task. Currently FieldAwareFactorizationMachine supports only binary classification because it does not implement the interfaces needed for OVA (one-vs-all / one-vs-rest).
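
For reference, the OVA wrapping that the multiclass task relies on looks roughly like the following (a minimal sketch; AveragedPerceptron appears only because it already satisfies OVA's requirements, and the column names are just the defaults, not anything specific to this issue):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// OneVersusAll expects a binary estimator whose output fits the standard
// single-feature prediction transformer shape. AveragedPerceptron does;
// FieldAwareFactorizationMachine produces its own transformer type, which is
// why it cannot be wrapped until #590 is resolved.
var ovaTrainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(
    mlContext.BinaryClassification.Trainers.AveragedPerceptron(
        labelColumnName: "Label", featureColumnName: "Features"),
    labelColumnName: "Label");
```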


@Sandy4321: No one has run that specific set of benchmarks, which sweep over the training dataset size and record {AUC, log-loss, training time, max memory, CPU load}. I would be very interested in seeing the results if someone would like to run them against various ML.NET trainers.

I have run the Criteo 1TB dataset extensively within ML.NET. Beyond the individual trainers, it also runs successfully through the AutoML code, since every step in the process is resilient to datasets well beyond memory size. Streamable trainers will succeed on small machines, and other memory-bound trainers should also succeed on machines with very large memory.

Disabling caching and excluding non-streamable trainers will help small machines succeed on the full Criteo 1TB dataset. Subsampling to reduce the training dataset size, as the benchmark does, is the other main route to reducing memory usage.
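
As a rough illustration of the streamable route, a pipeline like the one below never materializes the input in memory (the file name, schema, and single numeric feature column are placeholders, not the actual Criteo setup):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// LoadFromTextFile streams rows from disk on demand; the IDataView does not
// pull the full dataset into memory.
var data = mlContext.Data.LoadFromTextFile<CriteoRow>(
    "day_0.tsv", separatorChar: '\t', hasHeader: false);

// No AppendCacheCheckpoint(...) anywhere, so the pipeline stays streamable and
// no in-memory cache of the (potentially 1TB) input is ever built.
var pipeline = mlContext.BinaryClassification.Trainers.FieldAwareFactorizationMachine(
    new[] { "NumericFeatures" }, labelColumnName: "Label");

var model = pipeline.Fit(data);

// Placeholder schema: the real Criteo data has 13 numeric and 26 categorical
// columns; only a numeric block is shown here to keep the sketch small.
public class CriteoRow
{
    [LoadColumn(0)]
    public bool Label { get; set; }

    [LoadColumn(1, 13), VectorType(13)]
    public float[] NumericFeatures { get; set; }
}
```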