ClimbsRocks / auto_ml

[UNMAINTAINED] Automated machine learning for analytics & production
http://auto-ml.readthedocs.io
MIT License

Comparison with other automatic ML libraries? #230

Closed: sergeyf closed this issue 7 years ago

sergeyf commented 7 years ago

First, thank you very much for the hard work and awesome project. I think it will get a lot of use in my workflow.

I was surveying the landscape of automatic ML solutions, and found your package along with tpot and auto-sklearn. I am trying to figure out what kind of strengths and weaknesses all these packages have. Would you mind discussing what auto_ml does differently and/or better?

Thanks again.

ClimbsRocks commented 7 years ago

Hi @sergeyf !

Glad you found the package.

I can't claim to have used either package extensively, so my quick assessment might be wrong. Both of those packages appear to do great things! Happily, we all seem to focus on different things in our documentation.

Honestly, at the end of the day, model accuracy is going to be pretty similar across all projects. I'm sure each author has very good reason to think their project has an edge in model accuracy, but all of these projects are going to get you to numbers pretty close to each other. Which also means that if you're doing Kaggle competitions or the like, you can probably ensemble a model from each library together into a meta-estimator that's better than any one individually.
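The ensembling idea above can be sketched in a few lines. This is only an illustration under stated assumptions: it does not call auto_ml, TPOT, or auto-sklearn, the function name `average_ensemble` is hypothetical, and the prediction values are made up. It assumes each library has already produced numeric predictions for the same held-out rows, and it uses a plain average as the simplest possible meta-estimator.

```python
from statistics import mean


def average_ensemble(predictions_per_model):
    """Average per-row predictions from several already-trained models.

    predictions_per_model: a list of equal-length lists of floats,
    one inner list per model (hypothetical helper, not a library API).
    """
    if not predictions_per_model:
        raise ValueError("need predictions from at least one model")
    n_rows = len(predictions_per_model[0])
    if any(len(preds) != n_rows for preds in predictions_per_model):
        raise ValueError("every model must predict the same number of rows")
    # zip(*...) groups the i-th prediction of every model together.
    return [mean(row) for row in zip(*predictions_per_model)]


# Made-up predictions on the same three held-out rows, one list per library.
auto_ml_preds = [10.0, 20.0, 30.0]
tpot_preds = [12.0, 18.0, 33.0]
auto_sklearn_preds = [14.0, 22.0, 27.0]

blended = average_ensemble([auto_ml_preds, tpot_preds, auto_sklearn_preds])
print(blended)  # [12.0, 20.0, 30.0]
```

A weighted average or a stacked model trained on the out-of-fold predictions would be the usual next step; the plain mean is just the easiest version to reason about.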

So it really comes down to your use case.

A few things that I'm particularly proud of with auto_ml (not sure if other libraries do these or not)

I'd love to hear your thoughts on the different packages!

sergeyf commented 7 years ago

Thanks for the thorough description! That is very helpful. I don't have many well-formed thoughts about the different packages yet, but I'll certainly get back to you when I have something useful to contribute.

PS, your response might be useful for others who find the package - perhaps it would work well on the readme?

rhiever commented 7 years ago

As the author of TPOT, just a couple notes from your answer:

ClimbsRocks commented 7 years ago

Sweet, thanks for the response @rhiever !

Out of curiosity, have you explored tsfresh at all? They've got the scikit-learn interface. Seems like the kind of thing that TPOT in particular would be great for.

rhiever commented 7 years ago

Haven't looked into tsfresh much, but definitely looks like something TPOT could wrap using a custom configuration. Maybe a PR is in order? :-)

calz1 commented 7 years ago

The automatic feature engineering in Auto_ML was a big decider for me. It automatically handles categorical variables, dates, and NLP on text strings. Last I looked at TPOT, I had to do a lot of preprocessing to get everything into a numeric format.

etemiz commented 6 years ago

Anyone who adds dask-distributed interoperability will win the crowds! @mfeurer @rhiever @ClimbsRocks

byrro commented 6 years ago

@rhiever One important thing that Auto_ML offers (as does auto-sklearn) is a permissive license. TPOT uses the LGPL, which is quite restrictive, and anyone pursuing commercial purposes should stay away from it. Auto_ML and auto-sklearn, on the other hand, use the MIT and BSD 3-Clause licenses, respectively, which are very permissive for almost any kind of usage.

rhiever commented 6 years ago

The only major limitation of the LGPL is the source disclosure clauses. LGPL can still be used in commercial domains. I suspect most TPOT users use TPOT to find a pipeline for their problem and export it, and that generated code falls outside the LGPL disclosure clause.

It's been a long-term goal of mine to rewrite TPOT with an MIT license, but TPOT depends heavily on another project that is LGPL licensed.

byrro commented 6 years ago

Just because it can be used for commercial purposes doesn't mean you should. What are the implications? The LGPL terms are confusing and opaque, and it's very hard to understand what you can really do with LGPL software without compromising intellectual property that you intend to keep proprietary/closed. Anyone seriously doing business with the intention of building IP that is marketable in the future should either A) spend a lot of money on lawyers to make sure everything is done right and the LGPL library won't scare away possible buyers/investors; or B) find another library that's licensed under simple and clear terms, such as BSD, MIT or Apache 2.

rhiever commented 6 years ago

Just a point of clarification: The common AutoML use case doesn't involve packaging the AutoML tool as part of some product, which is the only case in which the LGPL matters. Nearly all use cases I've seen in the wild involve using AutoML to find a pipeline for a problem, and exporting that pipeline to use independently of the AutoML tool. That use case is not affected by the LGPL.

The downside of the more permissive licenses (from a developer perspective) is that they make it easy to take an open source AutoML tool and build an "AutoMLaaS" company around it, effectively cashing in on the developers' hard work without giving anything back to them or the open source community. This has already happened to the auto-sklearn developers, and they were not happy about it.

byrro commented 6 years ago

I totally agree that using an open source project at the core of a SaaS business without giving anything back to the community is a shame, and I wouldn't be happy with it either.

But there are other scenarios where the GPL and its variations could become a problem. Say you're using TPOT in a SaaS, meaning no packaging. All good. A few years later a big customer asks you to run your software on premises. Now you have a big problem: does the GPL allow you to do that without compromising your intellectual property in other areas of the project that interact with the GPL software? Or say you decide to patent other parts of your software that interact with or rely on TPOT... There are lots of subtleties with the GPL and the like that make these questions harder to answer.

You might end up with a reassuring "yes", but my point is: you can't really be sure without spending a fair amount of money on good lawyers to study your particular use case. When you're starting up, you want to avoid this future liability AND avoid spending money on lawyers. Thus, the best option for a small business is to stick with MIT, BSD or Apache 2.0. That's all I'm saying. The LGPL makes it harder for SMEs to work with a library. I'm not saying they can't work with it, I'm just saying: if there's an MIT/BSD alternative, I'll definitely stick with it.