biolab / orange3

🍊 Orange: Interactive data analysis
https://orangedatamining.com

FR/OMISSION: Models' parameters max limits way too low #2935

Closed: dsanalytics closed this issue 5 years ago

dsanalytics commented 6 years ago

Max limits for some widgets are unreasonably low. I do understand your concerns about the curse of dimensionality, high cardinality, etc., but that should be left to the data scientist to decide.

So, having said that, can the following be changed:

1) Discretize widget -> Equal freq - number of intervals: 100 max (I do know that this is very high)
2) Discretize widget -> Equal width - number of intervals: 100 max (same comment)
3) NN widget -> Max Iterations: 1000000 max (like SVM)
4) Preprocess - Discretize -> Equal freq - number of intervals: 100 max (same comment)
5) Preprocess - Discretize -> Equal width - number of intervals: 100 max (same comment)

I'm sure there are other model widgets with way too low max limits as well.

Thank you in advance for your time and help.

[Screenshot: orange-sampling-limits-nn-limits-annotated]

[Screenshot: orange-processing-discretize-limits-annotated]
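
For readers unfamiliar with the two discretization modes named in the request, here is a minimal sketch of how equal-width and equal-frequency cut points differ when 100 intervals are requested. It uses only the Python standard library and is not Orange's implementation; the function names and the skewed sample data are assumptions made up for illustration.

```python
import statistics

def equal_width_cuts(values, k):
    """Return k-1 cut points splitting [min, max] into k equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Return k-1 cut points so each interval holds roughly the same number of values."""
    return statistics.quantiles(values, n=k)

# Hypothetical, strongly skewed data: the squares of 1..1000.
values = [float(i * i) for i in range(1, 1001)]

width_cuts = equal_width_cuts(values, 100)
freq_cuts = equal_frequency_cuts(values, 100)

# Equal-width cuts are evenly spaced over the range; equal-frequency cuts
# crowd together wherever the data is dense (here, at the low end).
print(width_cuts[:3])
print(freq_cuts[:3])
```

With skewed data like this, most equal-width intervals at the high end hold only a few values, while equal-frequency intervals each hold about ten.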

astaric commented 6 years ago

@kernc, I also thought about marking this as a good first issue, but I would hate to see a PR made for this that ends up being rejected because of opposition to raising the limit. @biolab/orange, does anyone have a good reason not to do this, or an alternative approach?

Personally, I do not mind raising the max number of bins for discretization. It might turn out that the slider in Preprocess widget will need to be replaced with another control (it is hard to select an exact number with a large-range slider). For the number of iterations in neural networks, if 1000000 is reasonable (can train a small network in short enough time), I do not mind.

dsanalytics commented 6 years ago

A couple of points regarding @astaric's comment:

In general, if you try to make a product safe for people that do not know what they are doing, you are only going to make it unusable for people that do.

dsanalytics commented 6 years ago

@astaric I'd really appreciate it if you could make these 5 max limit changes at your earliest convenience. Your colleague gave it a thumbs up, which seems to me like approval to go ahead. Again, these are not changes to the default values, but to the max values allowed. Thanks again.

markotoplak commented 6 years ago

@dsanalytics, thank you for your suggestions, they all make sense. We are currently busy with other parts of Orange, but will get to them eventually. You are also welcome to make the changes yourself and submit a pull request.

There is a potential problem regarding the iteration limit in Neural Networks. Currently the user cannot interrupt NN fitting and, if they made a mistake, they would have to close Orange. If we made the computation interruptible and added progress bars, we could make NN both safe and usable. We are looking into it. There might be some problems, though, because scikit-learn does not provide any explicit mechanism for stopping computation.
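
One common way to get interruptible fitting is to run training in small increments and check a cancellation flag between them. The sketch below shows that pattern with the standard library only; `fit_interruptibly`, `one_epoch`, and the toy one-parameter model are hypothetical names for illustration, not Orange or scikit-learn code. With scikit-learn one might drive the loop through an estimator's incremental-fit method instead, but whether that suits the Neural Network widget is an assumption, not something settled in this thread.

```python
import threading

def fit_interruptibly(step, max_iter, stop_event, on_progress=None):
    """Call step() up to max_iter times, stopping early once stop_event is set.

    Returns the number of iterations actually completed.
    """
    done = 0
    for i in range(max_iter):
        if stop_event.is_set():
            break
        step()
        done = i + 1
        if on_progress is not None:
            on_progress(done, max_iter)
    return done

# Toy "model": fit y = w * x by gradient descent, one epoch per step.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
state = {"w": 0.0}

def one_epoch(lr=0.01):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (state["w"] * x - y) * x for x, y in data) / len(data)
    state["w"] -= lr * grad

stop = threading.Event()

def progress(done, total):
    # Stand-in for a user pressing "Cancel" after 50 iterations.
    if done == 50:
        stop.set()

completed = fit_interruptibly(one_epoch, 1_000_000, stop, progress)
print(completed, state["w"])
```

In a GUI, `stop.set()` would be called from a cancel button while the loop runs in a worker thread, and `on_progress` would update a progress bar.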

dsanalytics commented 6 years ago

Thanks @markotoplak. I think you worry too much - just increase the limit and let each data scientist worry about how long it is going to take for their specific problem. Most software packages/libs do not allow stopping, and that's OK. The current limit of only 300 is a far bigger impediment to working with NN in Orange.

janezd commented 6 years ago

I try to not get involved any longer, but I have to comment on that.

In general, if you try to make a product safe for people that do not know what they are doing, you are only going to make it unusable for people that do.

I think Orange is very usable as it is. It is only unusable to people who would like to discretize into 100 bins, although there is nothing that Orange could do with discrete variables with 100 values. :) Scatter plot showing 100 different colors or shapes? (Note that Orange's discrete variables are nominal, not ordinal!) Trees with 100 branches at each node? Naive Bayesian classifier with 100 bins (without much data in any of them)? Logistic regression with 100 indicator variables instead of a single continuous variable? Distances between 100 indicators?

It gets even worse. In God knows how many methods we have assumed that the number of different values is going to be reasonably small. Some methods would simply run forever if you gave them variables with 100 values. Some methods, like induction of binary trees, would reject this and say they won't binarize such features.

Orange imposes reasonable constraints because by letting the user do stupid things, you just postpone the problem -- you get errors downstream, in the sense that either the user gets a nice error message, or Orange crashes (it shouldn't), or methods take forever or, in most cases, they don't do anything useful.
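
To make the "100 indicator variables" point concrete, here is a small standard-library sketch, not Orange code, of what happens when one continuous variable is cut into 100 equal-width bins and one-hot encoded. The helper names and the 300-sample data set are assumptions for illustration.

```python
from collections import Counter

def bin_index(value, lo, hi, k):
    """Map a value in [lo, hi] to an equal-width bin index in 0..k-1."""
    if value >= hi:
        return k - 1  # put the maximum into the last bin
    return int((value - lo) / (hi - lo) * k)

def one_hot(index, k):
    """Return a list of k indicators with a single 1 at position index."""
    row = [0] * k
    row[index] = 1
    return row

# 300 hypothetical samples of a single continuous variable: 0.0 .. 29.9
values = [i / 10 for i in range(300)]
lo, hi, k = min(values), max(values), 100

indices = [bin_index(v, lo, hi, k) for v in values]
rows = [one_hot(i, k) for i in indices]
per_bin = Counter(indices)

print(len(rows[0]))            # columns produced from one original variable
print(max(per_bin.values()))   # largest number of samples in any bin
```

Each sample becomes a row of 100 columns containing a single 1, and with a few hundred samples each bin holds only a handful of them, which is the sparsity the comment above describes.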

In general, if you try to make a product safe for people that do not know what they are doing, you are only going to make it unusable for people that do.

I have a strong feeling that people who try to have 100 bins in Orange do not know what they are doing. :)

The current limit is not there for technical reasons. There is probably no use case for discrete variables with 100 values, at least not in Orange.

However, this is open source and it's Python (no compiling involved), so if any data scientist is experienced enough to know why his case is very special, he can easily open the source code and put a different number in. If he can't do this, then he shouldn't.

dsanalytics commented 6 years ago

Dear professor, thank you for taking the time to comment.

It appears that you did not read my post properly - I clearly noted that I do know that 100 is very high. I also clearly mentioned the curse of dimensionality and high cardinality. So I believe that your insults are misplaced.

Apparently, you somehow know that 10 bins should be enough for any real world problem, while 12, 15, 25 or more is too much.

If you are not approving this change, then for the benefit of the Orange project's reputation among practitioners, who could use it for their work and indirectly promote it that way, please give 50 bins as the max good consideration.

Thank you professor.

janezd commented 6 years ago

If you feel insulted, I already regret getting involved in the first place. I apologize; it was not meant like this, although on a second read I see why it sounded all wrong.

I don't think there's any kind of hard threshold, but considering what Orange can do with discrete attributes I have problems imagining a situation or a widget where one could use attributes with more than 10 or so values. But this is just my opinion and it doesn't need to count for much; I kind of left the ship two months ago and others will have to make the decision.

I just wanted to add my two cents and I apologize again for throwing them too hard.

dsanalytics commented 5 years ago

@janezd Has this been abandoned without being implemented? If so, please reconsider, because it limits Orange a great deal in real-life use.

P.S. I thought that you "... try to not get involved any longer ..."

ajdapretnar commented 5 years ago

@dsanalytics If @janezd does not get involved in Orange, we might as well abandon the project, since he is the main driving force behind it and its very author.

I agree that we should keep the settings as they are. Advanced users can always work with pure Python and its numerous libraries, or any other platform that supports higher values.