coseal / aslib_data

Algorithm Selection scenario data

Fixing OPENML scenario #10

Open mlindauer opened 7 years ago

mlindauer commented 7 years ago

Hi,

I looked into the "old" OPENML scenario and have already fixed some issues. My tool now complains that the order of the columns in algorithm_runs.arff is wrong. Currently it is:

@attribute instance_id STRING
@attribute algorithm STRING
@attribute acc numeric
@attribute repetition numeric
@attribute runstatus {ok, timeout, memout, not_applicable, crash, other}

but it should be:

@ATTRIBUTE instance_id STRING
@ATTRIBUTE repetition NUMERIC
@ATTRIBUTE algorithm STRING
@ATTRIBUTE acc NUMERIC
@ATTRIBUTE runstatus {ok, timeout, memout, not_applicable, crash, other}
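A reordering like the one described above could be sketched in Python as follows. This is only an illustration, not the fix that was actually applied to the scenario: the function names `split_row` and `reorder_arff` are hypothetical, and the sketch assumes a dense (non-sparse) ARFF file whose attribute names contain no spaces or quotes.

```python
def split_row(line):
    """Split one ARFF data row on commas outside quotes, keeping the quotes."""
    fields, current = [], []
    quote, escaped = None, False
    for ch in line:
        if quote:
            current.append(ch)
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None         # closing quote reached
        elif ch in "'\"":
            quote = ch               # opening quote
            current.append(ch)
        elif ch == ",":
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields


def reorder_arff(text, new_order):
    """Return ARFF text with attributes and data columns in new_order."""
    header, attributes, rows = [], [], []
    in_data = False
    for line in text.splitlines():
        stripped = line.strip()
        if in_data:
            # Skip blank lines and % comments inside the data section.
            if stripped and not stripped.startswith("%"):
                rows.append(split_row(stripped))
        elif stripped.lower().startswith("@attribute"):
            attributes.append(stripped)
        elif stripped.lower().startswith("@data"):
            in_data = True
        else:
            header.append(line)      # @relation, comments, blank lines
    # Assumption: the attribute name is the second whitespace-separated token.
    names = [attr.split()[1] for attr in attributes]
    index = [names.index(name) for name in new_order]
    out = header + [attributes[i] for i in index] + ["@data"]
    out += [",".join(row[i] for i in index) for row in rows]
    return "\n".join(out) + "\n"
```

Calling `reorder_arff(text, ["instance_id", "repetition", "algorithm", "acc", "runstatus"])` would produce the second column order listed above, with every data row permuted to match.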

Furthermore, I wonder why every feature is in its own group. Since there are no feature costs and the runstatus is always ok, we could put all features into one group. In the end, this is not a real issue.

Best, Marius

larskotthoff commented 7 years ago

IIRC we put them all into different groups for feature selection -- if everything is in one group, adding one feature is the same as adding all of them.

mlindauer commented 7 years ago

How do we proceed with the other issue? Is it easy for you in R to fix the column order problem? Or who is responsible?

larskotthoff commented 7 years ago

Done. Is everything ok now?

joaquinvanschoren commented 7 years ago

Is there anything I should do?

mlindauer commented 7 years ago

At least I can read the data now (after replacing each ' with "). Generating plots does not really work because the algorithm names are so long. Could we shorten them, e.g., cut the common prefix "weka.classifiers" and move the hyperparameter configuration into a readme?

larskotthoff commented 7 years ago

The hyperparameter configuration is part of the algorithm, otherwise we'd get misleading results when doing selection. Could we instead do something to adjust the plots?

mlindauer commented 7 years ago

Some algorithm names do not even fit in one line on github, e.g. 1160_weka.classifiers.meta.AttributeSelectedClassifier -- -E \weka.attributeSelection.PrincipalComponents -R 0.95 -A 5\ -S \weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1 -W weka.classifiers.trees.J48 -- -C 0.25 -M 2

How would you compress it such that it can fit into a plot? All my attempts so far led to plots where the algorithm names were much longer than the actual plot size. I agree that we should still be able to distinguish between different hyperparameter configurations, but the details could go into the readme. For example, the above name could be shortened to PrincipalComponents_Conf1.

What do you think?
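One way such abbreviations could be generated is sketched below. The scheme is an assumption, not something agreed in this thread: it takes the last dotted component of the text before the first space as the class name and numbers the configurations per class, so the name quoted above would become AttributeSelectedClassifier_Conf1 rather than PrincipalComponents_Conf1. The function name `short_names` is hypothetical, and the input names are assumed to be unique.

```python
from collections import defaultdict


def short_names(algorithm_names):
    """Map each full algorithm name to '<ClassName>_Conf<k>'."""
    counters = defaultdict(int)
    mapping = {}
    for full in algorithm_names:
        head = full.split(" ", 1)[0]          # text before the first space
        class_name = head.rsplit(".", 1)[-1]  # last dotted component
        counters[class_name] += 1
        mapping[full] = f"{class_name}_Conf{counters[class_name]}"
    return mapping
```

The returned mapping doubles as the table that would go into the readme, since it records which full configuration each short name stands for.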

larskotthoff commented 7 years ago

Hmm, we don't have anything for this specified in the data format specification. I'm hesitant to come up with an ad-hoc version just to fix the plots. If this doesn't break anything else, I'd prefer doing this only in the plots, i.e. having labels "algorithm1", "algorithm2" etc and a legend somewhere that tells you what algorithm1 is.
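A minimal sketch of this plot-only relabelling, assuming the algorithm names are available as a list of strings. The helper names `numbered_labels` and `format_legend` are made up for illustration and are not part of any ASlib tool; the scenario data itself stays untouched.

```python
def numbered_labels(algorithm_names, prefix="algorithm"):
    """Return (name -> label, label -> name) for plot labels and a legend."""
    unique = sorted(set(algorithm_names))  # sorted for a reproducible numbering
    legend = {f"{prefix}{i}": name for i, name in enumerate(unique, start=1)}
    labels = {name: label for label, name in legend.items()}
    return labels, legend


def format_legend(legend):
    """Render the legend as one 'label: full name' line per algorithm."""
    return "\n".join(f"{label}: {name}" for label, name in legend.items())
```

The `labels` dictionary would be used for axis ticks, and the text from `format_legend` could go into a caption or a separate legend box.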

joaquinvanschoren commented 7 years ago

For plots I would simply cut off the name after the first n characters and add an ellipsis (...). The main algorithm is always at the beginning of the string.
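The truncation idea can be sketched in a few lines of Python. The function name `truncate_label` and the default of 20 characters are arbitrary choices for illustration.

```python
def truncate_label(name, max_chars=20, ellipsis="..."):
    """Keep the first max_chars characters and mark the cut with an ellipsis."""
    if len(name) <= max_chars:
        return name
    return name[:max_chars] + ellipsis
```

One caveat with this approach: two names that share their first `max_chars` characters, such as two configurations of the same classifier, end up with the same label.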

mlindauer commented 7 years ago

For plots I would simply cut off the name after the nth characters and add an ellipsis (...).

Ok, I will do this. However, I'm still not entirely happy with the solution. If I were to discuss these algorithms in a paper, I would also need to come up with some meaningful abbreviation (instead of "algorithm1" or "algori(...)"). Therefore, I would prefer to solve the problem consistently in this scenario right away.

Maybe we should extend the ASlib format such that the parameter configuration can be specified somewhere. For example, the ASP-POTASSCO scenario has only one solver with 11 different configurations.

larskotthoff commented 7 years ago

That sounds like a good idea. Let's talk more about this in a bigger group.

hhoos commented 7 years ago

On 28 Oct 2016, at 11:04, Lars Kotthoff notifications@github.com wrote:

Hmm, we don't have anything for this specified in the data format specification. I'm hesitant to come up with an ad-hoc version just to fix the plots. If this doesn't break anything else, I'd prefer doing this only in the plots, i.e. having labels "algorithm1", "algorithm2" etc and a legend somewhere that tells you what algorithm1 is.

That sounds like a much better solution than using ellipses, which can get confusing and potentially misleading.

Cheers,

Holger

joaquinvanschoren commented 7 years ago

That would be good. In the OpenML case there can be any number of configurations, but still I can give you a structured representation.

You still need a way to compress that information in your plots, though.

hhoos commented 7 years ago

On 29 Oct 2016, at 22:07, Joaquin Vanschoren notifications@github.com wrote:

That would be good. In the OpenML case there can be any number of configurations, but still I can give you a structured representation.

You still need a way to compress that information in your plots, though.

I understand Lars’s suggestion as saying that, in cases where algorithm (or configuration) names get too long, we call them “algorithm 1”, …, and that we use these labels in plots, along with a specification of what “algorithm 1” etc. really means. IMO, this could be done in the caption.

Would that address your concern, or am I missing something?

Cheers,

Holger

mlindauer commented 7 years ago

Hi everyone,

If nobody has any further objections, we could move the OPENML scenario into the master branch.

Cheers, Marius

joaquinvanschoren commented 7 years ago

Ok by me. Regarding your earlier comments, I could make a new version with fewer missing values, but didn't have time yet.


mlindauer commented 7 years ago

OK, so we will wait for this new version with fewer missing values? Once the scenario is in the master branch, I would like to avoid updating it too often.

larskotthoff commented 7 years ago

I vote for moving it to master now. A new scenario with fewer missing values could be OpenML-2017 or something like that and would probably involve some new algorithms as well.

mlindauer commented 7 years ago

Regarding your earlier comments, I could make a new version with fewer missing values, but didn't have time yet.

@joaquinvanschoren could you please give us a rough estimate of when you will have time to add further features? The underlying question is whether we should wait for it before a new ASlib release, or whether we release the new version first and add your scenario later.

joaquinvanschoren commented 7 years ago

Do you have a GitHub repo to upload the new files?

joaquinvanschoren commented 7 years ago

Is it here? https://github.com/coseal/aslib_data/tree/not_verified/OPENML

mlindauer commented 7 years ago

Yes.