elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Incorrect processing of column names in NumberVectorLabelParser#getTypeInformation() #79

Closed paulk-asert closed 4 years ago

paulk-asert commented 4 years ago

When using NumberVectorLabelParser with labelIndices supplied, getTypeInformation() stops collecting column names once the desired number has been reached, even though some columns have been skipped.

I used this data file: https://www.niss.org/sites/default/files/ScotchWhisky01.txt and designated RowID and Distillery as label indices.

Before the change in PR #78 I see this (note the column names): [image]

After the change I see this: [image]
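To make the report concrete, here is a minimal, self-contained sketch of one plausible reading of the problem. It is not ELKI's actual code: the class `ColumnNameSketch` and both methods are hypothetical, and the header is a shortened version of the whisky columns named in this thread. The "buggy" variant bounds the loop by header position, so skipped label columns eat into the budget; the "fixed" variant bounds it by the number of names actually collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is NOT the NumberVectorLabelParser source.
class ColumnNameSketch {
    // Assumed buggy behaviour: stop after `dim` header positions,
    // even though label columns in that range were skipped.
    static List<String> buggyNames(String[] header, Set<Integer> labelIndices, int dim) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < dim && i < header.length; i++) {
            if (!labelIndices.contains(i)) {
                names.add(header[i]);
            }
        }
        return names;
    }

    // Assumed fix: keep scanning until `dim` non-label names were collected.
    static List<String> fixedNames(String[] header, Set<Integer> labelIndices, int dim) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < header.length && names.size() < dim; i++) {
            if (!labelIndices.contains(i)) {
                names.add(header[i]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String[] header = {"RowID", "Distillery", "Body", "Sweetness", "Smoky"};
        Set<Integer> labels = Set.of(0, 1); // RowID and Distillery are labels
        System.out.println(buggyNames(header, labels, 3)); // [Body]
        System.out.println(fixedNames(header, labels, 3)); // [Body, Sweetness, Smoky]
    }
}
```

With two label columns in front, the position-bounded loop yields only one of the three expected names, which is the kind of mismatch the issue title describes.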

kno10 commented 4 years ago

Thank you. Merged.

kno10 commented 4 years ago

FYI, since I've seen you work on Groovy: at some point I had a package that added some syntactic sugar (inline operations, operators, or something like that) for Groovy, but I can't find it right now.

paulk-asert commented 4 years ago

Cool. I would be interested to see that if you can track it down. FYI, I have an ELKI/Groovy example (using above data as it happens) here: https://github.com/paulk-asert/groovy-data-science/blob/master/subprojects/Whiskey/src/main/groovy/KMeans_Elki.groovy

kno10 commented 4 years ago

The code would likely become simpler if you used ELKIBuilder more, as it falls back to default parameters in many cases.

paulk-asert commented 4 years ago

Good suggestion, I updated.

kno10 commented 3 years ago

You may even be able to do this (untested):

def cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey',
            'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
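// Resolve the CSV from the classpath
def file = getClass().classLoader.getResource('whiskey.csv').file
def db = new ELKIBuilder(StaticArrayDatabase)
  .with('parser.labelIndices', '0,1') // treat the first two columns as labels
  .with('dbc.in', file)               // input file for the database connection
  .build()
// Must be called explicitly before the database is used
db.initialize()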

Depending on what level you want to use the API.

Getting rid of the easy-to-forget call to "initialize" is on my todo list. The database must not be initialized in the constructor, and for benchmarking it is best to separate initialization from algorithm run time, but it's easy to auto-initialize when the user did not call this explicitly.
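The auto-initialization idea described above can be sketched as a lazy-initialization guard. This is only an illustration under stated assumptions: `LazyInitSketch` and its methods are hypothetical stand-ins, not ELKI's Database API. The constructor does no loading, an explicit `initialize()` remains available so benchmarks can time it separately, and accessors fall back to initializing on first use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a database class; NOT part of ELKI.
class LazyInitSketch {
    private final List<String> source;
    private List<String> rows; // stays null until initialize() has run

    // The constructor only stores the source; no loading happens here.
    LazyInitSketch(List<String> source) {
        this.source = source;
    }

    // Explicit initialization, so callers can time it separately from algorithms.
    void initialize() {
        if (rows != null) {
            return; // already initialized, nothing to do
        }
        rows = new ArrayList<>(source);
    }

    boolean isInitialized() {
        return rows != null;
    }

    // Accessors auto-initialize if the caller forgot the explicit call.
    int size() {
        initialize();
        return rows.size();
    }

    public static void main(String[] args) {
        LazyInitSketch db = new LazyInitSketch(List.of("row1", "row2"));
        System.out.println(db.isInitialized()); // false
        System.out.println(db.size());          // 2
        System.out.println(db.isInitialized()); // true
    }
}
```

The trade-off in this sketch is that the first accessor call pays the loading cost when `initialize()` was skipped, which is why keeping the explicit call still matters for benchmarking.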

paulk-asert commented 3 years ago

That indeed works fine. Added, thanks!