elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

No data type found satisfying: NumberVector,field AND NumberVector,variable #39

Closed bastian-wur closed 6 years ago

bastian-wur commented 6 years ago

Hi everyone,

I'm currently trying to use ELKI for clustering some rather big data... or better said, I'd like to use it, if it would let me. I've used it before, and it worked, but now something is going wrong. I've cut the input down to a file consisting of 10 rows with 7 columns of float numbers (attached, absolute_counts_per_contig.csv.percentages_per_row.head_10_columns_7.txt), and I also used the latest ELKI version (just cloned it a minute ago), and I still get an error.

The command + error is the following:

14:30:44 bastian@computer:~$ java -Xmx11G -jar /exports/mm-hpc/bacteriologie/bastian/tools/elki/elki-bundle-0.7.2-SNAPSHOT.jar KDDCLIApplication -dbc.in /exports/mm-hpc/bacteriologie/bastian/data/absolute_counts_per_contig.csv.percentages_per_row.head_10_columns_7.csv -out /exports/mm-hpc/bacteriologie/bastian/data/elki_results/kmeans_perc_per_row_maxiter_10000/2/ -algorithm clustering.kmeans.KMeansLloyd -kmeans.k 2 -kmeans.maxiter 10000   -evaluator clustering.internal.EvaluateDaviesBouldin,clustering.internal.EvaluatePBMIndex,clustering.internal.EvaluateSquaredErrors,clustering.internal.EvaluateVarianceRatioCriteria,clustering.internal.EvaluateSimplifiedSilhouette -parser.colsep \\t -resulthandler ResultWriter 
No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=5,maxdim=7 LabelList
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=5,maxdim=7 LabelList
        at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:123)
        at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:79)
        at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:100)
        at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:109)
        at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:58)
        at de.lmu.ifi.dbs.elki.application.AbstractApplication.runCLIApplication(AbstractApplication.java:184)
        at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.main(KDDCLIApplication.java:93)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at de.lmu.ifi.dbs.elki.application.ELKILauncher.main(ELKILauncher.java:77)

It looks like there is some issue with parsing the columns... but I really cannot see it; all columns in all rows have values. Any advice on what could be going wrong?

Thanks, Bastian

kno10 commented 6 years ago

It fails to parse the following two numbers: 0.11753855483588400432 and 0.28323801662637815291. The reason probably is that these are too precise for a double, and our parser then gives up and handles them as strings (unfortunately without better error handling; it is supposed to report a precision overflow error). The closest doubles apparently are 0.117538554835884 and 0.28323801662637815. Why does your data have 20 decimal digits of precision, when a double only provides about 16? Do you need that extra precision? Or is it just used for alignment purposes in the csv file?

Our parser reads the decimal digits into a long. If that overflows, it fails. And 11753855483588400432, the decimal digits of the first number above, exceeds 2^63. Our parser handles all 18-digit and most 19-digit numbers, and a double only provides about 16 digits, so usually we have some safety margin there...
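
To illustrate the overflow described above, here is a minimal sketch in Java. It is not ELKI's actual parser code; the class and method names (DigitOverflowSketch, fitsInLong) are made up for this example. It accumulates a run of decimal digits into a long and reports whether that overflows:

```java
public class DigitOverflowSketch {
    /**
     * Hypothetical helper (not part of ELKI): accumulates ASCII digits into a
     * long and returns false as soon as the value no longer fits.
     */
    public static boolean fitsInLong(String digits) {
        long value = 0L;
        try {
            for (int i = 0; i < digits.length(); i++) {
                int d = digits.charAt(i) - '0';
                if (d < 0 || d > 9) {
                    throw new IllegalArgumentException("not a digit: " + digits.charAt(i));
                }
                // multiplyExact/addExact throw ArithmeticException on overflow
                value = Math.addExact(Math.multiplyExact(value, 10L), d);
            }
            return true;
        } catch (ArithmeticException overflow) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 15 decimal digits, as in the closest double: fits easily
        System.out.println(fitsInLong("117538554835884"));
        // 20 decimal digits from the input file: larger than Long.MAX_VALUE
        System.out.println(fitsInLong("11753855483588400432"));
        System.out.println(Long.MAX_VALUE);
    }
}
```

Long.MAX_VALUE is 2^63 - 1 and has 19 digits, which is why every 18-digit run fits, while 19-digit runs fit only up to that value.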

I would accept a patch to ParseUtil#parseDouble (check for PRECISION_OVERFLOW) if it does not degrade performance. Otherwise, I would prefer a patch to improve error handling in the number vector parser that catches the precision overflow and (at least) outputs a warning. Clearly, treating these numbers as strings can be very confusing. The main motivation for the string fallback is that, for integers, such very long values usually indicate some kind of identifier column, and then automatically treating them as strings is actually helpful...

bastian-wur commented 6 years ago

aaaaah, okay, thanks, that makes sense.

There's no real reason why I have 20 decimal places. I just had multiple tools which failed to parse scientific notation for the numbers, so I set my script to an arbitrary number of decimal places which I thought should be high enough to represent most of the numbers without rounding them to 0. I did not consider the actual floating point precision lol (now I feel stupid). I haven't tested yet whether changing the precision will fix the issue, but I absolutely believe you -> issue closed. Thanks for the quick help :).

kno10 commented 6 years ago

In 748252bc686a96a21b9bd10861fc7b43f572a44e I added a warning, "Too many digits in what looked like a double number - treating as string", which is issued when a too-long float is interpreted as a string instead.
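
For older builds without this warning, the input can be checked before it is handed to ELKI. The following is a rough sketch, not ELKI code: the class LongNumberCheck, its methods, and the threshold of 18 digits are assumptions for illustration (the threshold is taken from the "all 18 digit" remark earlier in this thread). The count is deliberately conservative and includes leading zeros.

```java
public class LongNumberCheck {
    /** Assumed safe limit, based on the "all 18 digit numbers" remark above. */
    static final int MAX_SAFE_DIGITS = 18;

    /** Counts the digits of a token up to (not including) any exponent part. */
    public static int mantissaDigits(String token) {
        int count = 0;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == 'e' || c == 'E') {
                break; // exponent digits do not add precision
            }
            if (c >= '0' && c <= '9') {
                count++;
            }
        }
        return count;
    }

    /** Hypothetical check: true if the token may be too long to parse as a double. */
    public static boolean looksTooLong(String token) {
        return mantissaDigits(token) > MAX_SAFE_DIGITS;
    }

    public static void main(String[] args) {
        // One tab-separated row mixing values quoted in this thread
        String row = "0.11753855483588400432\t0.117538554835884\t8.04921828675072e-6";
        for (String token : row.split("\t")) {
            if (looksTooLong(token)) {
                System.out.println("suspicious value: " + token);
            }
        }
    }
}
```

Run on the sample row, only the first value is flagged; the shortened and the exponential forms pass.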

StatguyUser commented 6 years ago

I am trying to cluster sparse data from a doc2vec model using the KMeans algorithm. The data has 60000*300 dimensions, and the values have an average length of 22 characters, for example 0.00000804921828675072. When I cluster this dataset, I get the error below:

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
    at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
    at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
    at [...]

Is there any data type I should select in parser.vector-type which can handle this data, or should I do anything else to fix this and be able to run it successfully?

kno10 commented 6 years ago

For such data, exponential formatting is more appropriate, and should work.

I.e., 8.04921828675072e-6 is the common way of storing such data.

StatguyUser commented 6 years ago

Exponential formatting in the input CSV file, or is there a setting in ELKI for that?

kno10 commented 6 years ago

As regular notation does not contain an "e", it will autodetect exponential notation. It literally just reads the number, and when encountering e-6 at the end, multiplies it by 10^-6. It is very common, everybody uses it, so it cannot be turned off, but is always on.
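
As a small sketch of what that means for the input file (plain Java, independent of ELKI; the class ExponentNotationSketch and its method toExponential are made up for this example): both spellings denote the same double, and a long plain-decimal value can be rewritten in exponential form with 17 significant digits, which is enough to round-trip any double.

```java
import java.util.Locale;

public class ExponentNotationSketch {
    /**
     * Hypothetical helper: rewrites a plain decimal string in exponential
     * notation with 17 significant digits (1 before, 16 after the point).
     */
    public static String toExponential(String plain) {
        double value = Double.parseDouble(plain);
        // Locale.ROOT keeps '.' as the decimal separator on every system
        return String.format(Locale.ROOT, "%.16e", value);
    }

    public static void main(String[] args) {
        String plain = "0.00000804921828675072";
        String exponential = "8.04921828675072e-6";
        // Both strings parse to the same double value
        System.out.println(Double.parseDouble(plain) == Double.parseDouble(exponential));
        // Rewritten form suitable for writing back to a CSV file
        System.out.println(toExponential(plain));
    }
}
```

Whether a given tool accepts every variant of the exponent spelling (for example a zero-padded exponent) is worth checking on a small sample file first.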