elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

issue parsing polygons using SimplePolygonParser #88

Closed frewenta closed 3 years ago

frewenta commented 3 years ago

The format description for loading polygon data from a file (class SimplePolygonParser) seems to suggest the following format for 2D polygon data: "One record per line, points separated by whitespace, numbers separated by colons. Multiple polygons components can be separated using --". (I am trying to exercise generalized DBSCAN with polygons here.) Following that description, I tried:

0.60:0.43 0.45:0.1 0.4:0.4 -- 0.6:0.32 0.45:0.2 0.56:0.43

but this fails with the following error:

Task failed elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,variable Available types: DBID PolygonsObject ExternalID

Looking at the parser source code, in SimplePolygonParser.java the COORD matcher uses this regex pattern (backslashes and asterisks restored where the page formatting dropped them):

^([+-]?(?:\d+\.?|\d*\.\d+)?(?:[eE][-]?\d+)?),\s*([+-]?(?:\d+\.?|\d*\.\d+)?(?:[eE][-]?\d+)?)(?:,\s*([+-]?(?:\d+\.?|\d*\.\d+)?(?:[eE][-]?\d+)?))?$

but this would seem to match something of the form

"0.456,0.45" (comma-separated, this matches the pattern but seems to hit other issues)

and not

"0.456:0.45" (colon-separated)

for a 2D polygon vertex
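A quick way to check this is to run both candidate tokens against the pattern outside ELKI. The sketch below is not the ELKI code: the class name CoordPatternCheck and the NUMBER/COORD constants are hypothetical, and the pattern is written under the assumption that the dots are escaped and the separators are `,\s*`.

```java
import java.util.regex.Pattern;

public class CoordPatternCheck {
  // Assumed number sub-pattern (optional sign, digits with optional fraction, optional exponent).
  static final String NUMBER = "[+-]?(?:\\d+\\.?|\\d*\\.\\d+)?(?:[eE][-]?\\d+)?";

  // Two or three numbers separated by a comma and optional whitespace.
  static final Pattern COORD = Pattern.compile(
      "^(" + NUMBER + "),\\s*(" + NUMBER + ")(?:,\\s*(" + NUMBER + "))?$");

  /** Returns true if the token looks like a comma-separated 2D or 3D coordinate. */
  static boolean isCoordinate(String token) {
    return COORD.matcher(token).matches();
  }

  public static void main(String[] args) {
    System.out.println("0.456,0.45 -> " + isCoordinate("0.456,0.45"));
    System.out.println("0.456:0.45 -> " + isCoordinate("0.456:0.45"));
  }
}
```

Under that assumption, the comma-separated token is accepted and the colon-separated one is rejected, which is consistent with the observation above.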

A one-line example of the input format for polygons in the documentation would be very useful.

kno10 commented 3 years ago

The error you are reporting ("No supported data type") is not related to the parser. The distance function or algorithm that you are using expects vectors, not polygons. As it finds only labels and polygons, but no coordinate vectors, it fails to find a supported data type.

Please customize the polygon parser to your input format - it is not some standard format, and the code has not been used for a long time; it was used for visualization only, if I recall correctly. Also, you will likely need to implement the GDBSCAN predicates as desired for the polygon input type, or a suitable distance function for polygons to use with distance-based predicates in DBSCAN.
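As a starting point for such a customization, here is a minimal standalone sketch of parsing one line in the colon-separated format from the question. It deliberately does not use any ELKI classes: the class name ColonPolygonLine and the method parse are hypothetical, and turning the resulting vertex lists into ELKI polygon objects (and wiring this into a parser subclass) is left out because it depends on the ELKI version in use.

```java
import java.util.ArrayList;
import java.util.List;

public class ColonPolygonLine {
  /**
   * Parses one record such as "0.6:0.43 0.45:0.1 -- 0.6:0.32 0.45:0.2".
   * Vertices are separated by whitespace, x and y by a colon,
   * and polygon components by a standalone "--" token.
   */
  static List<List<double[]>> parse(String line) {
    List<List<double[]>> polygons = new ArrayList<>();
    List<double[]> current = new ArrayList<>();
    for (String token : line.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue; // blank input line
      }
      if (token.equals("--")) {
        // Component separator: close the current polygon, if any.
        if (!current.isEmpty()) {
          polygons.add(current);
          current = new ArrayList<>();
        }
        continue;
      }
      String[] parts = token.split(":");
      if (parts.length != 2) {
        throw new IllegalArgumentException("Expected x:y but got: " + token);
      }
      current.add(new double[] {
          Double.parseDouble(parts[0]), Double.parseDouble(parts[1]) });
    }
    if (!current.isEmpty()) {
      polygons.add(current);
    }
    return polygons;
  }

  public static void main(String[] args) {
    String line = "0.60:0.43 0.45:0.1 0.4:0.4 -- 0.6:0.32 0.45:0.2 0.56:0.43";
    List<List<double[]>> polygons = parse(line);
    System.out.println("components: " + polygons.size());
  }
}
```

Parsing is only half of the work: as noted above, the algorithm still needs predicates or a distance function that accept polygons rather than number vectors.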

The documentation seems to be incorrect; it probably meant "comma", not "colon".