altoxml / schema

ALTO XML schema - latest and all former versions

Restrict PointsType to a well defined format #80

Open cipriandinu opened 1 year ago

cipriandinu commented 1 year ago

This topic is derived from https://github.com/altoxml/schema/issues/49. In the previous issue we focused on changing the documentation and announcing the PointsType restrictions; in this topic we will discuss how to implement those restrictions for version 5.0.

mittagessen commented 1 year ago

The most common form of PointsType, and the one used in all of the examples, seems to be x1 y1 x2 y2 ... x_n y_n. As such I'd suggest standardizing on this format and explicitly defining it as a string containing an even number of floats, with a regex like this (non-functional, just a sketch):

([0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?\s+[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)+

It might also be advantageous to split PointsType into two different types: one for simple point sequences such as BASELINE, which require at least two points, and one for Polygon, which requires at least three points.
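To make the two-type idea concrete, here is a minimal Python sketch. It reuses the unsigned number form from the regex sketch above and only accepts whitespace as a separator. The names `points_pattern`, `BASELINE_POINTS`, `POLYGON_POINTS` and `is_valid` are made up for illustration and are not part of the ALTO schema; an XSD pattern facet would also need its own syntax check.

```python
import re

# Unsigned decimal number with optional exponent, as in the sketch above.
NUMBER = r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?"
# One coordinate pair: two numbers separated by whitespace.
PAIR = rf"{NUMBER}\s+{NUMBER}"


def points_pattern(min_points: int) -> re.Pattern:
    """Compile a pattern that requires at least `min_points` pairs."""
    # The first pair is mandatory; the rest are repeated min_points - 1 or more times.
    return re.compile(rf"{PAIR}(?:\s+{PAIR}){{{min_points - 1},}}")


# Hypothetical split of PointsType into two stricter types.
BASELINE_POINTS = points_pattern(2)  # a line needs at least two points
POLYGON_POINTS = points_pattern(3)   # a polygon needs at least three points


def is_valid(pattern: re.Pattern, value: str) -> bool:
    """True only if the whole string matches (no leading/trailing whitespace)."""
    return pattern.fullmatch(value) is not None
```

With these definitions, `is_valid(BASELINE_POINTS, "0 0 10.5 3")` is true, while `is_valid(POLYGON_POINTS, "0 0 10 0")` is false because only two points are given. A string with an odd number of values, such as `"0 0 1"`, is rejected by both.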

cipriandinu commented 8 months ago

I propose the following. It is a bit more complex, but it looks like it works. It should handle numbers written as integers or as floating point values such as 2.4, .5, 1.832e-5, 1.322E-5, 3.4e+3, 2.3E+3, 2.4E5 or 8.9e2, with the two values of a pair separated by either a comma or whitespace, and it requires at least two pairs (useful for a basic baseline, not for a polygon shape). This would match what we stated in version 4.4.

(\d?\d(.\d+([eE][-+]?\d+)?)?\s[,\s]\s\d?\d(.\d+([eE][-+]?\d+)?)?\s+)+\d?\d(.\d+([eE][-+]?\d+)?)?\s[,\s]\s\d?\d(.\d+([eE][-+]?\d+)?)?
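For anyone who wants to try candidate patterns against the cases listed above, here is a small Python sketch of the described behaviour. It is not the regex above but an independent approximation, and it assumes that pairs are separated from each other by whitespace only. `NUM`, `PAIR` and `POINTS` are illustrative names.

```python
import re

# Integer or decimal (including forms like ".5"), with optional exponent.
NUM = r"(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?"
# Inside a pair, the two values are separated by a comma or by whitespace.
PAIR = rf"{NUM}(?:\s*,\s*|\s+){NUM}"
# Assumption: consecutive pairs are separated by whitespace; at least two pairs.
POINTS = re.compile(rf"{PAIR}(?:\s+{PAIR})+")

# Number forms listed in the comment above.
EXAMPLES = ["2.4", ".5", "1.832e-5", "1.322E-5", "3.4e+3", "2.3E+3", "2.4E5", "8.9e2"]

# Collect any listed number form the NUM pattern fails to accept.
rejected = [e for e in EXAMPLES if re.fullmatch(NUM, e) is None]


def is_points(value: str) -> bool:
    """True if the whole string is a sequence of at least two pairs."""
    return POINTS.fullmatch(value) is not None
```

Running the listed number forms through `NUM` leaves `rejected` empty. `is_points("1,2 3,4")` and `is_points("1 2 3 4")` both hold, while a single pair such as `"1,2"` is rejected.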