gbif / parsers

Various GBIF parsers for dates, countries, language, taxon ranks, etc
Apache License 2.0

Numbers with commas are ambiguous #23

Open MattBlissett opened 4 years ago

MattBlissett commented 4 years ago

Many European languages use a comma as a decimal separator: €1,50

This is currently parsed by the NumberParser, but is ambiguous when English usage has the comma as a thousands separator: £1,500

Sometimes the number is unambiguous (1,234.56 = 1.234,56, since using both separators fixes the role of each), but outside these cases I think accepting commas potentially introduces errors: we don't know if 1,100m is on a mountain or by the sea.
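
To make the distinction above concrete, here is a rough sketch of a classifier that separates the unambiguous forms from the ambiguous one. It is not the existing NumberParser: the class `SeparatorAmbiguity`, its `Kind` enum, and the rule that grouping always comes in blocks of exactly three digits are assumptions made for illustration. It also assumes any unit suffix (such as the `m` in 1,100m) has already been stripped.

```java
// Hypothetical sketch, not part of gbif/parsers.
// Assumption: thousands groups are always exactly three digits.
class SeparatorAmbiguity {

    enum Kind { PLAIN_INTEGER, DECIMAL_POINT, DECIMAL_COMMA, THOUSANDS_COMMA, AMBIGUOUS, UNRECOGNISED }

    static Kind classify(String raw) {
        if (raw == null) {
            return Kind.UNRECOGNISED;
        }
        String s = raw.trim();
        String sign = "[+-]?";

        if (s.matches(sign + "\\d+")) {
            return Kind.PLAIN_INTEGER;
        }
        // A lone dot is read as a decimal point (assumed English default).
        if (s.matches(sign + "\\d+\\.\\d+")) {
            return Kind.DECIMAL_POINT;
        }
        // Both separators present: the last one must be the decimal mark.
        if (s.matches(sign + "\\d{1,3}(,\\d{3})+\\.\\d+")) {
            return Kind.DECIMAL_POINT;   // e.g. 1,234.56
        }
        if (s.matches(sign + "\\d{1,3}(\\.\\d{3})+,\\d+")) {
            return Kind.DECIMAL_COMMA;   // e.g. 1.234,56
        }
        // One comma followed by exactly three digits: could be either reading.
        if (s.matches(sign + "\\d{1,3},\\d{3}")) {
            return Kind.AMBIGUOUS;       // e.g. 1,100
        }
        // Two or more comma groups can only be thousands grouping.
        if (s.matches(sign + "\\d{1,3}(,\\d{3}){2,}")) {
            return Kind.THOUSANDS_COMMA; // e.g. 1,234,567
        }
        // Any other single comma cannot be valid grouping, so it is a decimal comma.
        if (s.matches(sign + "\\d+,\\d+")) {
            return Kind.DECIMAL_COMMA;   // e.g. 1,50
        }
        return Kind.UNRECOGNISED;
    }
}
```

Counting how often each `Kind` occurs across real datasets would be one way to measure how common the ambiguous case actually is.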

This needs an investigation into how prevalent decimal commas are. It's easy to export numbers with decimal commas from Excel if the locale is set accordingly.

mdoering commented 4 years ago

I live in one of those countries and this is by far the nastiest locale difference. I would expect even careful people to mostly keep their native locale and export decimals with a comma.

But shouldn't this at least be consistent across an entire dataset? I would suggest keeping a dataset-specific setting that we can manually activate to use a comma-based decimal. In that case we need to pass a preferred decimal delimiter into the parser.
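
As a rough sketch of what passing in a preferred decimal delimiter could look like: the class `DelimiterAwareNumberSketch` and its `parse` method are hypothetical and do not reflect the current NumberParser signature. The caller (for example, a per-dataset setting) supplies the decimal separator, and the other of `.` and `,` is treated as a grouping character and dropped.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Hypothetical sketch, not the gbif/parsers API.
class DelimiterAwareNumberSketch {

    /**
     * Parses raw using the given decimal separator ('.' or ',').
     * The other character is assumed to be a grouping separator and is removed
     * without checking that it sits at valid three-digit positions.
     */
    static Optional<BigDecimal> parse(String raw, char decimalSeparator) {
        if (decimalSeparator != '.' && decimalSeparator != ',') {
            throw new IllegalArgumentException("decimal separator must be '.' or ','");
        }
        if (raw == null) {
            return Optional.empty();
        }
        char grouping = (decimalSeparator == ',') ? '.' : ',';

        // Drop grouping characters, then normalise the decimal mark to '.'.
        String normalised = raw.trim()
                .replace(String.valueOf(grouping), "")
                .replace(decimalSeparator, '.');
        try {
            return Optional.of(new BigDecimal(normalised));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

With this, `parse("1,100", ',')` reads the value as 1.1 while `parse("1,100", '.')` reads it as 1100, so the mountain-or-sea question is settled by the dataset setting rather than guessed per value.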