Open-Historical-Map-Labs / openhistoricaltiles

First iteration of vector tiles from OHM Planet data
BSD 2-Clause "Simplified" License

Time slider in a world of divergent date formats #37

Closed danrademacher closed 5 years ago

danrademacher commented 5 years ago

As we have known from our first engagement with this work, dates come in many different formats in OHM.

Our code thus far has assumed YYYY-MM-DD, and ideally the community would settle on a single standard and go with that. But uncertainty is part of the picture, so what do we do about that?

Here's an example of an area with lots of year-only dates in 1830: http://www.openhistoricalmap.org/edit#map=18/49.54794/17.73503

Timeslider demos here are blank because dates don't conform to our code's expectations. Do we start down the road of trying to guess dates from different formats? Or do we potentially look to combine Date Errors highlight work with timeslider?

daelba commented 5 years ago

The reason for year-only tagging is that we know only the year when the source map was created, i.e. the year when we are sure that the building existed.

There are two options now:

  1. Change the code and evaluate year-only dates as xxxx-01-01 or xxxx-12-31.
  2. Change the data: all dates to 1830-01-01 or 1830-12-31. However, since in many (maybe most) cases we know only the year of creation/mapping, the "incorrect" tagging will arise repeatedly. So if we choose the second option, an automatic correction of year-only dates would be necessary.

danrademacher commented 5 years ago

Considering the shortened elements here: https://en.wikipedia.org/wiki/ISO_8601

The standard also allows for calendar dates to be written with reduced accuracy. For example, one may write "1981-04" to mean "1981 April". The 2000 version allowed writing "--04-05" to mean "April 5" but the 2004 version does not allow omitting the year when a month is present. One may simply write "1981" to refer to that year or "19" to refer to the century from 1900 to 1999 inclusive. Although the standard allows both the YYYY-MM-DD and YYYYMMDD formats for complete calendar date representations, if the day [DD] is omitted then only the YYYY-MM format is allowed. By disallowing dates of the form YYYYMM, the standard avoids confusion with the truncated representation YYMMDD (still often used).

So for dates, we could comply with the standard while accounting for imprecision: the first 4 characters must be YYYY, the first 7 YYYY-MM, and the full 10 YYYY-MM-DD.

danrademacher commented 5 years ago

Let's have timeslider maker @gregallensworth weigh in and see what he thinks about using ISO 8601 YYYY, YYYY-MM, and YYYY-MM-DD in the slider code and filters but not going fully into trying to parse out other date formats. Maybe that's a workable middle way that also doesn't force false precision on the source data.

gregallensworth commented 5 years ago

If the TimeSlider were to continue filtering by year only, then the dates themselves could also be years only. The omission of month or day would be just fine; in fact, if the year were our only resolution it would be arguable whether there were a need for month/day inputs at all.

However, I expect that later phases will want to deliver higher-resolution behavior, down to date. And there, partial dates fall very flat.

MBGL's Filtering

MBGL's filtering system is a fairly simplistic set of filters, basically just GTE >=, LTE <=, EQ ==, and NE !=. No ability is provided to check the length of the date string, to parse it with a regular expression, to modify a copy of the string for comparison purposes, and so on. Just simple string comparisons, on the data as given.

The date comparison is a simple string comparison using >= and <=.

However, these string comparisons fall flat when we have malformed dates (including truncated dates). For example: 2017 is not within the range of 2017-01-01 - 2017-12-31 and therefore would not match the comparison.
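The failure mode is easy to reproduce outside MBGL. The sketch below uses plain Python string comparison as a stand-in for MBGL's >= and <= filters; that equivalence is an assumption made for illustration, and `in_range` is a hypothetical helper, not part of MBGL.

```python
def in_range(value: str, range_start: str, range_end: str) -> bool:
    """Lexicographic range check, mimicking a >= / <= string filter."""
    return range_start <= value <= range_end


# A complete ISO date inside 2017 matches the year's range.
print(in_range("2017-06-15", "2017-01-01", "2017-12-31"))  # True

# Truncated dates that are a prefix of the lower bound sort before it.
print(in_range("2017", "2017-01-01", "2017-12-31"))     # False
print(in_range("2017-01", "2017-01-01", "2017-12-31"))  # False
```

Note that some truncated values (for example "2017-06") happen to land inside the range, so the behavior is inconsistent rather than uniformly broken.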

So, in the generated vector tiles being served out to consumers, a proper YYYY-MM-DD is an absolute requirement, if we intend to move forward into higher time resolutions. If this were not a goal, then simple years would be fine, and some existing code could be simplified.

Filling Missing Month/Day in DB or Vector Tile Generation

In theory, some processing step of importing the OHM data into the vector tile system, and/or the views and queries used to produce vector tiles, could introduce false month/day components where they are missing.

If a date of 2000 were a start_date then it would be changed into 2000-01-01, and if it were an end_date then it would become 2000-12-31. Thus, the vector tiles would always have proper YYYY-MM-DD dates and be amenable to filtering at whatever resolution.
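As a rough illustration of that padding rule, here is a minimal Python sketch. The function name `pad_date` is hypothetical, it only accepts four-digit CE years in YYYY, YYYY-MM, or YYYY-MM-DD form, and it does not validate a day that is already present. Values such as "about 1970" are rejected rather than guessed at.

```python
import calendar
import re

# YYYY, optionally followed by -MM, optionally followed by -DD.
_PARTIAL_ISO = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def pad_date(value: str, is_end: bool = False) -> str:
    """Pad a truncated date to YYYY-MM-DD.

    Start dates are padded to the earliest possible day,
    end dates to the latest possible day.
    """
    match = _PARTIAL_ISO.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported date format: {value!r}")
    year = int(match.group(1))
    if match.group(2):
        month = int(match.group(2))
    else:
        month = 12 if is_end else 1
    if match.group(3):
        day = int(match.group(3))
    elif is_end:
        # monthrange returns (weekday of first day, number of days in month).
        day = calendar.monthrange(year, month)[1]
    else:
        day = 1
    return f"{year:04d}-{month:02d}-{day:02d}"


print(pad_date("2000"))                   # 2000-01-01
print(pad_date("2000", is_end=True))      # 2000-12-31
print(pad_date("2000-02", is_end=True))   # 2000-02-29
```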

R&D would be required to come up with a mechanism to do this. There are scores of tables and views managed by the OpenMapTiles import process, and also date format issues such as "about 1970" which could throw a wrench into some processes, as well as a need for the process to be fairly swift and probably done at runtime instead of as a postprocessing SQL query. R&D needed, but this is probably the best way forward.

daelba commented 5 years ago

Completing dates to YYYY-MM-DD during import into the vector tile system seems like the best solution.

danrademacher commented 5 years ago

Gregor and I just worked through a better approach here: A process by which we convert all dates to floats as a "sortable date" element.

@gregallensworth will write some notes here on what that approach would mean.

gregallensworth commented 5 years ago

I have been thinking about the time slider, and the larger context outside of four-digit years after the year 0 CE.

The Basic Need: Dates are Numbers

The discussion so far has been about ISO 8601 dates, specifically so client-side filtering could continue to use Mapbox GL's simple filtering system, which lacks regular expressions, function callbacks, etc. and offers little more than >= and <= comparisons.

The discussion so far has been about handling truncated dates, potentially rounding them out to proper YYYY-MM-DD strings. But I suspect that this may be trying to plaster over a larger crack.

String comparisons are limited in their scope: as strings, 999 is greater than 1234 and -999 sorts after -100. If dates could be represented as numbers, then comparisons would be straightforward regardless of the CE/BCE break.
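A quick Python check shows the difference; Python's string ordering is used here as an illustrative stand-in for any character-by-character comparison.

```python
# As strings, comparison proceeds character by character.
print("999" > "1234")    # True: "9" sorts after "1"
print("-999" > "-100")   # True: 999 BCE would sort after 100 BCE

# As numbers, the ordering is chronological.
print(999 < 1234)        # True
print(-999 < -100)       # True
```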

Converting Dates to Numbers

Requirements for a date-to-number conversion system:

Candidate Algorithm: Julian Date

Shortcoming:

References:

https://en.wikipedia.org/wiki/Julian_day

https://aa.usno.navy.mil/data/docs/JulianDate.php

https://www.postgresql.org/docs/9.1/functions-formatting.html

http://adampresley.github.io/2009/12/10/from-gregorian-to-julian-dates-in-javascript.html

https://www.php.net/manual/en/function.gregoriantojd.php

https://pypi.org/project/julian/

Scope of Implementation

OHM Editing: Dates must be limited to no earlier than 4713 BCE.

Server-side: Vector tile outputs would include new output fields: start_date_julian and end_date_julian or similar, being the date fields in Julian Date.

Client-side: Time slider would convert the currently selected date(s) (at present two dates, the Jan 1 and Dec 31 of the selected year) into Julian Date, and perform comparison against the Julian date equivalents. This being a mathematical calculation, BC/BCE dates should be no problem.
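To make the client-side step concrete, here is a minimal Python sketch. It uses a widely published integer formula for converting a proleptic Gregorian date to a Julian Day Number, with astronomical year numbering (year 0 is 1 BCE, year -1 is 2 BCE). The function names are hypothetical, the property names follow the start_date_julian / end_date_julian suggestion above, and whole-day numbers are used rather than fractional Julian Dates.

```python
def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number for a proleptic Gregorian calendar date."""
    a = (14 - month) // 12          # 1 for January and February, else 0
    y = year + 4800 - a             # shift so the year count starts in March
    m = month + 12 * a - 3          # March = 0 ... February = 11
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)


def feature_visible(feature: dict, selected_year: int) -> bool:
    """True if the feature's lifespan overlaps the selected year."""
    range_start = gregorian_to_jdn(selected_year, 1, 1)
    range_end = gregorian_to_jdn(selected_year, 12, 31)
    return (feature["start_date_julian"] <= range_end
            and feature["end_date_julian"] >= range_start)


building = {
    "start_date_julian": gregorian_to_jdn(1830, 1, 1),
    "end_date_julian": gregorian_to_jdn(1876, 12, 31),
}
print(gregorian_to_jdn(2000, 1, 1))     # 2451545
print(feature_visible(building, 1850))  # True
print(feature_visible(building, 1900))  # False
```

Because the comparison is purely numeric, a feature spanning years -500 to -250 is handled the same way as one in the 1800s.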

This is separate from the discussion of handling missing day-month or day components from a date which the OHM editor is entering. That could go either way:

bertdeb commented 5 years ago

I am very interested in this conversation as we have faced similar issues with WikiWar. Due to the nature of the Defacto Social Research Engine upon which WikiWar is built, we also have to contend with fictional calendars (Star Dates, Shire Reckoning, Westerosi time, etc.). In the end we determined that projects would need to specify the calendar in use, and in the case of non-ISO 8601 dates, provide a mechanism for specifying the rules (schema) of the custom calendar.

Getting back to OHM, there may be value in requiring a calendar=julian tag or something similar. This opens up the question of whether OHM should be specifying calendar schemas, but if a given schema is posted on the wiki so people understand what a given date with calendar=julian actually means, it could solve this immediate problem.

It also opens the door to specifying additional calendars (like Islamic calendar AH/BH, various Chinese calendars, etc.) down the road...

mojodna commented 5 years ago

❤️ converting dates from whatever calendar (defaulting to Gregorian if no calendar= tag is present) is used for date stamps in the tags into Julian Dates for use in vector tile consumers. If people want to include Julian Dates when tagging, awesome, otherwise we handle the conversions and maybe end up a bit off.

I don't like using -01-01 for circa dates, but I think we already have a tagging solution in place (truncated dates) so we don't need to impose structure. Maybe conversion to circa=<days> or something?

@bertdeb how (if it does) does WikiWar handle approximate dates, potentially ranging over multiple years?

danrademacher commented 5 years ago

Discussed this just now with GreenInfo dev team and @gregallensworth is going to proceed with the changes needed to use Julian date as the sortable fields for start and end, while leaving untouched the source data in OHM. And for any "info" features or displays of an element that show dates, we would show whatever is in the source. So "2012" or "2012-05" would be displayed as such.

For now, we're assuming Gregorian dates as the default and deferring to future iterations further refinements that would allow for other calendar types. But having a single sortable format should create at least a path to future additional methods to deal with divergent formats, etc.

timwaters commented 5 years ago

I think we do want to map older things than 4713 BCE. Basically all prehistoric things as well as most topographical objects. Can there be negative Julian dates? If not, we shouldn't really limit what people can map due to limited support of applications.

With the OHM parsing code at https://github.com/OpenHistoricalMap/osm2pgsql/blob/master/ohm_tags_transform.lua we converted string dates into integers rather than the Postgres date format because of this limitation: Postgres couldn't store old dates, but with integers it was still possible to sort and filter.

gregallensworth commented 5 years ago

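Support for negative Julian dates is variable depending on the implementation.

  • JavaScript supports negative Julian Days back to some 500,000 BCE.
  • PHP will silently cap to 0 (Jan 1 4713 BCE) but not generate an error.
  • PostgreSQL generates an error due to timestamp being out of range: SELECT to_char('0001-01-01'::date - interval '5000 years', 'J');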

As such, @timwaters's concern about dates before 0 J is valid, if OHM is expected to work with such dates.

Looking at that Lua code, the year is split off and stored as an integer. That keeps things nice and simple, if we could accept year as the best resolution. Ideally, we want to express start/end dates in a way amenable to filtering down to the day.

Candidate Algorithm: Year as Float

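Idea here is to treat years as float/decimal numbers, with the month/day forming the decimal portion.

  • 1999-01-01 would be 1999.000000 as a start_date and 1999.002739 as an end_date, covering a 24-hour period.
  • The 24-hour period of 2000-01-01 would be 2000.000000 through 2000.0027322 since 2000 was a leap year with 366 days.
  • BCE years would be counted "backwards" so that December is less negative than January, e.g. -473.000 would be December 31 of 473 BCE and -473.99726 would be January 1 473 BCE.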

Such float dates would have unequivocal comparisons, and could express geological spans.
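A minimal Python sketch of that conversion follows, limited to CE years in the proleptic Gregorian calendar. This is not code from any existing library: `decimal_date`, `day_of_year`, and `is_leap_year` are hypothetical names, complete YYYY-MM-DD input is assumed, and the BCE convention is deliberately left out.

```python
def is_leap_year(year: int) -> bool:
    """Gregorian leap year rule: 2000 is a leap year, 1900 and 2100 are not."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def day_of_year(year: int, month: int, day: int) -> int:
    """1-based ordinal day within the year (January 1 is day 1)."""
    days_in_month = [31, 29 if is_leap_year(year) else 28, 31, 30, 31, 30,
                     31, 31, 30, 31, 30, 31]
    return sum(days_in_month[:month - 1]) + day


def decimal_date(iso_date: str, is_end: bool = False) -> float:
    """Convert YYYY-MM-DD (CE years only) to a decimal year.

    A start date maps to the beginning of its day,
    an end date to the end of its day.
    """
    year, month, day = (int(part) for part in iso_date.split("-"))
    days_in_year = 366 if is_leap_year(year) else 365
    ordinal = day_of_year(year, month, day)
    elapsed_days = ordinal if is_end else ordinal - 1
    return year + elapsed_days / days_in_year


print(decimal_date("1999-01-01"))  # 1999.0
print(decimal_date("1999-01-01", is_end=True) > decimal_date("1999-01-01"))  # True
```

Under this sketch the end of 1999-12-31 and the start of 2000-01-01 map to the same value, 2000.0, which is one of the edge cases a real implementation would need to decide on.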

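Concerns and challenges:

  • Not a known mechanism with established acceptance and implementations. I could only find one implementation, for R, and it is limited to positive ISO dates and internally uses Unix epoch.
  • Creating an implementation sounds not difficult, but is probably more nuanced than it sounds off the cuff:
    • Existing date libraries may be ill-equipped to help with some of this heavy lifting, e.g. tm_yday for a year in 100,000 BCE implies that the underlying library could support this date so as to be able to calculate the number of days in that year... So we would probably have to roll our own Pl/PgSQL implementation of tm_yday.
    • Nuances of the Gregorian calendar, e.g. the math for 1999, 2000, and 2100 is slightly different due to the nuances of leap years. Not problems, just nuances to heed when writing it.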

References: https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/decimal_date

jeffreyameyer commented 5 years ago

I agree that we want to map stuff older than 0 Julian, but suggest that will be a very small % of what we want to map out of the gates. Suggest we take a raincheck on solving that now, in the interest of shipping something sooner. Does that work?

On Tue, May 14, 2019, 7:27 PM Greg Allensworth notifications@github.com wrote:

Support for negative Julian dates is variable depending on the implementation.

  • JavaScript supports negative Julian Days back to some 500,000 BCE
  • PHP will silently cap to 0 (Jan 1 4713 BCE) but not generate an error.
  • PostgreSQL generates an error due to timestamp being out of range. SELECT to_char('0001-01-01'::date - interval '5000 years', 'J');

As such, @timwaters's concern about dates before 0 J is valid, if OHM is expected to work with such dates.

Looking at that LUA code, the year is split off and stored as an integer. That keeps things nice and simple, if we could accept year as the best resolution. Ideally, we want to express start/end dates in a way amenable for filtering down to the day.

Candidate Algorithm: Year as Float

Idea here is to treat years as float/decimal numbers, with the month/day forming the decimal portion.

  • 1999-01-01 would be 1999.000000 as a start_date and 1999.002739 as an end_date, covering a 24-hour period.
  • The 24-hour period of 2000-01-01 would be 2000.000000 through 2000.0027322 since 2000 was a leap year with 366 days
  • BCE years would be counted "backwards" that December is less negative than January, e.g. -473.000 would be December 31 of 473 BCE and -473.99726 would be January 1 473 BCE.

Such float dates would have unequivocal comparisons, and could express geological spans.

Concerns and challenges:

  • Not a known mechanism with established acceptance and implementations. I could only find one implementation, for R, and it is limited to positive ISO dates and internally uses Unix epoch.
  • Creating an implementation sounds not difficult, but is probably more nuanced than it sounds off the cuff:
    • Existing date libraries may be ill-equipped to help with some of this heavy lifting, e.g. tm_yday for a year in 100,000 BCE implies that the underlying library could support this date so as to be able to calculate the number of days in that year... So probably have to roll-our-own Pl/PgSQL implementation of tm_yday
    • Nuances of the Gregorian calendar, e.g. the math for 1999, 2000, and 2100 are slightly different due to the nuances of leap years. Not problems, just nuances to heed when writing it.

References:

https://www.rdocumentation.org/packages/lubridate/versions/1.7.4/topics/decimal_date


danrademacher commented 5 years ago

@gregallensworth Let's go ahead with a try at the Year as Float approach you describe above. Do you want to break this out into separate issues that make most sense for you? I can also give it a try.

gregallensworth commented 5 years ago

  1. Some implementations of both padding out truncated dates and of rendering year-month-day as a decimal number for sorting:
     https://github.com/OpenHistoricalMap/decimaldate-python
     https://github.com/OpenHistoricalMap/DateFunctions-plpgsql
  2. I have updated vector tiles to have new fields start_decdate and end_decdate.
  3. I have updated the MBGL TimeSlider and its demo to use the decimal sorting, and also to support negative dates:
     https://openhistoricalmap.github.io/openhistoricaltiles/mbgl-control-timeslider/demo/#14.600/40.80623/-73.91894/-500,-1000--250
     https://openhistoricalmap.github.io/openhistoricaltiles/leaflet-control-mbgltimeslider/demo/#14.600/40.80623/-73.91894/-750,-1000--500

We could use some more BCE content with proper date formatting to really test this and illustrate this. And this should pave the way for a day-resolution time slider eventually.

danrademacher commented 5 years ago

Here is the area that @daelba mapped to 1830 resolution: https://openhistoricalmap.github.io/openhistoricaltiles/mbgl-control-timeslider/demo/#15.000/49.55012/17.72994/1830,1800-1900

This area, http://www.openhistoricalmap.org/#map=16/49.5476/17.7362&layers=H, includes dozens of features with start_date=1830. Where in the past we had no map elements showing at all, we now have a few. But I would expect many more.

Nearly every item I clicked on here has start_date=1830, except the cathedral, which is start_date=1764.

Given that we're now padding out dates, any thoughts on what would prevent these buildings from showing up?

gregallensworth commented 5 years ago

I see it: the start_decdate and end_decdate are NULL if the date is invalid, and Tessera omits this key entirely from the feature. That is to say: a feature which has no end_date set, will have no end_decdate property at all, blank or otherwise.

I have reworked the logic accordingly.
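The reworked logic itself is not shown in this thread. As a purely illustrative sketch, and on the assumption that a missing start_decdate or end_decdate should be treated as open-ended, a range check tolerant of absent keys could look like this in Python (`visible_in_range` is a hypothetical name):

```python
def visible_in_range(properties: dict, range_start: float, range_end: float) -> bool:
    """Overlap test that treats a missing bound as open-ended (an assumption)."""
    start = properties.get("start_decdate")  # None when the key is absent
    end = properties.get("end_decdate")
    has_started = start is None or start <= range_end
    not_yet_ended = end is None or end >= range_start
    return has_started and not_yet_ended


# A feature with a start date but no end_decdate property at all.
building = {"start_decdate": 1830.0}
print(visible_in_range(building, 1850.0, 1851.0))  # True
print(visible_in_range(building, 1800.0, 1801.0))  # False
```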

https://openhistoricalmap.github.io/openhistoricaltiles/mbgl-control-timeslider/demo/#16.238/49.54809/17.73493/1830,1825-1960

Not a lot of change to see visually, but some. In 1860 you can see the Judische Schule appear, and in 1863 you can see a rather large building appear somewhat far to the west. In 1876 you can see a building and a small bit of park disappear. In 1921 you can see a small park to the west disappear.

danrademacher commented 5 years ago

looks great!
