ProfessionalWiki / WikibaseEdtf

Wikibase extension that adds support for the Extended Date/Time Format (EDTF) via a new data type
https://wikibase.consulting/wikibase-edtf
GNU General Public License v2.0

Make dates available for SPARQL #3

Open mzeinstra opened 3 years ago

mzeinstra commented 3 years ago

I don't see this as an open ticket yet.

We discussed that the MVP would be to expose the lowest date of an EDTF value to SPARQL in Wikibase. If possible, this could be both the lowest and the highest values.

This is to make the current operators on dates available in SPARQL.

JeroenDeDauw commented 3 years ago

2021-02-12 status: high priority

JeroenDeDauw commented 3 years ago

@mzeinstra what to do for seasons? Start month? End month? Middle month? Both start and end month?

mzeinstra commented 3 years ago

Just off the top of my head: wouldn't the first day and last day of the season work?


JeroenDeDauw commented 3 years ago

Implementation-wise we can make it all happen. Use-case-wise, the first and last day of a season work for some cases but not for others. Some examples where it does not work:

I also suspect that using start and end month is better than using the days. The dates also come with precision, and month precision is closer to season than day precision.

If we go with multiple values (i.e. start and end month), then perhaps it might even make sense to include all months that are part of the season?

JeroenDeDauw commented 3 years ago

Another example of a use case that gets messed up, this time applicable to intervals, when using start and end time:

Imagine having interval 1900-1999 (20th century). If you do a query finding times between 1890 and 1910, you will find the item via the starting time. But if you query for all times between 1910 and 1930, you will not find the item at all. It is not clear to me how to solve that, and it might not be possible without adding features to Blazegraph. And I can imagine, that for the people running the queries, it might be best if intervals (and possibly seasons and sets) are skipped. It is possible we make queries less usable for them by including these. Hard to tell.
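To make the interval problem concrete, here is a minimal Python sketch. It is not part of the extension and all names in it are hypothetical. It compares matching on the exposed start and end points with a true interval-overlap test, which is the kind of check a point-based date filter does not perform on its own.

```python
from datetime import date

def point_match(points, query_start, query_end):
    """True if any exposed point value falls inside the query window."""
    return any(query_start <= p <= query_end for p in points)

def interval_overlaps(start, end, query_start, query_end):
    """True if [start, end] and [query_start, query_end] share at least one day."""
    return start <= query_end and query_start <= end

# The 20th-century interval from the example, exposed only via its endpoints.
century_start, century_end = date(1900, 1, 1), date(1999, 12, 31)
exposed = [century_start, century_end]

# Window 1890..1910 contains the start point, so the item is found.
found_early = point_match(exposed, date(1890, 1, 1), date(1910, 12, 31))
# Window 1910..1930 contains neither endpoint, so the item is missed...
found_mid = point_match(exposed, date(1910, 1, 1), date(1930, 12, 31))
# ...even though the interval itself clearly overlaps that window.
overlaps_mid = interval_overlaps(
    century_start, century_end, date(1910, 1, 1), date(1930, 12, 31)
)

print(found_early, found_mid, overlaps_mid)  # True False True
```

An overlap test needs both endpoints of the same value in one comparison, which is why exposing start and end as two independent time values does not cover this case.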

mzeinstra commented 3 years ago

I agree on adding the months in a season, that seems to be the best way forward.

For intervals we could add all years in the sequence, but that might not be the best solution here. Your proposal is to not expose intervals at all?

I assume we will also expose the 'raw' EDTF string as well?

JeroenDeDauw commented 3 years ago

> Your proposal is to not expose intervals at all?

Yes. That could be added without all the guessing once a concrete usecase materializes, which might well be never.

> I assume we will also expose the 'raw' EDTF string as well?

At the moment not. The EDTF is being translated into standard Wikibase time values, so we get maximum compatibility with tools. I am not sure what the implications of also exposing it in RDF as a string are, and am worried that having some values for P123 be a date and some be a string is a big no-no. So this would need some investigation, which will either take a lot longer than the technical work itself, or uncover the need for a bunch of extra technical work.

mzeinstra commented 3 years ago

I agree, that is a good start within the limited time.

I am afraid that not exposing the EDTF-as-string might also close the route of exporting EDTF data from the platform. Would that be true, or would it only be the case for SPARQL?

JeroenDeDauw commented 3 years ago

I was just talking about SPARQL and RDF. The standard MediaWiki and Wikibase export mechanisms contain the string version. Example of entity JSON with some EDTF strings in it via the web API: http://edtf.wikibase.wiki/w/api.php?action=wbgetentities&ids=P1

So not having the string in SPARQL or RDF does not create an export issue. Indeed, the RDF does not contain the entirety of the native Wikibase time values.

mzeinstra commented 3 years ago

Ah OK, then that use case doesn't exist anymore, thanks.

JeroenDeDauw commented 3 years ago

Implementation should be done. Now we will test if the SPARQL queries actually work.

Example time-based query on Wikidata: https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fdate_of_birth%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ146.%0A%20%20%20%20%20%20%20%20%3Fitem%20wdt%3AP569%20%3Fdate_of_birth.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%20%20FILTER%28%28%3Fdate_of_birth%20%3E%20%222015-01-01%22%5E%5Exsd%3AdateTime%29%20%26%26%20%28%3Fdate_of_birth%20%3C%20%222021-06-01%22%5E%5Exsd%3AdateTime%29%29%0A%7D%0AORDER%20BY%20DESC%28%3Fdate_of_birth%29

SELECT ?item ?itemLabel ?date_of_birth WHERE {
  ?item wdt:P31 wd:Q146.
  ?item wdt:P569 ?date_of_birth.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  FILTER((?date_of_birth > "2015-01-01"^^xsd:dateTime) && (?date_of_birth < "2021-06-01"^^xsd:dateTime))
}
ORDER BY DESC(?date_of_birth)

JeroenDeDauw commented 3 years ago

Working as expected \o/

JeroenDeDauw commented 3 years ago

I think we can close this task and open more specific tickets if some further tweaks are needed.

mzeinstra commented 3 years ago

Agreed, we will test this and get back to you with specific tickets.

mzeinstra commented 3 years ago

@JeroenDeDauw can we use the environment that you set up to test this? http://edtf.wikibase.wiki/wiki/Property:P1 I don't see the query service available there.

JeroenDeDauw commented 3 years ago

I will update the demo instance later today so you can test queries there, and ping you once that has happened.

JeroenDeDauw commented 3 years ago

Query service now available on demo instance: http://edtf.wikibase.wiki:8282/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fedtf_date%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20%3Fedtf_date.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D%0AORDER%20BY%20DESC%28%3Fedtf_date%29

andrawaag commented 3 years ago

I was pointed to this by @mzeinstra. Not having the EDTF representation in Blazegraph might be an issue for downstream pipelines that ingest from or export to CIDOC-CRM data values. CIDOC-CRM is RDF. I am wondering whether this issue could be resolved by transforming EDTF dates into Wikibase's native time values within Wikibase itself, while keeping the EDTF representation as a qualifier on those statements. This would look something like this:

https://safsandbox1.wiki.opencura.com/wiki/Item:Q1 (This example is on wbstack, which does not have the EDTF extension, so I selected string as the data type.)

Wouldn't using a qualifier preserve the integrity of the EDTF value in the RDF representation as well?

mzeinstra commented 3 years ago

@andrawaag I thought you suggested creating a 'hidden' qualifier to contain the string and not the other way around, right?

andrawaag commented 3 years ago

No, I would not hide this. My point is that conceptually Wikibase consists of two redundant data layers, a relational model and an RDF model. We should not remove this redundancy. The RDF layer is crucial for information retrieval, since querying Wikibase through the API is suboptimal. It is not possible to query Wikibase on both strings and statements. Here the WBQS is key. If there is a discrepancy between the two models, information retrieval will become difficult.

There is indeed a difficulty with EDTF if one would like to do sorting, especially if the model captures only the string representation. If the solution is transforming the EDTF time string to an xsd:dateTime value, I would do that in both layers, and the suggestion I made is one possibility.

But I would not hide that. On the contrary, that would lead to downstream confusion.

mzeinstra commented 3 years ago

@JeroenDeDauw I'll have a further discussion with @andrawaag on this.

In the meantime, I was wondering if it is possible to do something like this with Blazegraph:

PersonX birthDate "~2021-XX-05?"^^xsd:string .
PersonX birthDate "2021"^^xsd:dateTime .

That way we could move the responsibility of searching through the dates to the person creating the query.

JeroenDeDauw commented 3 years ago

I am not familiar with RDF or Blazegraph, so can't tell what is appropriate or what will work without prior investigation.

What I do know is what Wikibase outputs as RDF for dates:

<rdf:Description rdf:about="http://edtf.wikibase.wiki/value/103522c70e98676f031e23d4ed0c5ea6">
  <rdf:type rdf:resource="http://wikiba.se/ontology#TimeValue"/>
  <wikibase:timeValue rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2006-05-01T00:00:00Z</wikibase:timeValue>
  <wikibase:timePrecision rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">10</wikibase:timePrecision>
  <wikibase:timeTimezone rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</wikibase:timeTimezone>
  <wikibase:timeCalendarModel rdf:resource="http://www.wikidata.org/entity/Q1985727"/>
</rdf:Description>

See the bottom of: view-source:http://edtf.wikibase.wiki/wiki/Special:EntityData/Q1.rdf

This includes the time value, precision, timezone, and calendar model.

It does not include before or after, both part of standard Wikibase time values. So you cannot get access to the full data via RDF or SPARQL in stock Wikibase.
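As a rough illustration of what a consumer can read from that description, here is a minimal Python sketch using only the standard library. The `rdf:RDF` wrapper with its namespace declarations is an assumption added so the fragment parses on its own, and `read_time_values` is a hypothetical helper, not something Wikibase provides.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WB_NS = "http://wikiba.se/ontology#"

# The description quoted above, wrapped in an assumed rdf:RDF root element.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:wikibase="{WB_NS}">
<rdf:Description rdf:about="http://edtf.wikibase.wiki/value/103522c70e98676f031e23d4ed0c5ea6">
  <rdf:type rdf:resource="http://wikiba.se/ontology#TimeValue"/>
  <wikibase:timeValue rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2006-05-01T00:00:00Z</wikibase:timeValue>
  <wikibase:timePrecision rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">10</wikibase:timePrecision>
  <wikibase:timeTimezone rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</wikibase:timeTimezone>
  <wikibase:timeCalendarModel rdf:resource="http://www.wikidata.org/entity/Q1985727"/>
</rdf:Description>
</rdf:RDF>"""

def read_time_values(rdf_xml):
    """Collect the fields of every description that carries a wikibase:timeValue."""
    root = ET.fromstring(rdf_xml)
    values = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        time_value = desc.find(f"{{{WB_NS}}}timeValue")
        if time_value is None:
            continue  # not a time value node
        calendar = desc.find(f"{{{WB_NS}}}timeCalendarModel")
        values.append({
            "time": time_value.text,
            "precision": int(desc.findtext(f"{{{WB_NS}}}timePrecision")),
            "timezone": int(desc.findtext(f"{{{WB_NS}}}timeTimezone")),
            "calendar": calendar.get(f"{{{RDF_NS}}}resource"),
        })
    return values

print(read_time_values(SAMPLE)[0]["precision"])  # 10, i.e. month precision
```

Only these four fields are recoverable from the sample, which matches the point that before and after are not available this way.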

I suspect we can add more fields to the above RDF Description without breaking the query service. So we could add the EDTF string as such. I am not sure it is "correct" to do this from an RDF perspective. And I am unsure to which degree the information will be queryable via SPARQL.

So to change things here, I either need a specification of what the desired RDF output is, or I first need to investigate these topics more so I can make an informed recommendation.

mzeinstra commented 3 years ago

I had a discussion with Andra on this functionality.

To be able to have the proper functionality for export in RDF and for presentation in SPARQL we that you StringValue aftere you add the TimeValues here: https://github.com/ProfessionalWiki/WikibaseEdtf/blob/1d9f772f60e981fb630ffe1b83604a9724b47bf7/src/Services/RdfBuilder.php#L31

As you say, it will most likely not break Blazegraph, and it will help us present the EDTF string in SPARQL as well as use the TTL and RDF export possibility in e.g. http://edtf.wikibase.wiki/wiki/Special:EntityData/Q2.ttl

Would that work @JeroenDeDauw ?

After this export I will ask @andrawaag and Jose Labra to verify if that is working as expected.

JeroenDeDauw commented 3 years ago

> To be able to have the proper functionality for export in RDF and for presentation in SPARQL we that you StringValue aftere you add the TimeValues here

huh?

mzeinstra commented 3 years ago

Do you want to have a call on this today? e.g. at 16:00?

JeroenDeDauw commented 3 years ago

I sent you an invite.

JeroenDeDauw commented 3 years ago

The RDF now also contains the plain EDTF as a string.

image

The above item results in: https://pastebin.com/gGiRPGAb. (Search for [2022] to find the plain EDTF)

(Not deployed on demo system yet)

mzeinstra commented 3 years ago

I've asked Andra and Jose if this works for their use cases as well. Could you make this available on the demo system, so we can test SPARQL as well?

JeroenDeDauw commented 3 years ago

> Could you make this available on the demo system?

Done

mzeinstra commented 3 years ago

Interesting.

I see that it appears in the ttl files e.g. (http://edtf.wikibase.wiki/wiki/Special:EntityData/P1.ttl)

wdt:P1 "[2020, 2021, 2022, 2023, 2024]"^^xsd:edtf,
        "2020-01-01T00:00:00Z"^^xsd:dateTime,
        "2021-01-01T00:00:00Z"^^xsd:dateTime,
        "2022-01-01T00:00:00Z"^^xsd:dateTime,
        "2023-01-01T00:00:00Z"^^xsd:dateTime,
        "2024-01-01T00:00:00Z"^^xsd:dateTime,

But then I expect the following to work too, right? @andrawaag

SELECT ?item ?itemLabel ?edtf_date WHERE {
  ?item wdt:P1 "[2020, 2021, 2022, 2023, 2024]"^^xsd:edtf.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ASC(?edtf_date)

andrawaag commented 3 years ago

Jose and I reviewed the TTL file at http://edtf.wikibase.wiki/wiki/Special:EntityData/Q1.ttl. Adding the EDTF value as a string to the RDF representation allows round-tripping, which is crucial to maintain data integrity inside Wikibase.

We do have some concerns though regarding the transformation from edtf to xsd:datetime. In the current implementation, edtf is stated as xsd:edtf, which is incorrect. EDTF is not part of XSD. Can this be changed to the applicable namespace? e.g. (https://id.loc.gov/datatypes/edtf.html)

Would it be possible to document the rules that are used to transform between EDTF and xsd:dateTime? For example, in the above-cited RDF representation of Q1 we see:

wdt:P1 "2006-24"^^xsd:edtf,
        "2006-12-01T00:00:00Z"^^xsd:dateTime,
        "2006-01-01T00:00:00Z"^^xsd:dateTime,
        "2006-02-01T00:00:00Z"^^xsd:dateTime,

2006-24 is to represent winter of 2006. This seems to be transformed to January, February and December 2006. Is that correct, because one could argue that it actually is 2005-12, 2006-1, 2006-2 or 2006-12, 2007-1, 2007-2. If you are from e.g. Australia or Chile, that might be 2006-7, 2006-8, 2006-9.
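The mapping observed in that output can be sketched as follows. This is a hypothetical reconstruction in Python, not the extension's actual rule set: the codes 21 to 24 are the EDTF season codes, but only the winter row is confirmed by the Q1 output, and the other month lists are assumed to follow northern-hemisphere meteorological seasons.

```python
# EDTF season codes: 21 spring, 22 summer, 23 autumn, 24 winter.
# ASSUMPTION: month lists follow northern-hemisphere meteorological seasons.
# Only the winter row (12, 1, 2) is confirmed by the Q1 output.
SEASON_MONTHS = {
    21: [3, 4, 5],
    22: [6, 7, 8],
    23: [9, 10, 11],
    24: [12, 1, 2],
}

def season_to_months(edtf):
    """Expand a value like '2006-24' into one month-start timestamp per month.

    All months are placed in the stated year, which is what the Q1 output shows.
    """
    year_text, code_text = edtf.split("-")
    year, code = int(year_text), int(code_text)
    if code not in SEASON_MONTHS:
        raise ValueError(f"not a season code: {code}")
    return [f"{year:04d}-{month:02d}-01T00:00:00Z" for month in SEASON_MONTHS[code]]

print(season_to_months("2006-24"))
# ['2006-12-01T00:00:00Z', '2006-01-01T00:00:00Z', '2006-02-01T00:00:00Z']
```

The alternatives raised here (December of the previous year, or southern-hemisphere months) would each need a different table or a year offset, which is why the rule should be documented explicitly.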

JeroenDeDauw commented 3 years ago

> We do have some concerns though regarding the transformation from edtf to xsd:datetime. In the current implementation, edtf is stated as xsd:edtf, which is incorrect. EDTF is not part of XSD. Can this be changed to the applicable namespace? e.g. (https://id.loc.gov/datatypes/edtf.html)

https://github.com/ProfessionalWiki/WikibaseEdtf/issues/13

GreenReaper commented 3 years ago

> Your proposal is to not expose intervals at all?

> Yes. That could be added without all the guessing once a concrete usecase materializes, which might well be never.

It is true that for some it could be problematic, but for the day within a month or month within a year cases it's useful to have at least one date. Here's my use case:

I want a database of multi-day events and instances of those so I can show them on maps, lists, etc. I might also want to transfer this data into metadata for other consumers, e.g. JSON-LD Event.

In some of these cases, I want to retrieve the most recent or upcoming instance of an annual event, which would be determined via a SPARQL query.

Previously I'd intended to use start time statements with end time qualifiers to allow for the possibility of cancellation or rescheduling, which I also want to record and (in some cases) show.

The new type might be better for this, because the period/interval itself is the subject of a single claim.

However, without a datetime value in SPARQL I'd likely have to do my own sorting through items and parsing of dates (or, possibly, pass in a lot of matching text-mode filters) to get the right one.

This could include uncertain dates, e.g. "Eurofurence (likely August 2022)". I'd normally consider this to fall on the start of that month, or - less preferred, but maybe nice to have as well - the end of the month.

Presumably it'd use day precision per the example above, or less for month or year intervals. (Failing that, given the distribution of time zones, it might be best to use ~11:00 UTC on the day rather than midnight, to avoid being in a different day in some locations.)
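A minimal Python sketch of that idea, assuming a month-level value such as "2022-08" with an optional trailing EDTF qualifier. The `month_bounds` helper and its `hour` parameter are hypothetical and not part of the extension. It returns the first and last day of the month in UTC, optionally at 11:00 rather than midnight.

```python
import calendar
from datetime import datetime, timezone

def month_bounds(edtf_month, hour=0):
    """Return (first_day, last_day) as UTC datetimes for '2022-08' or '2022-08?'."""
    # Drop trailing EDTF qualifiers: ? uncertain, ~ approximate, % both.
    text = edtf_month.rstrip("?~%")
    year, month = (int(part) for part in text.split("-"))
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    first = datetime(year, month, 1, hour, tzinfo=timezone.utc)
    last = datetime(year, month, last_day, hour, tzinfo=timezone.utc)
    return first, last

first, last = month_bounds("2022-08?", hour=11)
print(first.isoformat())  # 2022-08-01T11:00:00+00:00
print(last.isoformat())   # 2022-08-31T11:00:00+00:00
```

Sorting on the first value gives the preferred start-of-month behaviour, and the second value is there for the less preferred end-of-month case.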