KorAP / Krill

A Corpus Data Retrieval Index using Lucene for Look-Ups
BSD 2-Clause "Simplified" License

Date ranges and date additions in I5:<creatDate> #17

Open luengen opened 8 years ago

luengen commented 8 years ago

We need a way to specify and process date ranges or date additions in the I5:<creatDate> field. The need arises, for example, from the specification of the BOT-ent field (the predecessor of <creatDate>) in the BOT Manual by Doris al-Wadi (p. 22f.), from the use of a date range in at least one corpus in DeReKo, and from the use in the Sprache 1933-1945 project. This concerns the <creatDate> of collections, but also of single texts.

luengen commented 7 years ago

There will be a new corpus that also uses date ranges in creatDate: the Digitale Bibliothek-Korpus, currently being converted by Stefan Pernes.

luengen commented 7 years ago

Nils (Akron) said the envisaged specification will only concern virtual corpus building using creatDate (i.e. not search or anything). Follow-up (HL): But at least in C2 there is a view by year, decade etc., where date ranges would be relevant as well.

Akron commented 7 years ago

Krill (or rather KorAP::XML::Krill) currently deals with date ranges in the same way as C2, with the exception that we support date granularity (I don't know if this is supported by C2). That means we index "1893/06" as "18930600". Whenever a virtual corpus requests all documents of June 1893, this document will be returned as well, but whenever a specific date range inside that month is requested (like "1893/06/04 - 1893/06/07"), the document is not returned. It is returned, however, when the beginning of the month is part of the requested range (like "1893/05/04 - 1893/06/07"). So when it comes to dealing with date ranges, the challenge is not only of a technical nature; it also concerns the expected behaviour when creating virtual corpora.
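To make the behaviour above concrete, here is a minimal sketch in Python. It is not Krill's actual code: the helper names `to_key` and `matches` are hypothetical, and padding the upper bound of a query with "99" is an assumption made only so that "all documents of June 1893" can be expressed as a numeric range.

```python
def to_key(date: str, pad: str = "00") -> int:
    """Turn 'yyyy', 'yyyy/mm' or 'yyyy/mm/dd' into an 8-digit integer key.

    Missing month/day parts are filled with `pad`, so '1893/06'
    becomes 18930600 with the default padding.
    """
    parts = date.split("/")
    parts += [pad] * (3 - len(parts))
    return int("".join(parts))


def matches(doc_date: str, query_from: str, query_to: str) -> bool:
    """Check whether the indexed key of a document falls into a query range.

    Assumption: the upper bound of the query is padded with '99' so that
    a coarse bound like '1893/06' covers the whole month.
    """
    key = to_key(doc_date)
    return to_key(query_from) <= key <= to_key(query_to, pad="99")


# A document dated "1893/06" is indexed as 18930600.
print(matches("1893/06", "1893/06", "1893/06"))        # whole month
print(matches("1893/06", "1893/06/04", "1893/06/07"))  # range inside the month
print(matches("1893/06", "1893/05/04", "1893/06/07"))  # range covering the 1st
```

The second call shows the problematic case: the document is known to be from June 1893, but a query for a few days within that month misses it because only the "00" key is indexed.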

Storing multiple dates is, I would say, possible without many changes to the code. The same is true for ranges as described above. When a range is given like "1893.06.-08.", we could index multiple range dates like "18930600", "18930700", "18930800". That would work as described above. However, if someone currently searches for all documents created on "1893/06/05", this document would not be returned. If something like "1893.06.04-1893.07.05" is given, we would need to index "18930604", "18930605", "18930606", ..., "18930704", "18930705" ... I don't know if this would be sufficient.
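A rough sketch of the expansion described above, again in Python and not taken from Krill. The function names `month_keys` and `day_keys` are hypothetical, and parsing of the "1893.06.-08." notation itself is left out; the functions take already parsed values.

```python
from datetime import date, timedelta


def month_keys(year: int, first_month: int, last_month: int) -> list[str]:
    """Expand a month range into one 'yyyymm00' key per month."""
    return [f"{year:04d}{month:02d}00"
            for month in range(first_month, last_month + 1)]


def day_keys(start: date, end: date) -> list[str]:
    """Expand a day range into one 'yyyymmdd' key per day (inclusive)."""
    days = (end - start).days
    keys = []
    for offset in range(days + 1):
        d = start + timedelta(days=offset)
        keys.append(f"{d.year:04d}{d.month:02d}{d.day:02d}")
    return keys


# "1893.06.-08." expands to three month-level keys.
months = month_keys(1893, 6, 8)
print(months)

# A search for the exact day 1893/06/05 finds none of these keys.
print("18930605" in months)

# "1893.06.04-1893.07.05" expands to one key per day.
days = day_keys(date(1893, 6, 4), date(1893, 7, 5))
print(len(days), days[0], days[-1])
```

The number of indexed keys grows with the length of the range, which is one reason the per-day expansion may not scale to ranges spanning years.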

P.S. An alternative way of storing ranges may be by storing dates as geometric objects and checking for collisions ... but I have no idea how this works in Lucene.

Akron commented 7 years ago

It seems to be possible to use DateRangeField for this particular task.

luengen commented 7 years ago

Date ranges may also offer a solution for texts which have no known or reconstructible creation date or year. Some such texts are now in DeReKo, e.g. chat logs with clearly false dates, or again texts from the Digital Library.

Akron commented 7 years ago

Would you somehow mark this information as "unclear" or "estimated"? Otherwise it may be confusing to mix this date information with other interpretations of time ranges.

luengen commented 7 years ago

true

Akron commented 6 years ago

After spending more time investigating spatial data structures I came to the conclusion that they may not work well in virtual corpora. So I guess we need to model this data in a more conservative way.

Akron commented 6 years ago

I now have a working example implementation in Krawfish Prototype that is an extension of Lucene's legacy DateRangeQuery (see Schindler & Diepenbroek 2008). Date ranges can be stored in yyyy(-mm(-dd))--yyyy(-mm(-dd)) format. Granularity here means that the unspecified span below the given precision is completely covered, e.g. 2017-01--2018-03 includes everything from the first of January 2017 to the last day of March 2018. This differs from the current implementation with the 00 days in Krill. By default, documents in a virtual corpus will be returned based on the intersection of date ranges. Example: when a document A is stored with the date range 2017-02--2017-04, a document B is stored with the date range 2017-03-14--2017-03-20, a document C is stored with the date 2017, and the query is pubDate>=2017-03 & pubDate<=2017-04, all documents will be returned, as the date range of the query overlaps with the date ranges of all documents.

This can be expanded to support multiple date ranges, both per document and within queries, as well as queries that exclude all documents with ranges that overlap the queried date range.
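The default overlap behaviour described above can be sketched roughly as follows. This is an illustration in Python, not the Krawfish implementation: the names `expand`, `parse_range` and `overlaps` are hypothetical, and a range is simply represented as a pair of calendar dates rather than as indexed terms.

```python
import calendar
from datetime import date


def expand(partial: str, end: bool = False) -> date:
    """Expand 'yyyy', 'yyyy-mm' or 'yyyy-mm-dd' to a full date.

    Missing parts are filled with the earliest possible value, or with
    the latest possible value when `end` is true.
    """
    parts = [int(p) for p in partial.split("-")]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else (12 if end else 1)
    if len(parts) > 2:
        day = parts[2]
    else:
        day = calendar.monthrange(year, month)[1] if end else 1
    return date(year, month, day)


def parse_range(text: str) -> tuple[date, date]:
    """Parse 'yyyy(-mm(-dd))--yyyy(-mm(-dd))' or a single partial date."""
    first, _, last = text.partition("--")
    return expand(first), expand(last or first, end=True)


def overlaps(a: tuple[date, date], b: tuple[date, date]) -> bool:
    """Two closed ranges overlap if each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


# pubDate>=2017-03 & pubDate<=2017-04, read as one query range
query = parse_range("2017-03--2017-04")
documents = {
    "A": "2017-02--2017-04",
    "B": "2017-03-14--2017-03-20",
    "C": "2017",
}
for name, stored in documents.items():
    print(name, overlaps(parse_range(stored), query))
```

With these semantics all three documents match the query, whereas a document stored as 2017-05 would not.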