Open tholzheim opened 2 years ago
For the first steps, the events only need the ordinals, years and lookup acronym.
Proceedings contain headers as well as ordinal, year and lookup acronym data.

Steps:
1) Set of ordinals and years from all sources (set of pairs)
2) Different set conditions can apply
   1) complete: first and last known and in between complete
   2) missing events:
      1) consistent frequency
      2) inconsistent frequency in relation to its neighbors
      3) not computable with supplied data and simple algorithm → do not create ghost events
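The two steps above could be sketched roughly as follows. This is only an illustration: the function names `collect_pairs` and `classify_series` and the returned labels are hypothetical, and "consistent frequency" is assumed to mean that every gap has the same whole number of years per ordinal step.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[int, int]  # (year, ordinal)


def collect_pairs(*sources: Iterable[Pair]) -> Set[Pair]:
    """Step 1: union of the (year, ordinal) pairs from all sources."""
    pairs: Set[Pair] = set()
    for source in sources:
        pairs.update(source)
    return pairs


def classify_series(pairs: Set[Pair]) -> str:
    """Step 2: decide which set condition applies (hypothetical labels)."""
    years_by_ordinal: Dict[int, Set[int]] = {}
    for year, ordinal in pairs:
        years_by_ordinal.setdefault(ordinal, set()).add(year)
    # Fewer than two events, or contradicting years for one ordinal:
    # nothing can be derived safely.
    if len(years_by_ordinal) < 2 or any(len(y) > 1 for y in years_by_ordinal.values()):
        return "not computable"
    ordered = sorted((o, next(iter(y))) for o, y in years_by_ordinal.items())
    # (ordinal difference, year difference) between neighbouring known events
    gaps = [(o2 - o1, y2 - y1) for (o1, y1), (o2, y2) in zip(ordered, ordered[1:])]
    if any(dy <= 0 for _, dy in gaps):
        return "not computable"  # years must grow with the ordinal
    if all(do == 1 for do, _ in gaps):
        return "complete"
    # Same whole number of years per ordinal step in every gap?
    if all(dy % do == 0 for do, dy in gaps) and len({dy // do for do, dy in gaps}) == 1:
        return "consistent frequency"
    return "inconsistent frequency"
```

For series classified as "not computable", no ghost events would be created.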
The first format is Excel.
When done, an OpenRefine mapping would be fine.
With simple algorithms, completing the year and ordinal pairs is only possible for series with a consistent frequency. For records that have an inconsistent frequency, the completion cannot be done with certainty. For example, the AAAI records have a gap between (1999, 16) and (2006, 21); from the ordinals we know that four events are missing. We also have records for the years 2002, 2003, 2004 and 2005, but without an ordinal. Since the time span does not fit a consistent frequency (seven years for five ordinal steps) and we have records for exactly four years, we could assume that these years are the ones we are looking for. But manual validation shows that this is not the case.
Therefore, omitting the year and providing only the ordinal would be an alternative that avoids guessing the year.
year | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 |
---|---|---|---|---|---|---|---|---|
ordinal | 16 | | | | | | | 21 |
Known years without ordinal | | | | ✓ | ✓ | ✓ | ✓ | |
Correct years | | ✓ | | ✓ | | ✓ | ✓ | |
Correct ordinals | 16 | 17 | | 18 | | 19 | 20 | 21 |
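A minimal sketch of the "ordinal only" alternative, using the AAAI gap from the table above. The function name `fill_gap` is hypothetical, and a gap is assumed to have a consistent frequency only when its year span is an exact multiple of its ordinal span; even then the filled-in years remain an estimate.

```python
from typing import List, Optional, Tuple

Event = Tuple[Optional[int], int]  # (year or None, ordinal)


def fill_gap(before: Tuple[int, int], after: Tuple[int, int]) -> List[Event]:
    """Return the events missing between two known (year, ordinal) pairs.

    Years are only filled in when the year span is an exact multiple of
    the ordinal span. Otherwise the year is left as None, so that only
    the ordinal is provided and nothing is guessed.
    """
    (year_a, ord_a), (year_b, ord_b) = before, after
    ordinal_span = ord_b - ord_a
    year_span = year_b - year_a
    if ordinal_span <= 1:
        return []  # no event is missing between the two records
    consistent = year_span > 0 and year_span % ordinal_span == 0
    step = year_span // ordinal_span if consistent else 0
    missing: List[Event] = []
    for i in range(1, ordinal_span):
        year = year_a + i * step if consistent else None
        missing.append((year, ord_a + i))
    return missing


# AAAI: seven years for five ordinal steps, so the years stay unknown.
aaai_missing = fill_gap((1999, 16), (2006, 21))
```

Here `aaai_missing` holds the ordinals 17 to 20 with the year left as `None`, while a regular gap such as `fill_gap((2000, 1), (2006, 4))` would get the years 2002 and 2004 assigned.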
One of the biggest problems for the completion is that multiple ordinals often exist for a single year, even though the event takes place at most once a year. This is mainly due to joint events and the resulting ordinal extraction problem.
Sorting by year and ensuring that the ordinal is monotonically increasing solves this issue for some of the series by ignoring those duplicate/incorrect records.
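A possible sketch of that filter, assuming records are plain (year, ordinal) tuples; the name `keep_monotonic` is hypothetical. It is a greedy pass: it keeps at most one record per year and, among candidates, the first one that increases the ordinal, so a single wrongly high ordinal can still push out later correct records.

```python
from typing import Iterable, List, Tuple


def keep_monotonic(records: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort (year, ordinal) records by year and keep only those that move
    to a later year with a strictly larger ordinal than the last kept one."""
    kept: List[Tuple[int, int]] = []
    for year, ordinal in sorted(set(records)):
        # Skip a second ordinal for the same year and any ordinal that
        # does not increase.
        if kept and (year == kept[-1][0] or ordinal <= kept[-1][1]):
            continue
        kept.append((year, ordinal))
    return kept


# Made-up example: 2001 carries a second, incorrect ordinal 12.
cleaned = keep_monotonic([(2002, 7), (2000, 5), (2001, 12), (2001, 6)])
```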
@tholzheim thx for the analysis. Please create a separate issue for the ordinal problem
Create compatible spreadsheets that can be used by orapi and by the pyOnlineSpreadsheetEditing Google Sheets Wikidata import. There should be sheets for:
1) Event series
2) Event
3) Proceedings
4-n) Each data source
n+1) wikidata metadata mapping
n+2) smw metadata mapping