WolfgangFahl / ConferenceCorpus

ScientificEventCorpus
Apache License 2.0

Spreadsheet Format for Event Series #50

Open tholzheim opened 2 years ago

tholzheim commented 2 years ago

Create compatible spreadsheets that can be used by orapi and by pyOnlineSpreadsheetEditing google sheet wikidata import. There should be sheets for:
1) Event series
2) Event
3) Proceedings
4-n) Each data source
n+1) wikidata metadata mapping
n+2) smw metadata mapping
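To make the sheet order concrete, here is a minimal sketch that builds the list of sheet names for a given set of data sources. The function name `sheet_names` and the source names in the usage line are placeholders for illustration, not names taken from orapi or pyOnlineSpreadsheetEditing.

```python
def sheet_names(data_sources):
    """Return the sheet names in the order proposed in this issue."""
    fixed = ["Event series", "Event", "Proceedings"]                   # sheets 1-3
    mappings = ["wikidata metadata mapping", "smw metadata mapping"]   # sheets n+1, n+2
    # One sheet per data source goes between the fixed sheets and the mappings.
    return fixed + list(data_sources) + mappings


# Placeholder source names, for illustration only.
names = sheet_names(["source_a", "source_b"])
```

With two data sources this yields seven sheets, with the two mapping sheets always last.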

tholzheim commented 2 years ago

For the events, the first steps only need the ordinals, years and lookup acronym.

The Proceedings sheet contains headers plus the ordinal, year and lookup acronym data. Steps:
1) Set of ordinals and years from all sources (set of pairs)
2) Different set conditions can apply
   1) complete: first and last known and in between complete
   2) missing events:
      1) consistent frequency
      2) inconsistent frequency in relation to its neighbors
      3) not computable with supplied data and simple algorithm → do not create ghost events
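The steps above can be sketched roughly as follows. This is only one possible reading: the record layout (dicts with `ordinal` and `year` keys), the function names and the exact rules are assumptions. In particular, "complete" is taken to mean no gap between the smallest and largest known ordinal, and "consistent" to mean the same whole number of years per ordinal step between all neighboring known events.

```python
from fractions import Fraction


def collect_pairs(sources):
    """Step 1: union of (ordinal, year) pairs over all sources.

    `sources` maps a source name to a list of record dicts (assumed layout).
    Records lacking an ordinal or a year are skipped.
    """
    pairs = set()
    for records in sources.values():
        for record in records:
            ordinal, year = record.get("ordinal"), record.get("year")
            if ordinal is not None and year is not None:
                pairs.add((ordinal, year))
    return pairs


def classify(pairs):
    """Step 2: return 'complete', 'consistent', 'inconsistent' or 'not computable'."""
    years_by_ordinal = {}
    for ordinal, year in pairs:
        years_by_ordinal.setdefault(ordinal, set()).add(year)
    # Fewer than two known events, or conflicting years for one ordinal:
    # nothing can be derived safely, so no ghost events should be created.
    if len(years_by_ordinal) < 2 or any(len(ys) > 1 for ys in years_by_ordinal.values()):
        return "not computable"
    known = sorted((o, next(iter(ys))) for o, ys in years_by_ordinal.items())
    ordinals = [o for o, _ in known]
    if ordinals == list(range(ordinals[0], ordinals[-1] + 1)):
        return "complete"
    # Years per ordinal step between neighboring known events.
    rates = {Fraction(y2 - y1, o2 - o1) for (o1, y1), (o2, y2) in zip(known, known[1:])}
    if len(rates) == 1:
        rate = next(iter(rates))
        if rate.denominator == 1 and rate >= 1:
            return "consistent"
    return "inconsistent"


# Hypothetical source names; one record has no ordinal and is ignored.
sources = {
    "source_a": [{"ordinal": 16, "year": 1999}, {"year": 2002}],
    "source_b": [{"ordinal": 21, "year": 2006}, {"ordinal": 16, "year": 1999}],
}
status = classify(collect_pairs(sources))
```

Only the "consistent" case would be a candidate for automatic completion; the other two gap cases are better left for manual review.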

tholzheim commented 2 years ago

The first format is Excel.

tholzheim commented 2 years ago

When that is done, an OpenRefine mapping would be fine.

tholzheim commented 2 years ago

With simple algorithms, completing the year and ordinal pairs is only possible for series with a consistent frequency. For records that have an inconsistent frequency, the completion cannot be done with certainty. For example, the AAAI records have a gap between (1999,16) and (2006,21); from the ordinals we know that 4 events are missing. We also have records for the years 2002, 2003, 2004 and 2005, but without an ordinal. Since the time span (seven years for five ordinal steps) does not fit a consistent frequency, and we have records for exactly four years in the gap, we could assume that these years are the ones we are looking for. But manual validation shows that this is not the case.

Therefore, omitting the year and providing only the ordinal would be an alternative that avoids the guesswork on the year.

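| year | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 |
|---|---|---|---|---|---|---|---|---|
| ordinal | 16 | | | | | | | 21 |
| Known years without Ordinal | | | | x | x | x | x | |

Correct years
Correct ordinals 16 17 18 19 20 21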
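The ordinal-only alternative could look like the following sketch. The helper name `fill_gap` and the rule it applies are assumptions: a year is only assigned when the year span divides evenly by the number of ordinal steps, otherwise the missing events get their ordinal and no year.

```python
def fill_gap(left, right):
    """Return (ordinal, year) pairs for the events strictly between two known events.

    `left` and `right` are (ordinal, year) pairs. The year is None whenever the
    gap does not correspond to a whole number of years per ordinal step.
    """
    (o1, y1), (o2, y2) = left, right
    steps = o2 - o1
    if steps <= 0:
        raise ValueError("right ordinal must be greater than left ordinal")
    span = y2 - y1
    # Only trust the year when every step covers the same whole number of years.
    step_years = span // steps if span > 0 and span % steps == 0 else None
    missing = []
    for ordinal in range(o1 + 1, o2):
        year = None if step_years is None else y1 + (ordinal - o1) * step_years
        missing.append((ordinal, year))
    return missing


# AAAI gap from the comment above: 7 years for 5 ordinal steps.
aaai_missing = fill_gap((16, 1999), (21, 2006))
# -> [(17, None), (18, None), (19, None), (20, None)]
```

For the AAAI gap this yields the four missing ordinals without guessing their years, which matches the outcome argued for above.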

Multiple Ordinals per Year

One of the biggest problems for the completion is that a year often has multiple ordinals even though the event takes place only once in that year. This is mainly due to joint events and the resulting ordinal extraction problem.

Sorting by year and ensuring that the ordinal is monotonically increasing solves this issue for some of the series by ignoring those duplicate/incorrect records.
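A minimal sketch of that filter, assuming the records are plain (year, ordinal) tuples; the function name is made up for illustration. It is a greedy pass, so a wrongly extracted ordinal that is too high would survive and shadow the correct records after it, which fits the observation that this only helps for some of the series.

```python
def keep_monotonic(records):
    """Sort (year, ordinal) records by year and keep only strictly increasing ordinals."""
    kept = []
    # Tuples sort by year first, then by ordinal within the same year.
    for year, ordinal in sorted(records):
        if kept and ordinal <= kept[-1][1]:
            continue  # ordinal does not increase: treat as duplicate/incorrect
        kept.append((year, ordinal))
    return kept


# Made-up series: 2010 carries an extra ordinal 3, e.g. from a joint event.
records = [(2011, 11), (2010, 3), (2008, 8), (2010, 10), (2009, 9)]
cleaned = keep_monotonic(records)
# -> [(2008, 8), (2009, 9), (2010, 10), (2011, 11)]
```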

WolfgangFahl commented 2 years ago

@tholzheim thx for the analysis. Please create a separate issue for the ordinal problem