WolfgangFahl / ConferenceCorpus

ScientificEventCorpus
Apache License 2.0

Properties of Datasources are not aligned #46

Open tholzheim opened 2 years ago

tholzheim commented 2 years ago

The event and eventseries views are generated from the set of properties common to all datasources. Unfortunately, the only property present in every event datasource table is source, and the eventseries tables have no common property at all. As a result, the generated views break some of the existing tests, e.g. those that rely on eventId being a column of the event view.

Tested with:

    def testGetCommonViewDDLs(self):
        '''
        tests getCommonViewDDLs
        '''
        viewDDLs = EventStorage.getCommonViewDDLs()
        print("Generated view", viewDDLs)
        for viewName in "event", "eventseries":
            viewTableList = EventStorage.getViewTableList(viewName, exclude=None)
            props = [set([cr["name"] for cr in record["columns"]]) for record in viewTableList]
            propsInAll = set.intersection(*props)
            print(f"Shared properties of {viewName}:{propsInAll}")

Returns

Generated view {'event': 'CREATE VIEW event AS \n  SELECT source FROM event_confref\nUNION\n  SELECT source FROM event_wikidata\nUNION\n  SELECT source FROM event_dblp\nUNION\n  SELECT source FROM event_ceurws\nUNION\n  SELECT source FROM event_acm\nUNION\n  SELECT source FROM event_or\nUNION\n  SELECT source FROM event_orbackup\nUNION\n  SELECT source FROM event_orclonebackup\nUNION\n  SELECT source FROM event_orclone\nUNION\n  SELECT source FROM event_crossref\nUNION\n  SELECT source FROM event_wikicfp\nUNION\n  SELECT source FROM event_gnd', 'eventseries': 'CREATE VIEW eventseries AS \n  SELECT  FROM eventseries_confref\nUNION\n  SELECT  FROM eventseries_dblp\nUNION\n  SELECT  FROM eventseries_wikidata\nUNION\n  SELECT  FROM eventseries_gnd\nUNION\n  SELECT  FROM eventseries_crossref\nUNION\n  SELECT  FROM eventseries_acm\nUNION\n  SELECT  FROM eventseries_wikicfp\nUNION\n  SELECT  FROM eventseries_or\nUNION\n  SELECT  FROM eventseries_orbackup\nUNION\n  SELECT  FROM eventseries_orclone\nUNION\n  SELECT  FROM eventseries_orclonebackup'}
Shared properties of event:{'source'}
Shared properties of eventseries:set()

For example, the problem occurs in the following test (if run against the generated view): https://github.com/WolfgangFahl/ConferenceCorpus/blob/531122fd4ae15f84772d44c2116794b7ea01740d/tests/testCorpusLookup.py#L88 The MultiQuery on the event view uses source and eventId, but eventId is not a column of the generated view.
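
The empty SELECT list in the eventseries DDL above follows directly from the empty intersection of column sets. The following is a minimal sketch of that mechanism, not the EventStorage implementation: the helper name `common_view_ddl` and the toy table and column names are assumptions made for illustration. It raises an error instead of emitting an invalid `SELECT  FROM` when the tables share no column.

```python
def common_view_ddl(view_name: str, tables: dict) -> str:
    """Build a CREATE VIEW statement over the columns shared by all tables.

    tables maps a table name to its list of column names.
    Hypothetical helper for illustration, not part of ConferenceCorpus.
    """
    if not tables:
        raise ValueError("no tables given")
    # columns that occur in every table
    common = set.intersection(*(set(cols) for cols in tables.values()))
    if not common:
        # an empty column list would yield the invalid "SELECT  FROM ..."
        raise ValueError(f"tables of view {view_name} share no column")
    column_list = ", ".join(sorted(common))
    selects = [f"  SELECT {column_list} FROM {name}" for name in tables]
    return f"CREATE VIEW {view_name} AS\n" + "\nUNION\n".join(selects)


# toy schemas (assumed): only 'source' is present in both tables
event_tables = {
    "event_a": ["eventId", "source"],
    "event_b": ["gndId", "source"],
}
print(common_view_ddl("event", event_tables))
```

Failing early like this would at least make the misalignment visible when the views are created rather than when a test later queries a missing column.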

WolfgangFahl commented 2 years ago

This is exactly what we'd like to achieve by reactivating the proceedings title parser. Unfortunately, for the last few weeks there has not been a single day on which the nightly build ran through, so we are now in a catch-22 (chicken-and-egg) situation that we need to get out of and avoid in the future.

WolfgangFahl commented 2 years ago

sqlquery -qp ./cc.yaml -qn YearAndOrdinalColumns -f github -en cc

show all year and ordinal columns and types

select year and ordinal columns

query

WITH tables AS (SELECT name tableName, sql 
FROM sqlite_master WHERE type = 'table' AND tableName NOT LIKE 'sqlite_%')
SELECT fields.name, fields.type, tableName
FROM tables CROSS JOIN pragma_table_info(tables.tableName) fields
WHERE fields.name IN ('year', 'ordinal')
ORDER BY fields.name

result

name type tableName
ordinal INTEGER event_ceurws
ordinal INTEGER event_tibkat
ordinal INTEGER event_gnd
ordinal TEXT event_wikidata
ordinal INTEGER event_or
ordinal TEXT event_orbackup
ordinal INTEGER event_orclone
ordinal TEXT event_orclonebackup
year INTEGER event_confref
year INTEGER event_ceurws
year INTEGER event_dblp
year INTEGER event_wikicfp
year INTEGER event_crossref
year INTEGER event_tibkat
year INTEGER event_gnd
year INTEGER event_wikidata
year INTEGER event_or
year INTEGER event_orbackup
year INTEGER event_orclone
year INTEGER event_orclonebackup
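
The same check can be scripted so that type mismatches are reported automatically. This is a minimal sketch using Python's built-in `sqlite3` module against an in-memory database; the two toy tables `event_a` and `event_b` and the function name `year_and_ordinal_columns` are assumptions for illustration, and the table-valued `pragma_table_info` requires SQLite 3.16 or later.

```python
import sqlite3

# Same idea as the query above, with single-quoted string literals
QUERY = """
WITH tables AS (
  SELECT name AS tableName
  FROM sqlite_master
  WHERE type = 'table' AND name NOT LIKE 'sqlite_%'
)
SELECT fields.name, fields.type, tables.tableName
FROM tables CROSS JOIN pragma_table_info(tables.tableName) AS fields
WHERE fields.name IN ('year', 'ordinal')
ORDER BY fields.name, tables.tableName
"""


def year_and_ordinal_columns(conn):
    """Return (column name, declared type, table name) rows."""
    return conn.execute(QUERY).fetchall()


# toy database (assumed schema): 'ordinal' is declared with two different types
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_a (eventId TEXT, year INTEGER, ordinal INTEGER)")
conn.execute("CREATE TABLE event_b (eventId TEXT, year INTEGER, ordinal TEXT)")

rows = year_and_ordinal_columns(conn)

# collect the declared types per column name and report the inconsistent ones
types_by_column = {}
for name, col_type, _table in rows:
    types_by_column.setdefault(name, set()).add(col_type)
mismatched = sorted(n for n, t in types_by_column.items() if len(t) > 1)
print("columns with inconsistent types:", mismatched)
```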

WolfgangFahl commented 2 years ago

ccUpdate --updateSource wikidata

Starting update of conference corpus database from wikidata cache ...
Starting loading Wikidata ...
loading Wikidata took 37.6 s
update of conference corpus database from wikidata cacheWikidata: 8019 events 4261 eventseries took 37.6 s

{
    "acronym": "ISWC 2008",
    "country": "Germany",
    "countryId": "Q183",
    "dblpId": "conf/semweb/2008",
    "describedAtUrl": null,
    "doi": "10.1007/978-3-540-88564-1",
    "endDate": "2008-10-30T00:00:00",
    "eventId": "Q48026643",
    "eventInSeries": "International Semantic Web Conference",
    "eventInSeriesId": "Q6053150",
    "eventTitle": null,
    "followedById": null,
    "gndId": "10360484-4",
    "homepage": "http://iswc2008.semanticweb.org",
    "language": null,
    "location": "Kongresszentrum Karlsruhe",
    "locationId": "Q1781594",
    "mainSubject": "Semantic Web",
    "ordinal": 7,
    "ppn": "579171965",
    "proceedings": "http://www.wikidata.org/entity/Q98093643",
    "proceedingsLabel": "The Semantic Web - ISWC 2008: 7th International Semantic Web Conference, ISWC 2008, Karlsruhe, Germany, October 26-30, 2008. Proceedings",
    "source": "wikidata",
    "startDate": "2008-10-26T00:00:00",
    "title": "The 7th International Semantic Web Conference",
    "url": "http://www.wikidata.org/entity/Q48026643",
    "wikiCfpId": "1974",
    "year": 2008
}

WolfgangFahl commented 2 years ago

name type tableName
ordinal INTEGER event_ceurws
ordinal INTEGER event_tibkat
ordinal INTEGER event_gnd
ordinal INTEGER event_or
ordinal TEXT event_orbackup
ordinal INTEGER event_orclone
ordinal TEXT event_orclonebackup
ordinal INTEGER event_wikidata
year INTEGER event_confref
year INTEGER event_ceurws
year INTEGER event_dblp
year INTEGER event_wikicfp
year INTEGER event_crossref
year INTEGER event_tibkat
year INTEGER event_gnd
year INTEGER event_or
year INTEGER event_orbackup
year INTEGER event_orclone
year INTEGER event_orclonebackup
year INTEGER event_wikidata

WolfgangFahl commented 2 years ago

from ptp.ordinal import Ordinal
...
class CrossrefEvent(Event):
...
  def postProcess(self, eventInfo:dict) -> dict:
...
            Ordinal.addParsedOrdinal(rawEvent)

ccUpdate --updateSource crossref --sample "ICSE '18"

Starting update of conference corpus database from crossref cache ...
Starting loading CrossRef ...
read 55441 events in 2.0 s
loading CrossRef took 8.0 s
update of conference corpus database from crossref cachecrossref.org: 55441 events 1 eventseries took 8.0 s

{
    "acronym": "ICSE '18",
    "doi": "10.1145/3196478",
    "endDate": null,
    "eventId": "10.1145/3196478",
    "location": "Gothenburg Sweden",
    "lookupAcronym": null,
    "month": null,
    "name": "ICSE '18: 40th International Conference on Software Engineering",
    "number": null,
    "ordinal": 4,
    "source": "crossref",
    "sponsor": "SIGSOFT ACM Special Interest Group on Software Engineering\u21f9IEEE-CS Computer Society",
    "startDate": null,
    "theme": null,
    "title": "Proceedings of the 4th International Workshop on Software Engineering for Smart Cyber-Physical Systems",
    "url": "https://api.crossref.org/v1/works/10.1145/3196478",
    "year": null
}
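
To illustrate how an ordinal can end up in a record like the one above, here is a minimal regex-based sketch. It is not the `ptp.ordinal` implementation; the function name `parse_ordinal` and the pattern are assumptions, and only numeric ordinals such as "4th" are handled.

```python
import re
from typing import Optional

# digits immediately followed by an English ordinal suffix, e.g. "4th", "22nd"
ORDINAL_PATTERN = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b", re.IGNORECASE)


def parse_ordinal(text: Optional[str]) -> Optional[int]:
    """Return the first numeric ordinal found in text, or None."""
    if not text:
        return None
    match = ORDINAL_PATTERN.search(text)
    return int(match.group(1)) if match else None


title = ("Proceedings of the 4th International Workshop on "
         "Software Engineering for Smart Cyber-Physical Systems")
name = "ICSE '18: 40th International Conference on Software Engineering"
print(parse_ordinal(title), parse_ordinal(name))
```

In the crossref sample above, the `title` field mentions the 4th workshop while the `name` field mentions the 40th conference, so the resulting ordinal depends on which field is parsed.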

WolfgangFahl commented 2 years ago

name type tableName
ordinal INTEGER event_ceurws
ordinal INTEGER event_tibkat
ordinal INTEGER event_gnd
ordinal INTEGER event_or
ordinal TEXT event_orbackup
ordinal INTEGER event_orclone
ordinal TEXT event_orclonebackup
ordinal INTEGER event_wikidata
ordinal INTEGER event_crossref
year INTEGER event_confref
year INTEGER event_ceurws
year INTEGER event_dblp
year INTEGER event_wikicfp
year INTEGER event_tibkat
year INTEGER event_gnd
year INTEGER event_or
year INTEGER event_orbackup
year INTEGER event_orclone
year INTEGER event_orclonebackup
year INTEGER event_wikidata
year INTEGER event_crossref

tholzheim commented 2 years ago

Updated post processing of extracted LoDs to convert ordinals to int: https://github.com/WolfgangFahl/ConferenceCorpus/blob/dc0f19b004f6b435d2458a7792aa2fafeaf9987c/corpus/datasources/openresearch.py#L357

name type tableName
ordinal INTEGER event_ceurws
ordinal INTEGER event_tibkat
ordinal INTEGER event_gnd
ordinal INTEGER event_or
ordinal INTEGER event_orbackup
ordinal INTEGER event_orclone
ordinal INTEGER event_orclonebackup
ordinal INTEGER event_wikidata
ordinal INTEGER event_crossref
year INTEGER event_confref
year INTEGER event_ceurws
year INTEGER event_dblp
year INTEGER event_wikicfp
year INTEGER event_tibkat
year INTEGER event_gnd
year INTEGER event_or
year INTEGER event_orbackup
year INTEGER event_orclone
year INTEGER event_orclonebackup
year INTEGER event_wikidata
year INTEGER event_crossref
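
The conversion step can be pictured as follows. This is a minimal sketch under assumptions, not the code in openresearch.py: the function name `normalize_ordinals` is hypothetical, and values that cannot be read as an integer are set to None here, which is one possible policy among several.

```python
def normalize_ordinals(records: list, key: str = "ordinal") -> list:
    """Convert the ordinal value of each record (dict) to int in place.

    Hypothetical helper: values that cannot be parsed become None.
    """
    for record in records:
        value = record.get(key)
        if value is None or isinstance(value, int):
            continue  # nothing to convert
        try:
            record[key] = int(str(value).strip())
        except ValueError:
            record[key] = None
    return records


# toy list of dicts (assumed values)
lod = [{"ordinal": "7"}, {"ordinal": 12}, {"ordinal": "seventh"}, {"ordinal": None}]
print(normalize_ordinals(lod))
```

With consistently typed values, the column type inferred for the table no longer falls back to TEXT for a datasource that happens to deliver its ordinals as strings.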

WolfgangFahl commented 2 years ago

from ptp.ordinal import Ordinal
...
class DblpEvent(Event):
...
    @staticmethod
    def postProcessLodRecord(rawEvent:dict):
...
            Ordinal.addParsedOrdinal(rawEvent)

ccUpdate --updateSource dblp

Starting update of conference corpus database from dblp cache ...
configureCorpusLookup callback called
Starting loading dblp computer science bibliography ...
Warning - using full /home/wf/.dblp/dblp.xml dataset ~9.1m records!
Warning - using full /home/wf/.dblp/dblp.xml dataset ~9.1m records!
loading dblp computer science bibliography took 7.5 s
update of conference corpus database from dblp cachedblp: 50248 events 5454 eventseries took 7.6 s

{
    "acronym": "ISWC 2008",
    "booktitle": "ISWC",
    "doi": null,
    "ee": "https://ieeexplore.ieee.org/xpl/conhome/4840596/proceeding,http://www.computer.org/csdl/proceedings/iswc/2008/2637/00/index.html",
    "endDate": null,
    "eventId": "conf/iswc/2008",
    "isbn": "978-1-4244-2637-9",
    "mdate": "2019-10-16",
    "ordinal": 12,
    "publicationSeries": null,
    "series": "iswc",
    "source": "dblp",
    "startDate": null,
    "title": "12th IEEE International Symposium on Wearable Computers (ISWC 2008), September 28 - October 1, 2008, Pittsburgh, PA, USA",
    "url": "https://dblp.org/db/conf/iswc/iswc2008.html",
    "year": 2008
}

WolfgangFahl commented 2 years ago

ccUpdate --updateSource wikicfp

Starting update of conference corpus database from wikicfp cache ...
configureCorpusLookup callback called
Starting loading WikiCFP ...
loading WikiCFP took 13.3 s
update of conference corpus database from wikicfp cacheWikiCFP: 90339 events 6019 eventseries took 13.3 s

{
    "Final_Version_Due": null,
    "Notification_Due": null,
    "Submission_Deadline": "2008-05-16T00:00:00",
    "acronym": "ISWC 2008",
    "deleted": false,
    "endDate": "2008-10-30T00:00:00",
    "eventId": "1974",
    "eventType": "Conference",
    "locality": "Karlsruhe, Germany",
    "ordinal": null,
    "series": "International Semantic Web Conference",
    "seriesId": "1769",
    "source": "wikicfp",
    "startDate": "2008-10-26T00:00:00",
    "title": "ISWC  2008 : International Semantic Web Conference",
    "url": "http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=1974",
    "wikiCfpId": 1974,
    "year": 2008
}