acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

paper -> proceedings -> event links in wikidata #1825

Open WolfgangFahl opened 2 years ago

WolfgangFahl commented 2 years ago

For my research I am trying to trace from papers to events (conferences). The following SPARQL query gives some good results, but the result seems to be incomplete. How could this situation be improved?

# ACL Anthology article ID 
SELECT ?article ?articleLabel ?aclId ?publishedIn ?publishedInLabel ?event ?eventLabel WHERE {
  #ACL Anthology article ID 
  ?article wdt:P7505 ?aclId.
  ?article rdfs:label ?articleLabel .
  #?aclIdStatement (ps:P7505) ?aclId.
  ?article wdt:P1433 ?publishedIn.
  ?publishedIn rdfs:label ?publishedInLabel .
  OPTIONAL {
     # is proceedings from
     ?publishedIn wdt:P4745 ?event.
     ?event rdfs:label ?eventLabel.
  }
}
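
One way to put a number on "incomplete": with the OPTIONAL block, proceedings that lack a P4745 statement still show up, just without an ?event binding. Below is a minimal sketch, assuming the result was fetched in the standard SPARQL JSON results format (where an unbound variable is simply absent from a row). The helper name `split_by_event_link` and the placeholder ID `Q_UNLINKED_EXAMPLE` are made up for illustration.

```python
from typing import Dict, List, Tuple

Binding = Dict[str, Dict[str, str]]


def split_by_event_link(bindings: List[Binding]) -> Tuple[List[str], List[str]]:
    """Return (linked, unlinked) proceedings URIs from SPARQL JSON bindings."""
    linked, unlinked = set(), set()
    for row in bindings:
        proceedings = row["publishedIn"]["value"]
        if "event" in row:
            linked.add(proceedings)
        else:
            unlinked.add(proceedings)
    # A proceedings item with at least one event link counts as linked.
    return sorted(linked), sorted(unlinked - linked)


WD = "http://www.wikidata.org/entity/"
sample = [
    # Row taken from this thread: proceedings with an event link.
    {"publishedIn": {"type": "uri", "value": WD + "Q95997327"},
     "event": {"type": "uri", "value": WD + "Q61919909"}},
    # Hypothetical row: a proceedings item without a P4745 statement.
    {"publishedIn": {"type": "uri", "value": WD + "Q_UNLINKED_EXAMPLE"}},
]
linked, unlinked = split_by_event_link(sample)
print(len(linked), "linked,", len(unlinked), "unlinked")
```

The unlinked list is then the set of proceedings items that would need a P4745 statement added on Wikidata.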


WolfgangFahl commented 2 years ago

I have used the sparqlquery command line tool from https://github.com/WolfgangFahl/pyLoDStorage to show the details of the query, which is named "ACL-Paper2Event" in https://github.com/WolfgangFahl/pyLoDStorage/blob/master/sampledata/scholia.yaml:

sparqlquery -qp scholia.yaml -qn "ACL-Paper2Event" -f github

ACL-Paper2Event

query

# ACL Anthology article ID 
SELECT ?article ?articleLabel ?aclId ?publishedIn ?publishedInLabel ?event ?eventLabel WHERE {
  #ACL Anthology article ID
  ?article wdt:P7505 ?aclId.
  ?article rdfs:label ?articleLabel .
  #?aclIdStatement (ps:P7505) ?aclId.
  ?article wdt:P1433 ?publishedIn.
  ?publishedIn rdfs:label ?publishedInLabel .
  #OPTIONAL {
     # is proceedings from
     ?publishedIn wdt:P4745 ?event.
     ?event rdfs:label ?eventLabel.
  #}
} LIMIT 50

result

| article | articleLabel | aclId | publishedIn | publishedInLabel | event | eventLabel | identical rows returned |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Q79020060 | Common Voice: A Massively-Multilingual Speech Corpus | 2020.lrec-1.520 | Q95997327 | Proceedings of The 12th Language Resources and Evaluation Conference | Q61919909 | 12th Conference on Language Resources and Evaluation | 8 |
| Q61895831 | The word analogy testing caveat | N18-2039 | Q55434859 | Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) | Q75696024 | The 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies | 12 |
| Q110887400 | The Power of Scale for Parameter-Efficient Prompt Tuning | 2021.emnlp-main.243 | Q109517629 | Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing | Q109517651 | The 2021 Conference on Empirical Methods in Natural Language Processing | 4 |
| Q108673464 | Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia | 2020.emnlp-demos.4 | Q108673475 | Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations | Q82290350 | The 2020 Conference on Empirical Methods in Natural Language Processing | 8 |
| Q107060118 | The Danish Gigaword Corpus | 2021.nodalida-main.46 | Q107059887 | Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), May 31-June 2, 2021 | Q102274071 | The 23rd Nordic Conference on Computational Linguistics | 12 |
| Q105730737 | DanNet2: Extending the coverage of adjectives in DanNet based on thesaurus data (project presentation) | 2021.gwc-1.31 | Q105730699 | Proceedings of the 11th Global Wordnet Conference | Q105730832 | The 11th Global WordNet Conference | 4 |
| Q107009138 | Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training | 2021.naacl-main.278 | Q107009154 | Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies | Q107009143 | 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics | 2 |

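
The 50 returned rows contain only seven distinct combinations. A plausible cause (an assumption, not confirmed in this thread) is that rdfs:label matches one label per language, so each combination of article, proceedings and event labels yields its own row; a language filter such as FILTER(LANG(?articleLabel) = "en") on each label would be the usual query-side fix. On the client side the rows can be collapsed as in the sketch below, where `collapse_rows` is a hypothetical helper and each row is a tuple of the IDs shown in the result.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, ...]


def collapse_rows(rows: Iterable[Row]) -> List[Tuple[Row, int]]:
    """Collapse identical rows, keeping first-seen order and a repeat count."""
    counts = Counter(rows)  # Counter keeps insertion order on Python 3.7+
    return list(counts.items())


# Two of the rows from the result, repeated as often as they were returned.
lrec = ("Q79020060", "2020.lrec-1.520", "Q95997327", "Q61919909")
emnlp = ("Q110887400", "2021.emnlp-main.243", "Q109517629", "Q109517651")
rows = [lrec] * 8 + [emnlp] * 4

for row, n in collapse_rows(rows):
    print(n, *row)
```
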
mbollmann commented 2 years ago

Is there any reason you want to do this on Wikidata, instead of using the XML/YAML files we have in this repo?

(FWIW I'm not aware of any Anthology maintainer being involved in Wikidata, so I would be surprised if anyone of us could help you there.)

WolfgangFahl commented 2 years ago

@mbollmann thx for the swift reply. Wikidata is just a good environment, especially given the Scholia project. See https://scholia.toolforge.org/event-series/Q56571145 for an example event entry. https://www.wikidata.org/wiki/Property:P7505 states that there are potentially 50,000 articles. On the aclanthology website I found "The ACL Anthology currently hosts 74465 papers on the study of computational linguistics and natural language processing."

Indeed, I might be interested in analysing the XML/YAML files and looking for conference proceedings. It looks like there has not yet been a bot transferring the entries to Wikidata (the WikiCite project).

mbollmann commented 2 years ago

I see. I'm not familiar with the Scholia project unfortunately; I do know Wikidata, but I am not aware of any transfer between the ACL Anthology and Wikidata, or who might have done it for the entries that already exist there.

Here's a quick example of what you can get from our Python library (in bin/):

>>> from anthology import Anthology  # run from bin/ so the package is importable
>>> ant = Anthology("../data/")
>>> paper = ant.papers["2020.lrec-1.520"]
>>> ant.volumes[paper.parent_volume_id].get_title()
'Proceedings of the 12th Language Resources and Evaluation Conference'
>>> ant.venues.get_main_venue("2020.lrec-1.520")
'LREC'
>>> ant.venues.get_by_acronym("LREC")["name"]
'International Conference on Language Resources and Evaluation'

...where "2020.lrec-1.520" can be any ACL paper ID. The information is pulled from the XML/YAML files in the data/ directory, so you could also use other tools to extract data from them.
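
Since the paper ID already encodes the proceedings volume, the paper -> proceedings step can also be done on the ID string alone. The sketch below rests on a simplified assumption about the ID scheme: new-style IDs look like `YYYY.venue-volume.paper`, and old-style IDs are a letter plus two-digit year followed by four digits, of which the first is the volume (the first two for `W` workshop collections). It ignores special cases and is not the library's own parser; `volume_of` is a made-up name.

```python
def volume_of(paper_id: str) -> str:
    """Return the ID of the volume a paper ID belongs to (simplified)."""
    collection, rest = paper_id.split("-", 1)
    if "." in collection:
        # New-style ID, e.g. 2020.lrec-1.520 -> volume part is "1"
        volume = rest.split(".", 1)[0]
    elif collection.startswith("W"):
        # Old-style workshop ID: two-digit volume, two-digit paper number
        volume = rest[:2]
    else:
        # Other old-style IDs: one-digit volume, three-digit paper number
        volume = rest[:1]
    return f"{collection}-{volume}"


for pid in ("2020.lrec-1.520", "N18-2039", "2021.emnlp-main.243"):
    print(pid, "->", volume_of(pid))
```

Grouping paper IDs by this volume ID gives one bucket per proceedings volume, which can then be matched against venue information from the data/ files.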