Closed by edanerg 1 year ago
Can you isolate the troublesome record?
This is one
@inproceedings{a-gorog-2014-quality, title = "Quality evaluation today: the Dynamic Quality Framework", author = {{A.G{\"o}r{\"o}g}}, booktitle = "Proceedings of Translating and the Computer 36", month = nov # " 27-28", year = "2014", address = "London, UK", publisher = "AsLing", url = "https://aclanthology.org/2014.tc-1.21", }
So if the error comes from a third-party lib, should we just eat the exception and move on?
Can we build this into a test case?
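For what it's worth, here is a rough sketch of what "eat the exception and move on" could look like, written in Python only for illustration. `parse_all` and `toy_parser` are hypothetical names, and `toy_parser` is a stand-in that simply rejects records whose brace counts differ; it is not the third-party parser discussed in this thread.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


def parse_all(records: Iterable[str], parse_one: Callable[[str], dict]) -> List[dict]:
    """Parse each record, skipping (and logging) the ones the parser rejects."""
    parsed = []
    for number, record in enumerate(records, start=1):
        try:
            parsed.append(parse_one(record))
        except Exception as exc:  # broad on purpose: the parser is third-party code
            logger.warning("Skipping record %d: %s", number, exc)
    return parsed


def toy_parser(record: str) -> dict:
    """Hypothetical stand-in parser: rejects records with unbalanced braces."""
    if record.count("{") != record.count("}"):
        raise ValueError("unbalanced braces")
    return {"raw": record}
```

A test case could then feed in one known-bad record next to good ones and assert that only the good ones come back, so a single malformed entry no longer aborts the whole run.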
The troublesome records for me are:
@inproceedings{santosh-etal-2020-detecting,
title = "Detecting Emerging Symptoms of {COVID}-19 using Context-based {T}witter Embeddings",
author = "Santosh, Roshan and
Schwartz, H. and
Eichstaedt, Johannes and
Ungar, Lyle and
Guntuku, Sharath Chandra",
booktitle = "Proceedings of the 1st Workshop on {NLP} for {COVID}-19 (Part 2) at {EMNLP} 2020",
month = dec,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.nlpcovid19-2.35",
doi = "10.18653/v1/2020.nlpcovid19-2.35",
abstract = "In this paper, we present an iterative graph-based approach for the detection of symptoms of COVID-19, the pathology of which seems to be evolving. More generally, the method can be applied to finding context-specific words and texts (e.g. symptom mentions) in large imbalanced corpora (e.g. all tweets mentioning }{\#}COVID-19). Given the novelty of COVID-19, we also test if the proposed approach generalizes to the problem of detecting Adverse Drug Reaction (ADR). We find that the approach applied to Twitter data can detect symptom mentions substantially before to their being reported by the Centers for Disease Control (CDC).",
}
@inproceedings{ferret-2014-compounds,
title = "Compounds and distributional thesauri",
author = "Ferret, Olivier",
booktitle = "Proceedings of the Ninth International Conference on Language Resources and Evaluation ({LREC}'14)",
month = may,
year = "2014",
address = "Reykjavik, Iceland",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2014/pdf/754_Paper.pdf",
pages = "2979--2984",
abstract = "The building of distributional thesauri from corpora is a problem that was the focus of a significant number of articles, starting with (Grefenstette, 1994} and followed by (Lin, 1998}, (Curran and Moens, 2002) or (Heylen and Peirsman, 2007). However, in all these cases, only single terms were considered. More recently, the topic of compositionality in the framework of distributional semantic representations has come to the surface and was investigated for building the semantic representation of phrases or even sentences from the representation of their words. However, this work was not done until now with the objective of building distributional thesauri. In this article, we investigate the impact of the introduction of compounds for achieving such building. More precisely, we consider compounds as undividable lexical units and evaluate their influence according to three different roles: as features in the distributional contexts of single terms, as possible neighbors of single term entries and finally, as entries of a thesaurus. This investigation was conducted through an intrinsic evaluation for a large set of nominal English single terms and compounds with various frequencies.",
}
I modified them to
@inproceedings{santosh-etal-2020-detecting,
title = "Detecting Emerging Symptoms of {COVID}-19 using Context-based {T}witter Embeddings",
author = "Santosh, Roshan and
Schwartz, H. and
Eichstaedt, Johannes and
Ungar, Lyle and
Guntuku, Sharath Chandra",
booktitle = "Proceedings of the 1st Workshop on {NLP} for {COVID}-19 (Part 2) at {EMNLP} 2020",
month = dec,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.nlpcovid19-2.35",
doi = "10.18653/v1/2020.nlpcovid19-2.35",
abstract = "In this paper, we present an iterative graph-based approach for the detection of symptoms of COVID-19, the pathology of which seems to be evolving. More generally, the method can be applied to finding context-specific words and texts (e.g. symptom mentions) in large imbalanced corpora (e.g. all tweets mentioning {\#}COVID-19). Given the novelty of COVID-19, we also test if the proposed approach generalizes to the problem of detecting Adverse Drug Reaction (ADR). We find that the approach applied to Twitter data can detect symptom mentions substantially before to their being reported by the Centers for Disease Control (CDC).",
}
@inproceedings{ferret-2014-compounds,
title = "Compounds and distributional thesauri",
author = "Ferret, Olivier",
booktitle = "Proceedings of the Ninth International Conference on Language Resources and Evaluation ({LREC}'14)",
month = may,
year = "2014",
address = "Reykjavik, Iceland",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2014/pdf/754_Paper.pdf",
pages = "2979--2984",
abstract = "The building of distributional thesauri from corpora is a problem that was the focus of a significant number of articles, starting with (Grefenstette, 1994) and followed by (Lin, 1998), (Curran and Moens, 2002) or (Heylen and Peirsman, 2007). However, in all these cases, only single terms were considered. More recently, the topic of compositionality in the framework of distributional semantic representations has come to the surface and was investigated for building the semantic representation of phrases or even sentences from the representation of their words. However, this work was not done until now with the objective of building distributional thesauri. In this article, we investigate the impact of the introduction of compounds for achieving such building. More precisely, we consider compounds as undividable lexical units and evaluate their influence according to three different roles: as features in the distributional contexts of single terms, as possible neighbors of single term entries and finally, as entries of a thesaurus. This investigation was conducted through an intrinsic evaluation for a large set of nominal English single terms and compounds with various frequencies.",
}
(I removed one stray closing brace from the first record and replaced the two mismatched closing braces with parentheses in the second.) After that, I was able to index the file successfully.
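In case it helps anyone locate similar records, below is a minimal sketch of a brace-balance check for a single field value. The function name `first_brace_problem` is made up for this example, and the check is a simplification (it only tracks `{`/`}` nesting and skips backslash-escaped characters), so it should not be read as what the actual parser does.

```python
from typing import Optional


def first_brace_problem(value: str) -> Optional[int]:
    """Return the index of an unmatched brace in `value`, or None if balanced."""
    open_positions = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            i += 2  # skip an escaped character such as \# or \{
            continue
        if ch == "{":
            open_positions.append(i)
        elif ch == "}":
            if not open_positions:
                return i  # closing brace with nothing open
            open_positions.pop()
        i += 1
    # Any brace still open is unmatched; report the outermost one.
    return open_positions[0] if open_positions else None


# Fragments taken from the abstracts above.
bad = r"(e.g. all tweets mentioning }{\#}COVID-19)"
fixed = r"(e.g. all tweets mentioning {\#}COVID-19)"
print(first_brace_problem(bad), first_brace_problem(fixed))
```

Run over each field of each record, a check like this would flag both abstracts above before they reach the indexer.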
Maybe indexing the anthology via this method would be easier? anserini/blob/master/docs/acl-anthology.md
I think this issue has been addressed? Given that we have https://github.com/castorini/pyserini/pull/1552 - and there have been no complaints about index failures...
Closing for now.
Command used:
Full error message
The problem probably comes from bibtexcollection. This file uses a third-party library to do the parsing, and the error comes from that library.