castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Indexing fails for acl anthology #1787

Closed edanerg closed 1 year ago

edanerg commented 2 years ago

Command used:

sh target/appassembler/bin/IndexCollection \
-collection BibtexCollection -generator BibtexGenerator \
-threads 8 -input {/path/to/bib_files/} \
-index {/path/to/bibtex_indexes} \
-storePositions -storeDocvectors -storeContents -storeRaw

Full error message:

2022-03-01 02:28:28,573 INFO  [main] index.IndexCollection (IndexCollection.java:643) - Setting log level to INFO
2022-03-01 02:28:28,576 INFO  [main] index.IndexCollection (IndexCollection.java:646) - Starting indexer...
2022-03-01 02:28:28,576 INFO  [main] index.IndexCollection (IndexCollection.java:647) - ============ Loading Parameters ============
2022-03-01 02:28:28,577 INFO  [main] index.IndexCollection (IndexCollection.java:648) - DocumentCollection path: /home/y3222wan/anthology_test
2022-03-01 02:28:28,577 INFO  [main] index.IndexCollection (IndexCollection.java:649) - CollectionClass: BibtexCollection
2022-03-01 02:28:28,578 INFO  [main] index.IndexCollection (IndexCollection.java:650) - Generator: BibtexGenerator
2022-03-01 02:28:28,578 INFO  [main] index.IndexCollection (IndexCollection.java:651) - Threads: 8
2022-03-01 02:28:28,578 INFO  [main] index.IndexCollection (IndexCollection.java:652) - Stemmer: porter
2022-03-01 02:28:28,579 INFO  [main] index.IndexCollection (IndexCollection.java:653) - Keep stopwords? false
2022-03-01 02:28:28,579 INFO  [main] index.IndexCollection (IndexCollection.java:654) - Stopwords:  null
2022-03-01 02:28:28,579 INFO  [main] index.IndexCollection (IndexCollection.java:655) - Store positions? true
2022-03-01 02:28:28,580 INFO  [main] index.IndexCollection (IndexCollection.java:656) - Store docvectors? true
2022-03-01 02:28:28,580 INFO  [main] index.IndexCollection (IndexCollection.java:657) - Store document "contents" field? true
2022-03-01 02:28:28,580 INFO  [main] index.IndexCollection (IndexCollection.java:658) - Store document "raw" field? true
2022-03-01 02:28:28,581 INFO  [main] index.IndexCollection (IndexCollection.java:659) - Optimize (merge segments)? false
2022-03-01 02:28:28,581 INFO  [main] index.IndexCollection (IndexCollection.java:660) - Whitelist: null
2022-03-01 02:28:28,581 INFO  [main] index.IndexCollection (IndexCollection.java:661) - Pretokenized?: false
2022-03-01 02:28:28,581 INFO  [main] index.IndexCollection (IndexCollection.java:681) - Directly building Lucene indexes...
2022-03-01 02:28:28,582 INFO  [main] index.IndexCollection (IndexCollection.java:682) - Index path: /home/y3222wan/anthology_index_test
2022-03-01 02:28:28,587 INFO  [main] index.IndexCollection (IndexCollection.java:731) - ============ Indexing Collection ============
2022-03-01 02:28:28,801 INFO  [main] index.IndexCollection (IndexCollection.java:832) - Thread pool with 8 threads initialized.
2022-03-01 02:28:28,801 INFO  [main] index.IndexCollection (IndexCollection.java:834) - Initializing collection in /home/y3222wan/anthology_test
2022-03-01 02:28:28,806 INFO  [main] index.IndexCollection (IndexCollection.java:843) - 1 file found
2022-03-01 02:28:28,806 INFO  [main] index.IndexCollection (IndexCollection.java:844) - Starting to index...
2022-03-01 02:29:26,916 ERROR [pool-2-thread-1] collection.BibtexCollection$Segment (BibtexCollection.java:77) - Error: Could not parse BibTeXEncountered unexpected token: "o" <NAME>
    at line 559750, column 22.

Was expecting one of:

    "#"
    ","
    "}"

2022-03-01 02:29:26,918 ERROR [pool-2-thread-1] index.IndexCollection$LocalIndexerThread (IndexCollection.java:251) - pool-2-thread-1: Unexpected Exception:
java.io.IOException: org.jbibtex.ParseException: Encountered unexpected token: "o" <NAME>
    at line 559750, column 22.

Was expecting one of:

    "#"
    ","
    "}"

    at io.anserini.collection.BibtexCollection$Segment.<init>(BibtexCollection.java:78) ~[anserini-0.13.6-SNAPSHOT.jar:?]
    at io.anserini.collection.BibtexCollection.createFileSegment(BibtexCollection.java:54) ~[anserini-0.13.6-SNAPSHOT.jar:?]
    at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:186) [anserini-0.13.6-SNAPSHOT.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.jbibtex.ParseException: Encountered unexpected token: "o" <NAME>
    at line 559750, column 22.

Was expecting one of:

    "#"
    ","
    "}"

    at org.jbibtex.BibTeXParser.generateParseException(BibTeXParser.java:1164) ~[jbibtex-1.0.17.jar:?]
    at org.jbibtex.BibTeXParser.jj_consume_token(BibTeXParser.java:1035) ~[jbibtex-1.0.17.jar:?]
    at org.jbibtex.BibTeXParser.Entry(BibTeXParser.java:472) ~[jbibtex-1.0.17.jar:?]
    at org.jbibtex.BibTeXParser.Object(BibTeXParser.java:293) ~[jbibtex-1.0.17.jar:?]
    at org.jbibtex.BibTeXParser.Database(BibTeXParser.java:246) ~[jbibtex-1.0.17.jar:?]
    at org.jbibtex.BibTeXParser.parse(BibTeXParser.java:57) ~[jbibtex-1.0.17.jar:?]
    at io.anserini.collection.BibtexCollection$Segment.<init>(BibtexCollection.java:75) ~[anserini-0.13.6-SNAPSHOT.jar:?]
    ... 5 more
2022-03-01 02:29:26,973 INFO  [main] index.IndexCollection (IndexCollection.java:928) - Indexing Complete! 0 documents indexed
2022-03-01 02:29:26,973 INFO  [main] index.IndexCollection (IndexCollection.java:929) - ============ Final Counter Values ============
2022-03-01 02:29:26,973 INFO  [main] index.IndexCollection (IndexCollection.java:930) - indexed:                0
2022-03-01 02:29:26,973 INFO  [main] index.IndexCollection (IndexCollection.java:931) - unindexable:            0
2022-03-01 02:29:26,974 INFO  [main] index.IndexCollection (IndexCollection.java:932) - empty:                  0
2022-03-01 02:29:26,974 INFO  [main] index.IndexCollection (IndexCollection.java:933) - skipped:                0
2022-03-01 02:29:26,974 INFO  [main] index.IndexCollection (IndexCollection.java:934) - errors:                 0
2022-03-01 02:29:26,980 INFO  [main] index.IndexCollection (IndexCollection.java:937) - Total 0 documents indexed in 00:00:58

The problem probably comes from BibtexCollection. That class uses a third-party library (jbibtex, per the stack trace) to do the parsing, and the error comes from there:

public Segment(Path path) throws IOException {
  super(path);
  bufferedReader = new BufferedReader(new FileReader(path.toString()));
  BibTeXParser bibtexParser;
  try {
    bibtexParser = new BibTeXParser();
  } catch (TokenMgrException | ParseException e) {
    LOG.error("Error: Could not initialize BibTeX parser" + e.getMessage());
    throw new IOException(e);
  }
  try {
    database = bibtexParser.parse(bufferedReader);
  } catch (ParseException | TokenMgrException | ObjectResolutionException e) {
    LOG.error("Error: Could not parse BibTeX" + e.getMessage());
    throw new IOException(e);
  }
  Map<Key, BibTeXEntry> entryMap = database.getEntries();
  iterator = entryMap.entrySet().iterator();
}

lintool commented 2 years ago

Can you isolate the troublesome record?

ToluClassics commented 2 years ago

This is one

@inproceedings{a-gorog-2014-quality,
    title = "Quality evaluation today: the Dynamic Quality Framework",
    author = {{A.G{\"o}r{\"o}g}},
    booktitle = "Proceedings of Translating and the Computer 36",
    month = nov # " 27-28",
    year = "2014",
    address = "London, UK",
    publisher = "AsLing",
    url = "https://aclanthology.org/2014.tc-1.21",
}

lintool commented 2 years ago

So, if the error comes from a 3rd party lib, we should just eat the exception and move on?

Can we build this into a test case?
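
One possible reading of "eat the exception and move on" is to parse the file one record at a time, so that a single malformed record is skipped instead of failing the whole segment (the log above ends with 0 documents indexed). The following is only a minimal sketch under stated assumptions: `TolerantBibtexLoader` and `EntryParser` are hypothetical names, not Anserini or jbibtex APIs, and the real parser call would have to be adapted behind the `EntryParser` interface. It also assumes every record starts on a line beginning with `@`; string definitions or cross-references that span records would need extra handling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses BibTeX records one by one and skips the ones that fail. */
final class TolerantBibtexLoader {

  /** Stand-in for the real parser (assumption); may throw on malformed input. */
  interface EntryParser<T> {
    T parse(String entryText) throws Exception;
  }

  /** Parsed records plus the raw text of every record that was skipped. */
  static final class Result<T> {
    final List<T> parsed = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();
  }

  /** Splits the input into one chunk per record; a record starts at a line beginning with '@'. */
  static List<String> splitEntries(List<String> lines) {
    List<String> chunks = new ArrayList<>();
    StringBuilder current = null;
    for (String line : lines) {
      if (line.startsWith("@")) {
        if (current != null) {
          chunks.add(current.toString());
        }
        current = new StringBuilder();
      }
      if (current != null) { // text before the first record is ignored
        current.append(line).append('\n');
      }
    }
    if (current != null) {
      chunks.add(current.toString());
    }
    return chunks;
  }

  /** Parses each record independently; a failure affects only that record. */
  static <T> Result<T> load(List<String> lines, EntryParser<T> parser) {
    Result<T> result = new Result<>();
    for (String chunk : splitEntries(lines)) {
      try {
        result.parsed.add(parser.parse(chunk));
      } catch (Exception e) {
        // Keep the raw text so the skipped record can be logged, counted, or fixed later.
        result.skipped.add(chunk);
      }
    }
    return result;
  }
}
```

A unit test along these lines could feed in a small file containing one known-bad record and assert that the other records are still returned and that the bad one shows up in `skipped`.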

edanerg commented 2 years ago

The troublesome records for me are:

@inproceedings{santosh-etal-2020-detecting,
    title = "Detecting Emerging Symptoms of {COVID}-19 using Context-based {T}witter Embeddings",
    author = "Santosh, Roshan  and
      Schwartz, H.  and
      Eichstaedt, Johannes  and
      Ungar, Lyle  and
      Guntuku, Sharath Chandra",
    booktitle = "Proceedings of the 1st Workshop on {NLP} for {COVID}-19 (Part 2) at {EMNLP} 2020",
    month = dec,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.nlpcovid19-2.35",
    doi = "10.18653/v1/2020.nlpcovid19-2.35",
    abstract = "In this paper, we present an iterative graph-based approach for the detection of symptoms of COVID-19, the pathology of which seems to be evolving. More generally, the method can be applied to finding context-specific words and texts (e.g. symptom mentions) in large imbalanced corpora (e.g. all tweets mentioning }{\#}COVID-19). Given the novelty of COVID-19, we also test if the proposed approach generalizes to the problem of detecting Adverse Drug Reaction (ADR). We find that the approach applied to Twitter data can detect symptom mentions substantially before to their being reported by the Centers for Disease Control (CDC).",
}
@inproceedings{ferret-2014-compounds,
    title = "Compounds and distributional thesauri",
    author = "Ferret, Olivier",
    booktitle = "Proceedings of the Ninth International Conference on Language Resources and Evaluation ({LREC}'14)",
    month = may,
    year = "2014",
    address = "Reykjavik, Iceland",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2014/pdf/754_Paper.pdf",
    pages = "2979--2984",
    abstract = "The building of distributional thesauri from corpora is a problem that was the focus of a significant number of articles, starting with (Grefenstette, 1994} and followed by (Lin, 1998}, (Curran and Moens, 2002) or (Heylen and Peirsman, 2007). However, in all these cases, only single terms were considered. More recently, the topic of compositionality in the framework of distributional semantic representations has come to the surface and was investigated for building the semantic representation of phrases or even sentences from the representation of their words. However, this work was not done until now with the objective of building distributional thesauri. In this article, we investigate the impact of the introduction of compounds for achieving such building. More precisely, we consider compounds as undividable lexical units and evaluate their influence according to three different roles: as features in the distributional contexts of single terms, as possible neighbors of single term entries and finally, as entries of a thesaurus. This investigation was conducted through an intrinsic evaluation for a large set of nominal English single terms and compounds with various frequencies.",
}

I modified them to:

@inproceedings{santosh-etal-2020-detecting,
    title = "Detecting Emerging Symptoms of {COVID}-19 using Context-based {T}witter Embeddings",
    author = "Santosh, Roshan  and
      Schwartz, H.  and
      Eichstaedt, Johannes  and
      Ungar, Lyle  and
      Guntuku, Sharath Chandra",
    booktitle = "Proceedings of the 1st Workshop on {NLP} for {COVID}-19 (Part 2) at {EMNLP} 2020",
    month = dec,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.nlpcovid19-2.35",
    doi = "10.18653/v1/2020.nlpcovid19-2.35",
    abstract = "In this paper, we present an iterative graph-based approach for the detection of symptoms of COVID-19, the pathology of which seems to be evolving. More generally, the method can be applied to finding context-specific words and texts (e.g. symptom mentions) in large imbalanced corpora (e.g. all tweets mentioning {\#}COVID-19). Given the novelty of COVID-19, we also test if the proposed approach generalizes to the problem of detecting Adverse Drug Reaction (ADR). We find that the approach applied to Twitter data can detect symptom mentions substantially before to their being reported by the Centers for Disease Control (CDC).",
}
@inproceedings{ferret-2014-compounds,
    title = "Compounds and distributional thesauri",
    author = "Ferret, Olivier",
    booktitle = "Proceedings of the Ninth International Conference on Language Resources and Evaluation ({LREC}'14)",
    month = may,
    year = "2014",
    address = "Reykjavik, Iceland",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2014/pdf/754_Paper.pdf",
    pages = "2979--2984",
    abstract = "The building of distributional thesauri from corpora is a problem that was the focus of a significant number of articles, starting with (Grefenstette, 1994) and followed by (Lin, 1998), (Curran and Moens, 2002) or (Heylen and Peirsman, 2007). However, in all these cases, only single terms were considered. More recently, the topic of compositionality in the framework of distributional semantic representations has come to the surface and was investigated for building the semantic representation of phrases or even sentences from the representation of their words. However, this work was not done until now with the objective of building distributional thesauri. In this article, we investigate the impact of the introduction of compounds for achieving such building. More precisely, we consider compounds as undividable lexical units and evaluate their influence according to three different roles: as features in the distributional contexts of single terms, as possible neighbors of single term entries and finally, as entries of a thesaurus. This investigation was conducted through an intrinsic evaluation for a large set of nominal English single terms and compounds with various frequencies.",
}

(I removed the one stray closing brace in the first record and replaced the two mismatched closing braces with parentheses in the second.) With those changes, I was able to index the file successfully.
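
Both fixes come down to unbalanced braces inside a record, so a quick pre-check can help locate such records in a large .bib file before indexing. The sketch below is a heuristic, not a BibTeX parser: `BibtexBraceCheck` is a hypothetical helper (not part of Anserini or jbibtex), it assumes every record starts on a line beginning with `@`, and it will not flag records whose braces balance but which fail to parse for other reasons.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the keys of BibTeX records whose braces do not balance. */
final class BibtexBraceCheck {

  static List<String> findUnbalanced(List<String> lines) {
    List<String> suspects = new ArrayList<>();
    String key = null;      // citation key of the record being scanned
    int depth = 0;          // running brace depth inside that record
    boolean broken = false; // set once the depth drops below zero

    for (String line : lines) {
      if (line.startsWith("@")) {
        // A new record begins: report the previous one if it was unbalanced.
        if (key != null && (broken || depth != 0)) {
          suspects.add(key);
        }
        int open = line.indexOf('{');
        int comma = line.indexOf(',', open + 1);
        key = (open >= 0 && comma > open)
            ? line.substring(open + 1, comma).trim()
            : line.trim();
        depth = 0;
        broken = false;
      }
      if (key == null) {
        continue; // text before the first record
      }
      for (int i = 0; i < line.length(); i++) {
        char c = line.charAt(i);
        if (c == '\\') {
          i++; // skip the escaped character, e.g. \{ or \"
        } else if (c == '{') {
          depth++;
        } else if (c == '}') {
          depth--;
          if (depth < 0) {
            broken = true;
          }
        }
      }
    }
    if (key != null && (broken || depth != 0)) {
      suspects.add(key);
    }
    return suspects;
  }

  /** Usage: java BibtexBraceCheck /path/to/file.bib */
  public static void main(String[] args) throws IOException {
    List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
    findUnbalanced(lines).forEach(System.out::println);
  }
}
```

Because the depth is reset at every record, one stray brace is reported only for the record that contains it rather than for everything after it in the file.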

ygorg commented 1 year ago

Maybe indexing the anthology via this method would be easier? anserini/blob/master/docs/acl-anthology.md

lintool commented 1 year ago

I think this issue has been addressed? Given that we have https://github.com/castorini/pyserini/pull/1552 - and there have been no complaints about index failures...

Closing for now.