WashingtonPostCollection does not work properly

Peilin-Yang commented 6 years ago

When I index the WashingtonPost collection v2 with

~/Anserini/target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -input /scratch2/more_collections/WashingtonPost.v2/WashingtonPost.v2/data/ -generator JsoupGenerator -index lucene-index.wash18.pos+docvectors -threads 44 -storePositions -storeDocvectors -optimize &>log.wash18.pos+docvectors

the following error occurs:

java.lang.NullPointerException
        at io.anserini.index.IndexCollection$IndexerThread.run(IndexCollection.java:198) [anserini-0.1.1-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
2018-08-02 10:24:33,729 WARN  [main] index.IndexCollection (IndexCollection.java:365) - Unexpected difference between number of indexed documents and index maxDoc.
2018-08-02 10:24:33,730 INFO  [main] index.IndexCollection (IndexCollection.java:368) - # Final Counter Values
2018-08-02 10:24:33,730 INFO  [main] index.IndexCollection (IndexCollection.java:369) - indexed:                0
2018-08-02 10:24:33,730 INFO  [main] index.IndexCollection (IndexCollection.java:370) - empty:                 84
2018-08-02 10:24:33,730 INFO  [main] index.IndexCollection (IndexCollection.java:371) - unindexed:             84
2018-08-02 10:24:33,730 INFO  [main] index.IndexCollection (IndexCollection.java:372) - unindexable:            0
2018-08-02 10:24:33,731 INFO  [main] index.IndexCollection (IndexCollection.java:373) - skipped:                3
2018-08-02 10:24:33,731 INFO  [main] index.IndexCollection (IndexCollection.java:374) - errors:                 0
2018-08-02 10:24:33,736 INFO  [main] index.IndexCollection (IndexCollection.java:377) - Total 30,745 documents indexed in 00:00:24

The indexer then quits unexpectedly.

Kytabyte commented 6 years ago

@Peilin-Yang I traced this exception to the ContentObj in this line, which may be null.

After handling the null case, the program runs. However, the number of skipped files is much larger than I expected from my original design, which may affect the correctness of our results. I suspect there are other problems in our code. I'll track them down and fix them in the same pull request.

Kytabyte commented 6 years ago

@borislin I got a lot of exceptions while running WashingtonPostCollection; they come from initializing the JSON object here: https://github.com/castorini/Anserini/blob/8de9fc046fcb188957c171190be927a2317d5430/src/main/java/io/anserini/collection/WashingtonPostCollection.java#L102 As a result, a considerable number of files (~80,000 out of 600,000) were skipped. One special case is an empty file, but from what I observed most of the skipped files do have contents.

Some sample errors are listed below. Do you have a clue what's going on?

com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `java.lang.String` out of START_OBJECT token
 at [Source: (String)"{"id": "76f870f2-5829-11e1-a0b0-4cc207a286f0", "article_url": "https://www.washingtonpost.com/lifestyle/style/the-surprising-hard-rock-source-behind-dcs-nobody-bothers-me-tv-jingle/2012/02/15/gIQAFzxdJR_story.html", "title": "The surprising, rock source behind D.C.’s ‘Nobody bothers me’ TV jingle", "author": "Chris Richards", "published_date": 1329485520000, "contents": [{"content": "Music", "mime": "text/plain", "type": "kicker"}, {"content": "The surprising, rock source behind D.C.’s ‘Nobody b"[truncated 13829 chars]; line: 1, column: 2629] (through reference chain: io.anserini.collection.WashingtonPostCollection$Document$WashingtonPostObject["contents"]->java.util.ArrayList[11]->io.anserini.collection.WashingtonPostCollection$Document$WashingtonPostObject$Content["content"])

,

com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `java.lang.String` out of START_ARRAY token
 at [Source: (String)"{"id": "7308c3ce-b563-11e1-9cb1-2a3ee465ab8e", "article_url": "https://www.washingtonpost.com/opinion/pinstripe-empire-the-new-york-yankees-from-before-the-babe-to-after-the-boss-by-marty-appeldamn-yankees-twenty-four-major-league-writers-on-the-worlds-most-loved-and-hated-team-edited-by-rob-fleder-driving-mr-yogi-yogi-berra-ron-guidry-and-baseballs-greatest-gift-b-y-harvey-araton/2012/06/29/gJQAnhqLCW_story.html", "title": "“Pinstripe Empire: The New York Yankees From Before the Babe To After t"[truncated 11203 chars]; line: 1, column: 11398] (through reference chain: io.anserini.collection.WashingtonPostCollection$Document$WashingtonPostObject["contents"]->java.util.ArrayList[23]->io.anserini.collection.WashingtonPostCollection$Document$WashingtonPostObject$Content["content"])

borislin commented 6 years ago

@Kytabyte I think this line is redundant. Remove this and try again?

Kytabyte commented 6 years ago

The cause of this issue is that the content may be not only a String but also a List<String> or a Map<String, String>. I put in a quick-and-dirty fix just to get the indexer working as soon as possible. I will open a PR and we can discuss a neater way to do this.
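For discussion, here is a minimal sketch of one way to normalize such a value. It is not the code from the PR. It assumes the "content" field is bound loosely as a plain Object (so a JSON string arrives as a String, an array as a List, and an object as a Map), and the class name ContentFlattener and its methods are hypothetical names, not part of Anserini.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Anserini. */
final class ContentFlattener {
  private ContentFlattener() {}

  /**
   * Flattens a loosely typed "content" value into one string.
   * Accepts a String, a List, a Map (only its values are used), or null.
   */
  static String flatten(Object content) {
    List<String> parts = new ArrayList<>();
    collect(content, parts);
    return String.join(" ", parts);
  }

  private static void collect(Object value, List<String> out) {
    if (value == null) {
      return; // missing content: nothing to index, and no NullPointerException
    }
    if (value instanceof String) {
      String s = ((String) value).trim();
      if (!s.isEmpty()) {
        out.add(s);
      }
    } else if (value instanceof List) {
      for (Object item : (List<?>) value) {
        collect(item, out); // lists may nest further values
      }
    } else if (value instanceof Map) {
      for (Object item : ((Map<?, ?>) value).values()) {
        collect(item, out); // keys are dropped; this is an assumption
      }
    } else {
      out.add(value.toString()); // numbers, booleans
    }
  }
}
```

With this sketch, ContentFlattener.flatten(null) returns an empty string rather than throwing, so a document with a null or non-string content entry would no longer need to be skipped. Whether map keys should be kept is a design choice left open here.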

Peilin-Yang commented 6 years ago

@Kytabyte I think there are still two problems:

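1. I think we should include the title and fullcaption fields.
2. When concatenating paragraphs we should explicitly set a separator, e.g. \n or a space. The default Lucene Analyzer will not split terms on a period, so the last word of one paragraph and the first word of the next would otherwise be indexed as a single term.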
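The two suggestions above can be sketched roughly as follows. This is only an illustration, not the collection's actual code: the class ArticleText and its build method are hypothetical, and it assumes the title, paragraph texts, and fullcaption have already been extracted as strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Anserini. */
final class ArticleText {
  private ArticleText() {}

  /** Joins title, paragraphs, and caption with an explicit newline separator. */
  static String build(String title, List<String> paragraphs, String fullCaption) {
    List<String> parts = new ArrayList<>();
    addIfPresent(parts, title);
    if (paragraphs != null) {
      for (String p : paragraphs) {
        addIfPresent(parts, p);
      }
    }
    addIfPresent(parts, fullCaption);
    // Without a separator, "First." + "Second." becomes "First.Second.",
    // which an analyzer that does not break on '.' may keep as one token.
    return String.join("\n", parts);
  }

  private static void addIfPresent(List<String> parts, String text) {
    if (text != null && !text.trim().isEmpty()) {
      parts.add(text.trim());
    }
  }
}
```

A newline is used here because it also keeps paragraph boundaries visible in the stored text; a single space would work equally well for tokenization.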

Peilin-Yang commented 6 years ago

Related to #383, you should also update the unit test and make sure it actually works.

Kytabyte commented 6 years ago

> Related to #383, you should also update the unit test and make sure it actually works

@Peilin-Yang Has this issue already been solved?

Peilin-Yang commented 6 years ago

@Kytabyte I just merged the PR, so you should be able to send a PR on top of it.

Peilin-Yang commented 6 years ago

Merged #390

castorini / anserini

WashingtonPostCollection does not work properly #375