Closed Peilin-Yang closed 6 years ago
@Peilin-Yang I found that this exception occurs because the `ContentObj` in this line may be null.
After handling the null case, the program works fine. However, I noticed that the number of skipped files is much larger than I expected from my original design, which may affect the correctness of our results. I suspect there are other problems in our code. I'll address them and include the fixes in the same pull request.
@borislin I got a lot of exceptions while running `WashingtonPostCollection`, which come from initializing the JSON object here: https://github.com/castorini/Anserini/blob/8de9fc046fcb188957c171190be927a2317d5430/src/main/java/io/anserini/collection/WashingtonPostCollection.java#L102
As a result, a considerable number of files (~80,000 out of 600,000) were skipped. One special case is due to an empty file, but from what I observed most of the skipped files do have contents.
Some sample errors are listed below. Do you have a clue what's going on?
com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `java.lang.String` out of START_OBJECT token
at [Source: (String)"{"id": "76f870f2-5829-11e1-a0b0-4cc207a286f0", "article_url": "https://www.washingtonpost.com/lifestyle/style/the-surprising-hard-rock-source-behind-dcs-nobody-bothers-me-tv-jingle/2012/02/15/gIQAFzxdJR_story.html", "title": "The surprising, rock source behind D.C.’s ‘Nobody bothers me’ TV jingle", "author": "Chris Richards", "published_date": 1329485520000, "contents": [{"content": "Music", "mime": "text/plain", "type": "kicker"}, {"content": "The surprising, rock source behind D.C.’s ‘Nobody b"[truncated 13829 chars]; line: 1, column: 2629] (through reference chain: io.anserini.collection.WashingtonPostCollection$Document$WashingtonPostObject["contents"]->java.util.ArrayList[11]->io.anserini.collection.WashingtonPostCollection$Document$WashingtonPostObject$Content["content"])
,
com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `java.lang.String` out of START_ARRAY token
at [Source: (String)"{"id": "7308c3ce-b563-11e1-9cb1-2a3ee465ab8e", "article_url": "https://www.washingtonpost.com/opinion/pinstripe-empire-the-new-york-yankees-from-before-the-babe-to-after-the-boss-by-marty-appeldamn-yankees-twenty-four-major-league-writers-on-the-worlds-most-loved-and-hated-team-edited-by-rob-fleder-driving-mr-yogi-yogi-berra-ron-guidry-and-baseballs-greatest-gift-b-y-harvey-araton/2012/06/29/gJQAnhqLCW_story.html", "title": "“Pinstripe Empire: The New York Yankees From Before the Babe To After t"[truncated 11203 chars]; line: 1, column: 11398] (through reference chain: io.anserini.collection.WashingtonPostCollection$Document$WashingtonPostObject["contents"]->java.util.ArrayList[23]->io.anserini.collection.WashingtonPostCollection$Document$WashingtonPostObject$Content["content"])
The reason for this issue is that `content` may be not only a `String` but also a `List<String>` or a `Map<String, String>`. I implemented a quick-and-dirty workaround just to get the indexer working as soon as possible. I will raise a PR and we can discuss a neater way to do this.
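For discussion, here is a minimal sketch of one possible way to normalize such a field. It is not the workaround from the PR. It assumes the `content` field is declared as `Object` in the POJO, so that Jackson's default untyped binding yields a `String`, a `List`, or a `Map` depending on the JSON value. `ContentFlattener` and `flatten` are hypothetical names, and joining the pieces with a newline is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Anserini) that turns a polymorphic "content" value into text. */
final class ContentFlattener {
  private ContentFlattener() {}

  /**
   * Returns the text held by a value that may be a String, a List, a Map, or null.
   * Nested lists and maps are flattened recursively; empty pieces are dropped.
   */
  static String flatten(Object content) {
    if (content == null) {
      return "";
    }
    if (content instanceof String) {
      return (String) content;
    }
    if (content instanceof List) {
      return joinNonEmpty((List<?>) content);
    }
    if (content instanceof Map) {
      // Only the values are kept; the keys are treated as metadata.
      return joinNonEmpty(new ArrayList<>(((Map<?, ?>) content).values()));
    }
    // Numbers, booleans, and anything else fall back to their string form.
    return String.valueOf(content);
  }

  /** Flattens each item and joins the non-empty results with a newline. */
  private static String joinNonEmpty(List<?> items) {
    List<String> parts = new ArrayList<>();
    for (Object item : items) {
      String text = flatten(item);
      if (!text.isEmpty()) {
        parts.add(text);
      }
    }
    return String.join("\n", parts);
  }
}
```

With this helper, a plain string is returned unchanged, a list such as `List.of("a", "b")` becomes the two strings joined by a newline, and a null value becomes an empty string, so the document no longer has to be skipped because of the field's type.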
@Kytabyte I think there are still two problems:
1. […] `\n` or ` `. The default Lucene Analyzer will not split the terms by `.`.
2. Related to #383, you should also update the unit test and make sure it actually works.
@Peilin-Yang Has this issue already been solved?
@Kytabyte I just merged the PR, so you should be able to send a PR on top of it.
Merged #390
When I run the following command to index the WashingtonPost collection v2:
~/Anserini/target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -input /scratch2/more_collections/WashingtonPost.v2/WashingtonPost.v2/data/ -generator JsoupGenerator -index lucene-index.wash18.pos+docvectors -threads 44 -storePositions -storeDocvectors -optimize &>log.wash18.pos+docvectors
The error occurs:
And the indexer quits unexpectedly.