Closed jakubzitny closed 6 years ago
`ERROR: more than one doc found` is a problem. This must be happening when SourcererCC searches for the tokens of a document (using the document id as the query) in the forward index. Ideally we should never get more than one doc, as the document id for each document should be unique. My guess is that this is happening because we might have assigned the same id to more than one document in the parsing stage, or maybe we indexed one document twice.
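One way to test the duplicate-id guess before re-indexing is to scan the tokenized input for ids that appear on more than one line. The snippet below is only a rough sketch: the class `DuplicateIdCheck`, the method `findDuplicateIds`, and the assumption that each line starts with the document id followed by an `@#@` separator are mine, not taken from the SourcererCC code, so adjust the separator to whatever your tokenizer actually emits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DuplicateIdCheck {
    // ASSUMPTION: the document id and its token list are separated by "@#@".
    static final String ID_SEPARATOR = "@#@";

    /** Returns every id that occurs on more than one line, each reported once. */
    static List<String> findDuplicateIds(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> duplicates = new ArrayList<>();
        for (String line : lines) {
            int sep = line.indexOf(ID_SEPARATOR);
            if (sep < 0) {
                continue; // malformed line: no id to compare, skip it
            }
            String id = line.substring(0, sep).trim();
            int seen = counts.merge(id, 1, Integer::sum);
            if (seen == 2) {
                duplicates.add(id); // add the id the first time it repeats
            }
        }
        return duplicates;
    }
}
```

The lines could come from `java.nio.file.Files.readAllLines` on the tokens file. An empty result would suggest the ids are unique in the input and the duplication happens at indexing time instead, for example by indexing the same file twice.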
About the `ArrayIndexOutOfBoundsException`: let's keep track of these characters. We should remove them during the tokenizing stage.
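As a starting point for removing such characters in the tokenizer, something like the following might work. It is a sketch under assumptions: `TokenSanitizer` and `sanitize` are hypothetical names, and the choice to drop control characters and every kind of Unicode whitespace is a guess at what the problematic characters are, not a confirmed list.

```java
class TokenSanitizer {
    /**
     * Removes ISO control characters and all whitespace (including
     * non-breaking spaces) from a single token. Other characters are kept.
     */
    static String sanitize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        token.codePoints()
             .filter(cp -> !Character.isISOControl(cp)
                        && !Character.isWhitespace(cp)
                        && !Character.isSpaceChar(cp))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

A token that becomes empty after sanitizing should probably be dropped by the caller rather than written to the output.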
I ran into a similar problem. My error message is:
EXCEPTION CAUGHT, invalid line: Bud% @� @ @ @ E%DSDB
@ @ @
index size of GTPM: 66012
Directory: dataset
indexing file : .DS_Store
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at noindex.CloneHelper.deserialise(Unknown Source)
at indexbased.SearchManager.doIndex(Unknown Source)
at indexbased.SearchManager.main(Unknown Source)
I followed every step in the readme. What does this error mean? Is my data wrong?
Please delete the .DS_Store file from the dataset directory.
I'm hoping this solved the problem. If not, please open another issue.
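For anyone who hits this again: a stray file such as .DS_Store can be kept out of indexing by filtering the dataset directory listing. This is a minimal sketch, not how SourcererCC itself lists files; `DatasetFiles`, `isDataFileName`, and `listInputFiles` are hypothetical names, and treating every dot-prefixed name as a non-data file is an assumption.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DatasetFiles {
    /** ASSUMPTION: names starting with "." (e.g. .DS_Store) are never data files. */
    static boolean isDataFileName(String name) {
        return !name.startsWith(".");
    }

    /** Returns the regular files in dir whose names pass isDataFileName, sorted by path. */
    static List<File> listInputFiles(File dir) {
        File[] found = dir.listFiles(f -> f.isFile() && isDataFileName(f.getName()));
        List<File> result = new ArrayList<>();
        if (found != null) { // listFiles returns null when dir is not a readable directory
            Arrays.sort(found);
            result.addAll(Arrays.asList(found));
        }
        return result;
    }
}
```

Calling `DatasetFiles.listInputFiles(new File("dataset"))` would then skip .DS_Store without having to delete it by hand.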
Some tokens from some of the tokenized files seem problematic for SourcererCC.
Here is an example of `stderr` when the indexing fails; the content is, e.g., weird whitespace or chars like ||.
While indexing, there are a lot of `EXCEPTION CAUGHT` messages coming from caught `ArrayIndexOutOfBoundsException`s in CloneHelper.java. I'm not sure if it's a problem or not.
Also, while searching I am getting a lot of `ERROR: more that one doc found. some error here.` messages.
Maybe these problems are caused by some small things in the tokenization process. Do you have any ideas what it might be? For now, I will update the handling of weird whitespace and see how it helps.