Problematic tokens from tokenizer?

jakubzitny commented 8 years ago

Some tokens from some of the tokenized files seem problematic for SourcererCC.

Here is an example of stderr when indexing fails. The offending content is, for example, unusual whitespace or characters like ||:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at noindex.CloneHelper.deserialise(Unknown Source)
    at indexbased.SearchManager.doIndex(Unknown Source)
    at indexbased.SearchManager.main(Unknown Source)

While indexing, there are a lot of EXCEPTION CAUGHT messages coming from ArrayIndexOutOfBoundsExceptions caught in CloneHelper.java. I'm not sure whether that's a problem or not.

Also, while searching I am getting a lot of ERROR: more that one doc found. some error here. messages.

Maybe these problems come from small issues in the tokenization process. Do you have any idea what it might be? For now, I will update the handling of unusual whitespace and see whether it helps.

saini commented 8 years ago

ERROR: more than one doc found is a problem. This must be happening when SourcererCC searches for the tokens of a document (using the document id as the query) in the forward index. Ideally we should never get more than one doc, as the document id for each document should be unique. My guess is that this is happening because we assigned the same id to more than one document in the parsing stage, or maybe we indexed one document twice.
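To check the "same id assigned twice" guess before indexing, something like the following could be run over a tokenized file. This is only a sketch: the function name `find_duplicate_ids`, the `@#@` header separator, and the sample line layout are assumptions made for illustration, not details confirmed in this thread, so adjust them to whatever your tokenizer actually emits.

```python
from collections import Counter


def find_duplicate_ids(lines, header_sep="@#@"):
    """Return the document ids that appear on more than one line.

    Assumption: each line starts with a document id, followed by
    `header_sep`, followed by the token list. Both the separator and the
    layout are hypothetical here.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Everything before the first separator is treated as the id.
        doc_id = line.split(header_sep, 1)[0]
        counts[doc_id] += 1
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


# Made-up sample lines: the id "1,10" is used twice.
sample = [
    "1,10@#@foo@@::@@2,bar@@::@@1",
    "1,11@#@baz@@::@@3",
    "1,10@#@qux@@::@@1",
]
print(find_duplicate_ids(sample))
```

If this reports any ids, either the parsing stage reused an id or the same file was written to the dataset twice.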

About the ArrayIndexOutOfBoundsException, let's keep track of these characters. We should remove them during the tokenizing stage.
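One way to keep track of the offending lines is to scan the tokenized output before handing it to the indexer. The sketch below assumes the exception is triggered by lines that do not split into exactly two parts (a header and a token list) on a separator. That cause, the `@#@` separator, and the name `find_invalid_lines` are all assumptions for illustration, not taken from CloneHelper.java.

```python
def find_invalid_lines(lines, header_sep="@#@"):
    """Return (line_number, text) for lines that look malformed.

    A line is flagged when splitting on `header_sep` (a hypothetical
    separator) does not give exactly two non-empty parts.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        parts = text.split(header_sep)
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            bad.append((number, text))
    return bad


# Made-up sample: line 2 has no separator, line 3 has an empty token list.
sample_lines = [
    "1,10@#@foo@@::@@2,bar@@::@@1\n",
    "Bud1 @ @ @ DSDB\n",
    "1,12@#@\n",
]
for number, text in find_invalid_lines(sample_lines):
    print(f"line {number}: {text!r}")
```

Collecting the flagged lines in one place makes it easier to see which characters the tokenizer should strip.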

Yanming-Yang commented 7 years ago

I ran into a similar problem. My error message is:

EXCEPTION CAUGHT, invalid line: Bud% @� @ @ @ E%DSDB @ @ @
index size of GTPM: 66012
Directory: dataset
indexing file : .DS_Store
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at noindex.CloneHelper.deserialise(Unknown Source)
    at indexbased.SearchManager.doIndex(Unknown Source)
    at indexbased.SearchManager.main(Unknown Source)

I followed every step in the readme. What does this mean? Is my data wrong?

saini commented 7 years ago

Please delete the .DS_Store file from the dataset directory.
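If hidden files keep reappearing in the dataset directory, a small guard can filter them out before indexing. The following is a sketch under assumptions: the helper name `list_dataset_files` is hypothetical, it is not part of SourcererCC, and it simply skips any entry whose name starts with a dot.

```python
import os
import tempfile


def list_dataset_files(directory):
    """Return regular, non-hidden file names in `directory`, sorted."""
    names = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip dotfiles such as .DS_Store, and anything that is not a file.
        if name.startswith(".") or not os.path.isfile(path):
            continue
        names.append(name)
    return names


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in (".DS_Store", "tokens_0.txt"):
        open(os.path.join(tmp, name), "w").close()
    kept = list_dataset_files(tmp)
print(kept)
```

Only the files returned by such a filter would then be passed on for indexing, so a stray .DS_Store is never read as token data.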

pedromartins4 commented 6 years ago

I'm hoping this solved the problem. If not, please open another issue.

Mondego / SourcererCC

Problematic tokens from tokenizer? #6