refactor tokenizer module

at15 / forum-search

crawl, store and index data for later search

MIT License


Closed: at15 closed this issue 8 years ago

at15 commented 8 years ago

[x] use a class instead of concatenating strings with ;,;; directly
[x] move the Jackson deps to the parent pom.xml
[x] use JSON to store the tokenizer result; the result will be a lot bigger, but it's easier to handle
[x] wrap the HanLP token class
[x] a class called DocIndex, which includes the url, term positions, ranks, ~~tokenize result~~
[x] a class called TermIndex(Info), which is the inverted index; it might be better to have it in the indexer module
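
A minimal sketch of how the two classes from the checklist might fit together, in plain Java. Only the names DocIndex and TermIndex come from this issue; the fields, constructors, and the `add`/`lookup` methods are assumptions for illustration. It leaves out ranks, and it takes an already tokenized list instead of calling HanLP.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-document index; the field names are assumptions.
final class DocIndex {
    final String url;
    // term -> positions of that term in the document's token stream
    final Map<String, List<Integer>> termPositions = new LinkedHashMap<>();

    DocIndex(String url, List<String> tokens) {
        this.url = url;
        for (int i = 0; i < tokens.size(); i++) {
            termPositions.computeIfAbsent(tokens.get(i), k -> new ArrayList<>()).add(i);
        }
    }
}

// Hypothetical inverted index: term -> (url -> positions).
final class TermIndex {
    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    // Merge one document's terms into the inverted index.
    void add(DocIndex doc) {
        doc.termPositions.forEach((term, positions) ->
            postings.computeIfAbsent(term, k -> new LinkedHashMap<>()).put(doc.url, positions));
    }

    // Urls containing the term, in the order the documents were added; empty if unknown.
    List<String> lookup(String term) {
        Map<String, List<Integer>> hit = postings.get(term);
        return hit == null ? new ArrayList<>() : new ArrayList<>(hit.keySet());
    }
}
```

Keeping the positions in typed collections rather than a delimiter-joined string means a JSON mapper such as Jackson can serialize the structure without a hand-written parser.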

at15 commented 8 years ago

HanLP already provides a module for tokenizing, but it uses the standard tokenizer for its keyword extractor

at15 commented 8 years ago

it's OK to store the index in a single file for now; next it's time to split it and query against it.

at15 commented 8 years ago

Hmm... have to say, using JSON makes the index file really big: 3 MB -> 70 MB
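
One plausible contributor to that growth is that JSON repeats field names and punctuation for every entry. The sketch below compares the byte size of one term's positions in a compact delimited form and in a JSON form. Both layouts, the class name `IndexSizeDemo`, and the field names `term` and `position` are assumptions for illustration, not the project's actual file format.

```java
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

final class IndexSizeDemo {
    // Compact form: "term;p1,p2,...". This layout is an assumption.
    static String delimited(String term, int[] positions) {
        StringJoiner sj = new StringJoiner(",", term + ";", "");
        for (int p : positions) {
            sj.add(Integer.toString(p));
        }
        return sj.toString();
    }

    // JSON form: one object per occurrence, so the field names repeat each time.
    // Built by hand; assumes the term has no characters that need JSON escaping.
    static String json(String term, int[] positions) {
        StringJoiner sj = new StringJoiner(",", "[", "]");
        for (int p : positions) {
            sj.add("{\"term\":\"" + term + "\",\"position\":" + p + "}");
        }
        return sj.toString();
    }

    // Size of the string once written out as UTF-8.
    static int bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Calling `IndexSizeDemo.bytes(...)` on both encodings of the same positions shows the per-entry overhead. Grouping positions under a single key per term, or using shorter field names, would narrow the gap while staying in JSON.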