akhenakh / gozim

A Go native implementation for ZIM files
MIT License

Full text search on article contents #20

Closed: viric closed 8 years ago

viric commented 8 years ago

Do you have any plan to provide full text search on article contents? It would be a great feature.

Thank you.

akhenakh commented 8 years ago

I was worried the indexes would be extremely big; I wanted something light that could be used everywhere. The Kiwix project provides ZIM files with indexes, but the specs were not public at the time I was writing gozim. We could maybe give it another look.

viric commented 8 years ago

Well, indices may be big depending on the size of the zim you index. I don't want it for the English Wikipedia, definitely. :) But for some wiktionaries, it would be great.

akhenakh commented 8 years ago

Makes sense. It means we need to extract the HTML and then feed it to the indexer. I'll keep the ticket open.

akhenakh commented 8 years ago

Have a look at ea9b7c39cb1d13bd8bf19ba4dc4e2a16bab52f14.

Note that I had to set the batch size very low because of memory consumption:

gozimindex -lang fr -batchsize 50 -content -path ../../wikinews_fr_all_nopic_2015-11.zim -index ../../wikinews_fr_all_nopic_2015-11.idx

The resulting index is 220M for a 20M zim file, but it's working :) It probably needs more optimization.

kelson42 commented 8 years ago

A few comments:

akhenakh commented 8 years ago

Sounds like good news, thanks for the tip.

viric commented 8 years ago

I think you are always thinking of ZIM files of several GB and of indexed full-text search (FTS). It would be nice to also have a slow FTS that just unpacks every article on each search request.

That would be slow... fine. But that's much better (for me, thinking of dictionaries of ~20MB) than having no FTS at all.

For me, having FTS is one thing, and having an indexed FTS (faster but much more complicated) is another. But bleve masters might know how to implement it easily.
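
The index-free approach described here amounts to a linear scan over every article on each query. A minimal sketch, assuming the articles are already decompressed and stripped of markup: `article`, `scan`, and `fromSlice` are hypothetical names, and the iterator stands in for whatever walks the entries of a ZIM file.

```go
package main

import (
	"fmt"
	"strings"
)

// article is a hypothetical stand-in for an entry read from a ZIM file.
type article struct {
	URL  string
	Text string // assumed to be already decompressed and stripped of markup
}

// scan pulls articles from next until it is exhausted or limit hits are found,
// and returns the URLs whose text contains query (case-insensitive).
func scan(next func() (article, bool), query string, limit int) []string {
	q := strings.ToLower(query)
	var hits []string
	for len(hits) < limit {
		a, ok := next()
		if !ok {
			break
		}
		if strings.Contains(strings.ToLower(a.Text), q) {
			hits = append(hits, a.URL)
		}
	}
	return hits
}

// fromSlice adapts a slice to the iterator shape that scan expects.
func fromSlice(as []article) func() (article, bool) {
	i := 0
	return func() (article, bool) {
		if i >= len(as) {
			return article{}, false
		}
		a := as[i]
		i++
		return a, true
	}
}

func main() {
	dict := []article{
		{URL: "A/gat", Text: "Mamífer felí domèstic."},
		{URL: "A/gos", Text: "Mamífer cànid domèstic."},
		{URL: "A/taula", Text: "Moble amb potes."},
	}
	fmt.Println(scan(fromSlice(dict), "domèstic", 10))
}
```

Each query costs time proportional to the total size of the unpacked text, but nothing has to be stored on disk, which is the trade-off being proposed for small dictionaries.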

akhenakh commented 8 years ago

@viric did you try my last commit? It's an indexed FTS. Should work for small zim files.

If you have some difficulties building it, I may create binaries.

viric commented 8 years ago

Ah I missed your comment. I will try. No need to create binaries. The index is really huge, though.

Thank you! (Now I have to find out how to create my own zim files ...)

viric commented 8 years ago

Ok, I tested it! Thank you.

Of course I would like better search :) That is, 1) show context of the match, 2) allow for seeing more results.
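
Both requests can be illustrated independently of any indexer. The sketch below is only one possible shape, with hypothetical names (`snippet`, `page`, `lowerRunes`): it cuts a window of text around the first match and slices a result list into pages. It works on runes rather than bytes so that accented characters, as in Catalan, are never split.

```go
package main

import (
	"fmt"
	"unicode"
)

// lowerRunes lowercases s rune by rune, so the result has the same number
// of runes as the input and positions can be mapped back to the original.
func lowerRunes(s string) []rune {
	rs := []rune(s)
	for i, r := range rs {
		rs[i] = unicode.ToLower(r)
	}
	return rs
}

// snippet returns up to radius runes of context on each side of the first
// case-insensitive occurrence of query in text. ok is false when there is no match.
func snippet(text, query string, radius int) (s string, ok bool) {
	if radius < 0 {
		radius = 0
	}
	orig, low, q := []rune(text), lowerRunes(text), lowerRunes(query)
	if len(q) == 0 {
		return "", false
	}
	for i := 0; i+len(q) <= len(low); i++ {
		if string(low[i:i+len(q)]) != string(q) {
			continue
		}
		start, end := i-radius, i+len(q)+radius
		if start < 0 {
			start = 0
		}
		if end > len(orig) {
			end = len(orig)
		}
		return string(orig[start:end]), true
	}
	return "", false
}

// page returns the n-th page (0-based) of results, with size entries per page.
func page(results []string, n, size int) []string {
	if n < 0 || size <= 0 || n*size >= len(results) {
		return nil
	}
	end := n*size + size
	if end > len(results) {
		end = len(results)
	}
	return results[n*size : end]
}

func main() {
	if s, ok := snippet("El gat negre dorm al sofà.", "NEGRE", 4); ok {
		fmt.Printf("...%s...\n", s)
	}
	fmt.Println(page([]string{"A/gat", "A/gos", "A/taula"}, 1, 2))
}
```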

viric commented 8 years ago

By the way, I packaged gozim for nixpkgs. https://github.com/NixOS/nixpkgs/commit/99077ff6486b1bbb02365290cb94c49799dbb425

akhenakh commented 8 years ago

:+1: for NixOS. Remember this part is not well tested yet, in fact I'm fixing a bug right now.

viric commented 8 years ago

Great. Notice that the content I'm using is in Catalan. So you are welcome to add bleve analysis/language/ca :)

akhenakh commented 8 years ago

Catalan is already supported with the full bleve stemmer via -lang ca. It sometimes truncates proper nouns, which is why I'm using smaller stems for eng and fr, but it should still work as is.