Full text search on article contents

viric commented 8 years ago

Do you have any plan to provide full text search on article contents? It would be a great feature.

Thank you.

akhenakh commented 8 years ago

I was worried the indexes would be extremely big, and I wanted something light that could be used everywhere. The Kiwix project provides ZIM files with indexes, but the specs were not public at the time I was writing gozim. We could maybe take another look.

viric commented 8 years ago

Well, indices may be big depending on the size of the zim you index. I don't want it for the English Wikipedia, definitely. :) But for some wiktionaries, it would be great.

akhenakh commented 8 years ago

Makes sense. It means we need to extract the HTML and then feed it to the indexer. I'll keep the ticket open.
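
To make the "extract the HTML, then feed it to the indexer" step concrete, here is a minimal sketch using only the Go standard library. It is not gozim's code (gozim uses bleve for indexing): `stripTags`, `tokenize` and the `Index` type are hypothetical names made up for this illustration, and the tag stripper is deliberately crude.

```go
package main

import (
	"fmt"
	"html"
	"sort"
	"strings"
	"unicode"
)

// stripTags drops everything between '<' and '>' and decodes HTML entities.
// Crude on purpose: it does not special-case <script>, comments,
// or a '>' inside an attribute value.
func stripTags(page string) string {
	var b strings.Builder
	inTag := false
	for _, r := range page {
		switch {
		case r == '<':
			inTag = true
		case r == '>':
			inTag = false
			b.WriteRune(' ') // keep words on both sides of a tag apart
		case !inTag:
			b.WriteRune(r)
		}
	}
	return html.UnescapeString(b.String())
}

// tokenize lowercases the text and splits on anything that is not a letter or digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Index is a toy inverted index: term -> set of article URLs (hypothetical type).
type Index map[string]map[string]struct{}

// Add extracts the text of an HTML page and records every term under url.
func (idx Index) Add(url, page string) {
	for _, term := range tokenize(stripTags(page)) {
		if idx[term] == nil {
			idx[term] = make(map[string]struct{})
		}
		idx[term][url] = struct{}{}
	}
}

// Search returns the sorted URLs of the articles containing term.
func (idx Index) Search(term string) []string {
	var urls []string
	for u := range idx[strings.ToLower(term)] {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	return urls
}

func main() {
	idx := Index{}
	idx.Add("A/gat", "<p>El <b>gat</b> &eacute;s un animal.</p>")
	idx.Add("A/gos", "<p>El gos no &eacute;s un gat.</p>")
	fmt.Println(idx.Search("gat"))
}
```

A real indexer would add stemming, scoring and on-disk storage, which is what bleve takes care of.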

akhenakh commented 8 years ago

Take a look at ea9b7c39cb1d13bd8bf19ba4dc4e2a16bab52f14.

Note that I had to set the batch size very low because of memory consumption:

gozimindex -lang fr -batchsize 50  -content  -path ../../wikinews_fr_all_nopic_2015-11.zim -index ../../wikinews_fr_all_nopic_2015-11.idx

The resulting index is 220M for a 20M zim file, but it's working :) It probably needs more optimization.
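
For readers wondering why the batch size matters for memory: documents are typically buffered and handed to the index in groups, so a smaller batch means fewer documents held in memory at once, at the cost of more flushes. The sketch below only shows that general pattern. `Batcher` and its `flush` callback are hypothetical names for this illustration, not gozimindex or bleve internals.

```go
package main

import "fmt"

// Batcher buffers documents and hands them to flush in groups of at most size.
// Hypothetical type: it only illustrates the trade-off behind a batch-size flag.
type Batcher struct {
	size  int
	buf   []string
	flush func(docs []string) // must not retain docs: the buffer is reused
}

// Add buffers one document and flushes as soon as the batch is full.
func (b *Batcher) Add(doc string) {
	b.buf = append(b.buf, doc)
	if len(b.buf) >= b.size {
		b.Flush()
	}
}

// Flush hands any buffered documents to the callback and empties the buffer.
func (b *Batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = b.buf[:0]
}

func main() {
	b := &Batcher{size: 2, flush: func(docs []string) {
		fmt.Println("indexing batch of", len(docs))
	}}
	for _, d := range []string{"a", "b", "c", "d", "e"} {
		b.Add(d)
	}
	b.Flush() // do not forget the last, partially filled batch
}
```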

kelson42 commented 8 years ago

A few comments:

At the very beginning of the openZIM project, a few tests were done to build a full-text index, without much success. The conclusion was that this should be done using an existing solution such as Lucene or Xapian.
By default, ZIM files do not provide a full-text index, but Kiwix uses an additional full-text index built on top of ZIM using Xapian.
In 2016 we will probably offer ZIM files with the Xapian full-text index integrated in the ZIM files themselves, so it will be possible to offer full-text search by decoding this index/blob.

akhenakh commented 8 years ago

Sounds like good news, thanks for the tip.

viric commented 8 years ago

I think you always have in mind ZIM files of several GB and indexed full-text search (FTS). It would be nice to have a slow FTS solution that just unpacks every article on each search request.

That would be slow... fine. But that's much better (for me, who thinks in dictionaries of ~20MB) than having no FTS at all.

For me, having FTS is one thing, and having an indexed FTS (faster but much more complicated) is another. But bleve masters might know how to implement it easily.
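
A minimal sketch of the index-free "slow FTS" described here, assuming the articles have already been unpacked into memory. `Article` and `scanSearch` are hypothetical names; a real version would decompress each ZIM entry while scanning instead of holding a slice.

```go
package main

import (
	"fmt"
	"strings"
)

// Article is a hypothetical stand-in for one unpacked ZIM entry.
type Article struct {
	Title string
	Body  string
}

// scanSearch visits every article and returns the titles whose body contains
// query, ignoring case. It stops after limit hits; limit <= 0 means no limit.
// Cost is linear in the total text size on every request: no index is kept.
func scanSearch(articles []Article, query string, limit int) []string {
	q := strings.ToLower(query)
	if q == "" {
		return nil // an empty query would otherwise match every article
	}
	var hits []string
	for _, a := range articles {
		if strings.Contains(strings.ToLower(a.Body), q) {
			hits = append(hits, a.Title)
			if limit > 0 && len(hits) >= limit {
				break
			}
		}
	}
	return hits
}

func main() {
	articles := []Article{
		{Title: "gat", Body: "Small domestic feline."},
		{Title: "gos", Body: "Domestic canine."},
		{Title: "casa", Body: "A building to live in."},
	}
	fmt.Println(scanSearch(articles, "domestic", 0))
}
```

For a dictionary of about 20MB a full scan per query may well be acceptable. The cost grows linearly with the size of the ZIM file, which is why the multi-GB case needs an index.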

akhenakh commented 8 years ago

@viric did you try my last commit? It's an indexed FTS. Should work for small zim files.

If you have difficulty building it, I can create binaries.

viric commented 8 years ago

Ah I missed your comment. I will try. No need to create binaries. The index is really huge, though.

Thank you! (Now I have to find out how to create my own zim files ...)

viric commented 8 years ago

Ok, I tested it! Thank you.

Of course I would like better search :) That is, 1) show context of the match, 2) allow for seeing more results.
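
As an illustration of request 1), showing the context of a match, here is a small standalone sketch that returns the text around the first hit. `snippet` is a hypothetical helper, not something gozim exposes. It works on runes rather than bytes so that accented characters (as in Catalan) are not cut in half.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// snippet returns up to radius runes of context on each side of the first
// case-insensitive occurrence of query in text. The boolean reports whether
// a match was found.
func snippet(text, query string, radius int) (string, bool) {
	t := []rune(text)
	q := []rune(strings.ToLower(query))
	if len(q) == 0 || len(q) > len(t) {
		return "", false
	}
	for i := 0; i+len(q) <= len(t); i++ {
		match := true
		for j := range q {
			if unicode.ToLower(t[i+j]) != q[j] {
				match = false
				break
			}
		}
		if !match {
			continue
		}
		// Clamp the context window to the bounds of the text.
		start := i - radius
		if start < 0 {
			start = 0
		}
		end := i + len(q) + radius
		if end > len(t) {
			end = len(t)
		}
		return string(t[start:end]), true
	}
	return "", false
}

func main() {
	s, ok := snippet("El gat és un animal domèstic.", "animal", 7)
	fmt.Println(ok, s)
}
```

Request 2), seeing more results, would then be a matter of paging through the hit list instead of stopping at the first page.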

viric commented 8 years ago

By the way, I packaged gozim for nixpkgs. https://github.com/NixOS/nixpkgs/commit/99077ff6486b1bbb02365290cb94c49799dbb425

akhenakh commented 8 years ago

:+1: for NixOS. Remember this part is not well tested yet, in fact I'm fixing a bug right now.

viric commented 8 years ago

Great. Notice that the content I'm using is in Catalan. So you are welcome to add bleve analysis/language/ca :)

akhenakh commented 8 years ago

Catalan is already supported with the full bleve stemmer (-lang ca), which sometimes truncates proper nouns. That's why I'm using smaller stems for eng and fr, but it should still work as is.

akhenakh / gozim

Full text search on article contents #20