Closed by viric 8 years ago
I was worried the indexes would be extremely big, and I wanted something light that could be used everywhere. The kiwix project provides zim files with indexes, but the specs were not public at the time I was writing gozim. We could maybe take another look.
Well, indices may be big depending on the size of the zim you index. I don't want it for the English Wikipedia, definitely. :) But for some wiktionaries, it would be great.
Makes sense. It means we need to extract the HTML and then feed it to the indexer. I'll keep the ticket open.
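To make the "extract the HTML, then feed it to the indexer" step concrete, here is a minimal sketch using only the Go standard library. The helper name htmlToText is hypothetical and not part of gozim. It is a naive tag stripper: it does not drop script or style contents, and a tag in the middle of a word splits that word, so a real HTML parser would be a better choice in practice.

```go
package main

import (
	"html"
	"strings"
)

// htmlToText is a hypothetical helper (not gozim's API) that turns an
// article's HTML into plain text suitable for a full-text indexer.
func htmlToText(page string) string {
	var b strings.Builder
	inTag := false
	for _, r := range page {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
			b.WriteRune(' ') // keep the words on either side of a tag apart
		case !inTag:
			b.WriteRune(r)
		}
	}
	// Decode entities such as &amp; and collapse runs of whitespace.
	return strings.Join(strings.Fields(html.UnescapeString(b.String())), " ")
}
```

The returned string is what would be handed to the indexer in place of the raw page.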
Take a look at ea9b7c39cb1d13bd8bf19ba4dc4e2a16bab52f14.
Note that I had to decrease the batch size to a very low value because of memory consumption:
gozimindex -lang fr -batchsize 50 -content -path ../../wikinews_fr_all_nopic_2015-11.zim -index ../../wikinews_fr_all_nopic_2015-11.idx
The resulting index is 220M for a 20M zim file, but it's working :) It probably needs more optimization here.
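For readers wondering why a smaller batch size lowers memory use, here is a rough sketch of the general batching pattern. Only one batch of documents is handed to the indexer at a time, so peak memory scales with the batch size rather than with the whole archive. The function inBatches and its flush callback are hypothetical, and this is not the code from the commit above.

```go
package main

// inBatches hands docs to flush in slices of at most size elements.
// Hypothetical sketch: flush stands in for "write this batch to the index".
func inBatches(docs []string, size int, flush func(batch []string) error) error {
	if size < 1 {
		size = 1 // guard against a zero or negative batch size
	}
	for start := 0; start < len(docs); start += size {
		end := start + size
		if end > len(docs) {
			end = len(docs)
		}
		if err := flush(docs[start:end]); err != nil {
			return err
		}
	}
	return nil
}
```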
A few comments:
Sounds like good news, thanks for the tip.
I think you always have ZIM files of several GB in mind, with indexed full-text search (FTS). It would be nice to also have a slow FTS that just unpacks every article on each search request.
That would be slow... fine. But that's much better (for me, who thinks in dictionaries of ~20MB) than having no FTS at all.
For me, having FTS is one thing, and having an indexed FTS (faster but much more complicated) is another. But the bleve masters might know how to implement it easily.
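As an illustration of the slow, index-free approach, here is a minimal sketch that scans every article on each query. The Article type and slowSearch function are hypothetical. A real version would read and decompress articles from the zim file instead of taking an in-memory slice.

```go
package main

import "strings"

// Article is a hypothetical stand-in for an unpacked zim article.
type Article struct {
	Title string
	Text  string
}

// slowSearch returns the titles of all articles whose text contains
// query, ignoring case. Cost is linear in the total size of the
// articles, which is acceptable for archives of a few tens of MB.
func slowSearch(articles []Article, query string) []string {
	var hits []string
	q := strings.ToLower(query)
	if q == "" {
		return hits
	}
	for _, a := range articles {
		if strings.Contains(strings.ToLower(a.Text), q) {
			hits = append(hits, a.Title)
		}
	}
	return hits
}
```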
@viric did you try my last commit? It's an indexed FTS. Should work for small zim files.
If you have some difficulties building it, I may create binaries.
Ah I missed your comment. I will try. No need to create binaries. The index is really huge, though.
Thank you! (Now I have to find out how to create my own zim files ...)
Ok, I tested it! Thank you.
Of course I would like better search :) That is, 1) show context of the match, 2) allow for seeing more results.
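For point 1, showing the context of a match, a small sketch could look like the following. The snippet function and its helpers are hypothetical and not part of gozim or bleve. It works on runes so that accented characters (as in Catalan text) are not cut in half.

```go
package main

import "unicode"

// lowerRunes returns s as a slice of lower-cased runes.
func lowerRunes(s string) []rune {
	rs := []rune(s)
	for i, r := range rs {
		rs[i] = unicode.ToLower(r)
	}
	return rs
}

// indexRunes returns the position of the first occurrence of needle
// in hay, or -1 if it does not occur or needle is empty.
func indexRunes(hay, needle []rune) int {
	if len(needle) == 0 {
		return -1
	}
	for i := 0; i+len(needle) <= len(hay); i++ {
		match := true
		for j := range needle {
			if hay[i+j] != needle[j] {
				match = false
				break
			}
		}
		if match {
			return i
		}
	}
	return -1
}

// snippet returns the first case-insensitive match of query in text,
// with up to radius runes of context on each side, or "" if no match.
func snippet(text, query string, radius int) string {
	orig := []rune(text)
	q := lowerRunes(query)
	pos := indexRunes(lowerRunes(text), q)
	if pos < 0 {
		return ""
	}
	start := pos - radius
	if start < 0 {
		start = 0
	}
	end := pos + len(q) + radius
	if end > len(orig) {
		end = len(orig)
	}
	return string(orig[start:end])
}
```

Point 2, seeing more results, would be a matter of returning results in pages (an offset and a limit) rather than a fixed-size list.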
By the way, I packaged gozim for nixpkgs. https://github.com/NixOS/nixpkgs/commit/99077ff6486b1bbb02365290cb94c49799dbb425
:+1: for NixOS. Remember this part is not well tested yet; in fact, I'm fixing a bug right now.
Great. Notice that the content I'm using is in Catalan. So you are welcome to add bleve analysis/language/ca :)
Catalan is already supported with the full bleve stemmer (-lang ca), which sometimes truncates proper nouns. That's why I'm using smaller stems for eng and fr, but it should still work as is.
Do you have any plan to provide full text search on article contents? It would be a great feature.
Thank you.