cliqz-oss / keyvi

Keyvi - a key value index that powers Cliqz search engine. It is an in-memory FST-based data structure highly optimized for size and lookup performance.
https://cliqz.com
Apache License 2.0

Document parameters - what and how - for keyvi command line and python API #213

Open netankit opened 7 years ago

netankit commented 7 years ago

Currently, we use the keyvi compiler option "floating_point_precision" for word embeddings in the sharding/compiling step. It would be nice to be able to pass this option through the keyvi command line / python API for any keyvi file whose values are vectors.

Ex: keyvi_compiler_options = {"minimization": "off", "floating_point_precision": "single"}

This would help reduce the size of massive keyvi files composed of vector values (e.g. document vectors: ~2.1B vectors of ~300 dimensions each). I haven't been able to figure out how to use this feature. A standalone example with documentation would be useful. @hendrikmuhs
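To put the size concern in numbers, here is a rough back-of-the-envelope sketch. It assumes the alternative to "single" is an 8-byte double and counts only the raw float payload; how keyvi actually encodes values on disk is not described in this thread, and `raw_vector_bytes` is a hypothetical helper, not part of keyvi.

```python
import struct


def raw_vector_bytes(num_vectors, dimensions, precision):
    """Raw payload size in bytes for fixed-width floats (no keys, no compression)."""
    # Assumption: "single" maps to a 4-byte IEEE float, "double" to an 8-byte one.
    fmt = {"single": "f", "double": "d"}[precision]
    bytes_per_value = struct.calcsize("<" + fmt)
    return num_vectors * dimensions * bytes_per_value


# Figures from the comment above: ~2.1B vectors, ~300 dimensions each.
single = raw_vector_bytes(2_100_000_000, 300, "single")
double = raw_vector_bytes(2_100_000_000, 300, "double")

print("single: %.2f TB" % (single / 1e12))
print("double: %.2f TB" % (double / 1e12))
```

Under these assumptions the raw payload halves when going from double to single precision, before any compression is applied.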

hendrikmuhs commented 7 years ago

Hey @netankit

you are right, all the config options lack documentation (and over time they became quite a few).

This is how you do it on the cmdline:

keyvicompiler -i float.txt -o float.kv_s -d json -V floating_point_precision=single

(Since you are talking about size, note that you can add compression as well: keyvicompiler -i float.txt -o float.kv_s -d json -V floating_point_precision=single -V compression=zlib )
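If the compiler is driven from a script, the repeated -V key=value pairs can be generated from a dictionary. The following is only a sketch: `build_keyvicompiler_args` is a hypothetical helper, and it uses nothing beyond the flags shown in the commands above (-i, -o, -d, -V).

```python
import shlex


def build_keyvicompiler_args(input_path, output_path, dict_type, params):
    """Build an argument list for keyvicompiler, one -V flag per parameter."""
    args = ["keyvicompiler", "-i", input_path, "-o", output_path, "-d", dict_type]
    for key, value in params.items():
        args += ["-V", "%s=%s" % (key, value)]
    return args


args = build_keyvicompiler_args(
    "float.txt",
    "float.kv_s",
    "json",
    {"floating_point_precision": "single", "compression": "zlib"},
)

# Print a shell-safe command line; the list itself could be passed to subprocess.run.
print(" ".join(shlex.quote(a) for a in args))
```

Keeping the options in one dictionary means the same object can be reused for the python API, which takes the parameters as a dictionary as well.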

On the python side you pass it as a dictionary:

See https://github.com/cliqz-oss/keyvi/blob/master/pykeyvi/tests/json/json_dictionary_test.py#L65

cs = pykeyvi.JsonDictionaryCompiler(50000000, {'floating_point_precision': 'single'})

The first parameter is the memory limit; it has to be given so that the parameter dictionary can be passed as the second argument.

As on the command line, compression can be added to the dictionary with e.g. 'compression': 'zlib'.

hendrikmuhs commented 7 years ago

Note: The parameter parsing will change in 0.2 to make it more consistent. The memory limit, which is currently a separate parameter, will move into the parameter dictionary, so that all configuration is given by a python dictionary, or by a std::map<string, string> on the C++ side.

Changing title and label.

netankit commented 7 years ago

@hendrikmuhs Thanks for the detailed reply. I will use this for the time being. So, from v0.2 on, are keyvicompiler and keyviinspector going to be removed completely in favor of keyvi compile/dump?

hendrikmuhs commented 7 years ago

ah, got it. It seems the keyvi cli tool does not support parameters yet. Good point, we should add it.

What I meant by the 0.2 change is moving memory_limit into the parameters, so the python call would look like:

cs = pykeyvi.JsonDictionaryCompiler({'floating_point_precision': 'single', 'memory_limit_mb': '50'})
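For code that has to move from the current two-argument form to the planned single-dictionary form, a small conversion helper could look like the sketch below. `to_v02_params` is hypothetical, and the unit conversion is an assumption: it treats the current first argument as bytes and 1 MB as 1,000,000 bytes, which matches the 50000000 and '50' shown in this thread but is not confirmed by it.

```python
def to_v02_params(memory_limit_bytes, params):
    """Fold the separate memory limit into the parameter dictionary."""
    merged = dict(params)  # copy, so the caller's dictionary is left untouched
    # Assumption: the old limit is in bytes and 1 MB == 1,000,000 bytes.
    # Values are strings, as in the call shown above.
    merged["memory_limit_mb"] = str(memory_limit_bytes // 1000000)
    return merged


old_style_params = {"floating_point_precision": "single"}
new_style_params = to_v02_params(50000000, old_style_params)
print(new_style_params)
```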

There are no removal plans for keyvicompiler and/or keyviinspector. The keyvi cli (based on python) is just an alternative to the native tools. Use whatever you like.

The idea behind the keyvi cli is faster development: it is much easier to implement something in python + pykeyvi than to write it in the C++ app. That means we will probably implement new features in the keyvi cli only. But we will see.

hendrikmuhs commented 7 years ago

opened #214