Open netankit opened 7 years ago
Hey @netankit
you are right, the config options lack documentation (and over time they have become quite a few).
This is how you do it on the cmdline:
keyvicompiler -i float.txt -o float.kv_s -d json -V floating_point_precision=single
(Note that, since you are asking about size, you can add compression as well: keyvicompiler -i float.txt -o float.kv_s -d json -V floating_point_precision=single -V compression=zlib )
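If you drive keyvicompiler from a script, the repeated -V key=value pattern above can be generated from a dictionary. A minimal sketch, assuming only the flags shown in the commands above (-i, -o, -d, -V); the helper name build_keyvicompiler_command is made up for illustration and is not part of keyvi:

```python
import shlex


def build_keyvicompiler_command(input_path, output_path, dictionary_type, options):
    """Return an argv list for keyvicompiler with one '-V key=value' per option.

    Hypothetical helper: it only mirrors the flags shown in this thread.
    """
    command = ["keyvicompiler", "-i", input_path, "-o", output_path, "-d", dictionary_type]
    for key, value in options.items():
        # Each compiler option is passed as its own -V argument.
        command += ["-V", f"{key}={value}"]
    return command


command = build_keyvicompiler_command(
    "float.txt",
    "float.kv_s",
    "json",
    {"floating_point_precision": "single", "compression": "zlib"},
)
# Print the command as a shell-safe string; pass the list to subprocess.run to execute it.
print(shlex.join(command))
```

The list form can be handed to subprocess.run(command, check=True) directly, which avoids shell quoting issues.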
On the python side you pass it as a dictionary:
See https://github.com/cliqz-oss/keyvi/blob/master/pykeyvi/tests/json/json_dictionary_test.py#L65
cs = pykeyvi.JsonDictionaryCompiler(50000000, {'floating_point_precision': 'single'})
The first parameter is the memory limit, which has to be given in order to pass the parameter dictionary as the 2nd argument.
As with the command line above, compression can be added with e.g. 'compression': 'zlib'.
Note: Parameter parsing will change in 0.2 to make it more consistent. The memory limit, which is currently a separate argument, will move into the parameter dictionary, so that all configuration is given by a Python dictionary, or a std::map<string, string> on the C++ side.
Changing title and label.
@hendrikmuhs Thanks for the detailed reply. I will use this for the time being. So, from v0.2 are keyvicompiler and keyviinspector completely going to be removed in favor of keyvi compile/dump?
Ah, got it. It seems the keyvi cli tool does not support parameters yet. Good point, we should add that.
What I meant with 0.2 is moving memory_limit into the parameters, so the python call would look like:
cs = pykeyvi.JsonDictionaryCompiler({'floating_point_precision': 'single', 'memory_limit_mb': '50'})
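For code that has to survive the switch, the two call styles can be bridged with a small conversion. A rough sketch, not part of pykeyvi: the helper name to_single_dict_params is hypothetical, and it assumes the current first argument is in bytes and that 'memory_limit_mb' counts decimal megabytes (which is what pairing 50000000 with '50' in this thread suggests, but is not confirmed):

```python
def to_single_dict_params(memory_limit_bytes, params):
    """Merge the old separate memory limit into the parameter dictionary.

    Hypothetical helper. Assumes the old limit is given in bytes and that
    'memory_limit_mb' means decimal megabytes (10**6 bytes).
    """
    # All values are passed as strings, matching the examples in this thread.
    merged = {key: str(value) for key, value in params.items()}
    merged["memory_limit_mb"] = str(memory_limit_bytes // 1_000_000)
    return merged


old_style_args = (50000000, {"floating_point_precision": "single"})
new_style_params = to_single_dict_params(*old_style_args)
print(new_style_params)
```

The input dictionary is copied rather than modified, so the old-style arguments stay usable until the upgrade is done.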
There are no removal plans for keyvicompiler and/or keyviinspector. The keyvi cli (based on python) is just an alternative to the native tools. Use whatever you like.
The idea behind the keyvi cli is faster development: it is much easier to implement something in Python with pykeyvi than in the C++ app. That means we will probably implement new features in the keyvi cli only, but we will see.
opened #214
Currently, we use the keyvi compiler option "floating_point_precision" for word embeddings in the sharding/compiling step. It would be nice to be able to pass this option through the command line / Python API of keyvi for any keyvi file whose values are vectors.
Ex: keyvi_compiler_options = {"minimization": "off", "floating_point_precision": "single"}
This will be helpful in reducing the size of massive keyvi files composed of vector values (e.g. document vectors: ~2.1B vectors, ~300 dimensions). I haven't been able to figure out how to use this feature. A standalone example with documentation would be useful. @hendrikmuhs
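To put rough numbers on the expected saving, here is a back-of-envelope sketch. It only counts the raw IEEE 754 payload (4 bytes per single-precision value, 8 per double) and ignores whatever per-value overhead keyvi's serialization adds and whatever compression removes, so treat the results as an estimate, not a measurement of actual .kv file sizes. The function name raw_vector_bytes is made up for illustration:

```python
import struct

# Raw payload size of one float in each precision (4 and 8 bytes).
BYTES_PER_VALUE = {
    "single": struct.calcsize("<f"),
    "double": struct.calcsize("<d"),
}


def raw_vector_bytes(vector_count, dimensions, precision):
    """Estimate the raw float payload in bytes, ignoring all storage overhead."""
    return vector_count * dimensions * BYTES_PER_VALUE[precision]


vector_count = 2_100_000_000  # ~2.1B vectors, as in the example above
dimensions = 300

for precision in ("double", "single"):
    total = raw_vector_bytes(vector_count, dimensions, precision)
    print(f"{precision}: {total / 1e12:.2f} TB raw float payload")

# The price of single precision: values are rounded to the nearest 32-bit float.
round_tripped = struct.unpack("<f", struct.pack("<f", 0.1))[0]
print(f"0.1 stored as single precision reads back as {round_tripped!r}")
```

Halving the payload comes at the cost of roughly 7 significant decimal digits per value instead of roughly 16, which is usually acceptable for embeddings.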