Closed PonteIneptique closed 5 years ago
Ok, the numbers are crushing it...
Loading with json
2019-04-12 11:44:50,062 : Model latest-lasla-lat-lemma-2019_02_07-10_58_52-json.tar was serialized with a previous version of
pie
. This might result in issues. Model commit is bb0b727, whereas currentpie
commit is 78d9cf4. 2019-04-12 11:44:50,344 : * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Loading with yaml
2019-04-12 11:35:47,830 : Model latest-lasla-lat-lemma-2019_02_07-10_58_52-json.tar was serialized with a previous version of
pie
. This might result in issues. Model commit is bb0b727, whereas currentpie
commit is 78d9cf4. 2019-04-12 11:36:03,046 : * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Script used for convertion:
import gzip
import json
import yaml
import tarfile
import shutil
from pie import utils
#with open("/home/thibault/dev/pie/latest-lasla-lat-lemma-2019_02_07-10_58_52.tar", "rb") as f:
# print(gzip.open(f).read().decode().strip())
fpath = "/home/thibault/dev/pie/latest-lasla-lat-lemma-2019_02_07-10_58_52.tar"
target = fpath.replace(".tar", "-json.tar")
label_encoder_path = 'label_encoder.zip'
def removeFile(src, tgt, remove):
"""Remove nameToDelete from tarfile filename."""
original = tarfile.open(src)
modified = tarfile.open(tgt, 'w')
for info in original.getmembers():
if info.name == remove:
continue
extracted = original.extractfile(info)
if not extracted:
continue
modified.addfile(info, extracted)
original.close()
modified.close()
removeFile(fpath, target, label_encoder_path)
with tarfile.open(utils.ensure_ext(fpath, 'tar'), 'r') as tar:
data = yaml.load(gzip.open(tar.extractfile(label_encoder_path)).read().decode().strip())
del tar
shutil.copy(fpath, target)
with tarfile.open(utils.ensure_ext(target, 'tar'), 'a') as tar:
utils.add_gzip_to_tar(json.dumps(data), label_encoder_path, tar)
Hey, pyyaml is unfortunately a bit slow by default. I am speeding it up with the optimized C-parser. But for that to work you need to have installed the yaml-cpp bindings in your machine: https://stackoverflow.com/a/27744056
that should solve the problem
btw. reason for using yaml is just better type support. Yaml can even serialized numpy data and all.
Given the graph in the second response ( https://stackoverflow.com/revisions/50685946/2 ), and my previous results, is there a reason for yaml here ?
@emanjavacas I understand now in general, but in the context of label encoder (which from what I see is a plain set of dictionaries and list), is there a case here ?
I guess there must've been one. If it bothers you a lot, you can try to change it, although that'd be a breaking change wrt backwards compatibility...
Well, technically not, because you do a try except on loading data ;)
See
;)
But do not worry about this one. At the moment, I convert it, it does not need to update pie because of this try/except. If you think we can, then we do it ;)
Ah ye, I remember I moved it from json to yaml (that's why the try block is there, for backwards compatibility). I don't remember why, so there might not be a reason. But since fast yaml is needed anyway for fast setting serialization, I guess it really wouldn't matter performance-wise. So, basically, it shouldn't matter if you use json or yaml in that case.
So, what do you think, here are the options I can think of :
Anyway, I'll follow you on that ;)
mmh, sorry are you talking about the speed up for yaml? or just changing the format for label encoder?
if the first, it'd be cool to automate the installation of the c libraries in setup.py, but I doubt a cross-platform solution for that would be easy... if the second, feel free to submit a PR to change label encoder to json (try running the tests first, there might still be some issue with that though that I cannot remember now).
we can close this, I believe!
I believe it is !
On our web API, ever since I trained the Latin model, I have been bugged by the loading time. For long, I thought it depended on the parameters mostly but after profiling last week, I discovered that yaml was immensely responsible for that.
This morning, I compared the loading time on the "label_encoder.json" file :
json
uses4.57s
for 100 iterations with code 1yaml
uses1350.5313s
for 100 iterations with code 2I am gonna dig a little more but it seems that loading with JSON might be the best thing to do in terms of performances.
Convert code
Code 1
Code 2