emanjavacas / pie

A fully-fledge PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.
MIT License
22 stars 10 forks source link

Yaml loading time evaluation #24

Closed PonteIneptique closed 5 years ago

PonteIneptique commented 5 years ago

On our web API, ever since I trained the Latin model, I have been bugged by the loading time. For long, I thought it depended on the parameters mostly but after profiling last week, I discovered that yaml was immensely responsible for that.

This morning, I compared the loading time on the "label_encoder.json" file :

I am gonna dig a little more but it seems that loading with JSON might be the best thing to do in terms of performances.

Convert code

import gzip

with open("label_encoder.zip", "rb") as f:
    print(gzip.open(f).read().decode().strip())

import yaml

with open("label_encoder.yaml") as f:
    data = yaml.load(f)

import json

with open("label_encoder.json", "w") as f:
    json.dump(data, f)

Code 1

import json
def test():
    with open("label_encoder.json") as f:
        data = json.load(f)

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("test()", setup="from __main__ import test", number=100))

Code 2

import yaml

def test():
    with open("label_encoder.yaml") as f:
        data = yaml.load(f)

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("test()", setup="from __main__ import test", number=100))
PonteIneptique commented 5 years ago

Ok, the numbers are crushing it...

Loading with json

2019-04-12 11:44:50,062 : Model latest-lasla-lat-lemma-2019_02_07-10_58_52-json.tar was serialized with a previous version of pie. This might result in issues. Model commit is bb0b727, whereas current pie commit is 78d9cf4. 2019-04-12 11:44:50,344 : * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Loading with yaml

2019-04-12 11:35:47,830 : Model latest-lasla-lat-lemma-2019_02_07-10_58_52-json.tar was serialized with a previous version of pie. This might result in issues. Model commit is bb0b727, whereas current pie commit is 78d9cf4. 2019-04-12 11:36:03,046 : * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Script used for convertion:

import gzip
import json
import yaml
import tarfile
import shutil
from pie import utils

#with open("/home/thibault/dev/pie/latest-lasla-lat-lemma-2019_02_07-10_58_52.tar", "rb") as f:
#   print(gzip.open(f).read().decode().strip())

fpath = "/home/thibault/dev/pie/latest-lasla-lat-lemma-2019_02_07-10_58_52.tar"
target = fpath.replace(".tar", "-json.tar")
label_encoder_path = 'label_encoder.zip'

def removeFile(src, tgt, remove):
    """Remove nameToDelete from tarfile filename."""
    original = tarfile.open(src)
    modified = tarfile.open(tgt, 'w')
    for info in original.getmembers():
        if info.name == remove:
            continue
        extracted = original.extractfile(info)
        if not extracted:
            continue
        modified.addfile(info, extracted)
    original.close()
    modified.close()

removeFile(fpath, target, label_encoder_path)

with tarfile.open(utils.ensure_ext(fpath, 'tar'), 'r') as tar:
    data = yaml.load(gzip.open(tar.extractfile(label_encoder_path)).read().decode().strip())
del tar

shutil.copy(fpath, target)
with tarfile.open(utils.ensure_ext(target, 'tar'), 'a') as tar:
    utils.add_gzip_to_tar(json.dumps(data), label_encoder_path, tar)
emanjavacas commented 5 years ago

Hey, pyyaml is unfortunately a bit slow by default. I am speeding it up with the optimized C-parser. But for that to work you need to have installed the yaml-cpp bindings in your machine: https://stackoverflow.com/a/27744056

that should solve the problem

emanjavacas commented 5 years ago

btw. reason for using yaml is just better type support. Yaml can even serialized numpy data and all.

PonteIneptique commented 5 years ago

Given the graph in the second response ( https://stackoverflow.com/revisions/50685946/2 ), and my previous results, is there a reason for yaml here ?

PonteIneptique commented 5 years ago

@emanjavacas I understand now in general, but in the context of label encoder (which from what I see is a plain set of dictionaries and list), is there a case here ?

emanjavacas commented 5 years ago

I guess there must've been one. If it bothers you a lot, you can try to change it, although that'd be a breaking change wrt backwards compatibility...

PonteIneptique commented 5 years ago

Well, technically not, because you do a try except on loading data ;)

See

https://github.com/emanjavacas/pie/blob/a06a137c08189ebd44f8496f7d32398baa4ee9e5/pie/models/base_model.py#L171-L172

and https://github.com/emanjavacas/pie/blob/a06a137c08189ebd44f8496f7d32398baa4ee9e5/pie/data/dataset.py#L408-L415

;)

PonteIneptique commented 5 years ago

But do not worry about this one. At the moment, I convert it, it does not need to update pie because of this try/except. If you think we can, then we do it ;)

emanjavacas commented 5 years ago

Ah ye, I remember I moved it from json to yaml (that's why the try block is there, for backwards compatibility). I don't remember why, so there might not be a reason. But since fast yaml is needed anyway for fast setting serialization, I guess it really wouldn't matter performance-wise. So, basically, it shouldn't matter if you use json or yaml in that case.

PonteIneptique commented 5 years ago

So, what do you think, here are the options I can think of :

  1. Should we leave it like that and have people like me hack it ?
  2. Should we add a script that could change that for people to switch to json ?
  3. Should we make it a parameter of the CLI train ?

Anyway, I'll follow you on that ;)

emanjavacas commented 5 years ago

mmh, sorry are you talking about the speed up for yaml? or just changing the format for label encoder?

if the first, it'd be cool to automate the installation of the c libraries in setup.py, but I doubt a cross-platform solution for that would be easy... if the second, feel free to submit a PR to change label encoder to json (try running the tests first, there might still be some issue with that though that I cannot remember now).

emanjavacas commented 5 years ago

we can close this, I believe!

PonteIneptique commented 5 years ago

I believe it is !