emanjavacas / pie

A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.
MIT License

Share input encoder at training and tagging time #70

Closed PonteIneptique closed 3 years ago

PonteIneptique commented 4 years ago

I wanted to evaluate this for the future Latin model I am training (which uses 8 different models: one for the lemma and one for each morpho-syntactic feature).

Here is a small snippet I wrote to evaluate this, which shows that we could gain time by sharing the input encoding across models when tagging large corpora. It does not require a huge change to the code.

Current situation

import time
import glob
import os.path

import tqdm

import pie
from pie.tagger import simple_tokenizer, pack_batch

file = "PathToAModel"  # placeholder: path to a trained pie model
base_encoder = pie.model.BaseModel.load(file).label_encoder

example = list(simple_tokenizer("""Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Phasellus dolor sapien, laoreet non turpis eget, tincidunt commodo magna. Duis at dapibus ipsum. 
Etiam fringilla et magna sed vehicula. 
Nunc tristique eros non faucibus viverra. 
Sed dictum scelerisque tortor, eu ullamcorper odio. 
Aenean fermentum a urna quis tempus. 
Maecenas imperdiet est a nisi pellentesque dictum. 
Maecenas ac hendrerit ante. Vestibulum eleifend nulla at vulputate sagittis. 
Maecenas sed magna diam. Sed facilisis tempus ipsum, nec mattis elit tincidunt lobortis. 
Phasellus vel ex lorem. Nulla nunc odio, tempor non consequat in, luctus elementum dolor. 
Nullam tincidunt purus vel lorem placerat, ac pulvinar turpis sodales. 
Sed eget urna ac quam cursus porta. 
Pellentesque luctus aliquet sem, a egestas purus finibus ac. 
Mauris nec mauris non metus tempor faucibus non in est. 
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. 
Proin tristique nulla nec purus iaculis, eu scelerisque mi egestas. 
In hac habitasse platea dictumst.
Ut placerat a neque eget aliquet. """))

print(f"Sentences : {len(example)}")
print(f"Words : {len([x for s in example for x in s])}")

runs = 1000
encoders = 8  # Latin has 8 tasks

records = []
for run in tqdm.tqdm(range(runs)):
    start = time.time()
    for encoder in range(encoders):
        pack_batch(base_encoder, example, "cpu")
    records.append(time.time() - start)

print(f"Average encoding time with {encoders} encoders {sum(records) / len(records)}")

Sentences : 22
Words : 194
100%|██████████| 1000/1000 [00:21<00:00, 46.80it/s]
Average encoding time with 8 encoders 0.02126287913322449

Sharing encoders

records = []
for run in tqdm.tqdm(range(runs)):
    start = time.time()
    pack_batch(base_encoder, example, "cpu")
    records.append(time.time() - start)

print(f"Average encoding time with a single encoder {sum(records) / len(records)}")

100%|██████████| 1000/1000 [00:02<00:00, 352.26it/s]
Average encoding time with a single encoder 0.002821772575378418
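The idea behind the speed-up can be sketched independently of pie's internals: instead of re-encoding the same sentences once per model, encode them once and hand the result to every model. The sketch below is a toy illustration, not pie's actual API; `encode_batch`, the toy encoder, and the dummy `models` are all hypothetical stand-ins, assuming the models accept a compatible shared input encoding.

```python
def tag_with_shared_encoding(sentences, models, encode_batch):
    """Encode `sentences` once, then run every model on the same batch.

    `encode_batch` stands in for something like pie's `pack_batch`;
    `models` maps a task name to any callable taking the encoded batch.
    """
    batch = encode_batch(sentences)  # done once, not once per model
    return {name: model(batch) for name, model in models.items()}


# Toy demonstration: the encoder runs exactly once even with 8 "models".
calls = []

def toy_encoder(sents):
    calls.append(1)  # count how many times encoding actually happens
    return [w for s in sents for w in s]

models = {f"task{i}": (lambda batch, i=i: len(batch) + i) for i in range(8)}
out = tag_with_shared_encoding([["lorem", "ipsum"]], models, toy_encoder)
```

With the current code, `toy_encoder` would run 8 times (once per task); with sharing, `calls` only grows by one, which is where the factor-of-8 difference in the benchmarks above comes from.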

emanjavacas commented 4 years ago

Hey, I don't get the difference between this and #72

PonteIneptique commented 4 years ago

The set-up for the test is the same, but this applies to a different feature.

emanjavacas commented 4 years ago

I think that'd make the code much harder to maintain and I am not convinced the gains are worth it. Even when considering your estimate, we are talking about 20 seconds for tagging a whole dataset. This encoding is never going to be the bottleneck, because it's just the conversion from strings to integer form. And even if you were referring to sharing the input embedding across models trained independently, the refactoring needed to make that work isn't going to be worth the small speed improvement you'd get (getting word-level embeddings is very cheap).
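For context, the "conversion from strings to integer form" mentioned above is essentially a vocabulary lookup. A toy illustration (the vocabulary and sentence here are made up, not pie's actual data structures):

```python
# Hypothetical vocabulary mapping tokens to integer ids; unknown
# tokens fall back to a reserved "<unk>" id.
vocab = {"<unk>": 0, "lorem": 1, "ipsum": 2, "dolor": 3}
sentence = ["lorem", "ipsum", "sit"]
ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence]
```

Each lookup is a dictionary access, which is why this step is cheap relative to the neural forward pass.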

PonteIneptique commented 4 years ago

You can actually look at the implementation in #71 ; the refactoring is really quite small... I know this is not a bottleneck, but the cost of the change is also quite small...

PonteIneptique commented 4 years ago

The only changes are the following:

https://github.com/emanjavacas/pie/pull/71/files#diff-df0f6f3f56e1cc328ee783bb55949adeR129-R143

All the other changes are new features to accompany this change, including the tests. I still feel like this is worth it :/

emanjavacas commented 4 years ago

It's just a high cost to maintain: making sure older models still work, that new models train exactly as old ones did, etc. But especially, it's another edge case to keep in mind every time you want to tweak anything related to input processing (for example, the transformer input). And speed really is not an issue here: look at the speed-up in terms of relative improvement and you'll see it's really not that important. I am sorry, because I know you have already worked on the fix, but I have run into a couple of issues in the past where a small change introduced totally unexpected bugs. It's just a headache that I'd rather not have.


PonteIneptique commented 4 years ago

Hey :)

Technically, I do not touch the models or the label encoders. The only thing I add (and I really mean it) is a way to optionally provide shared encoders at tagging time. The models are not changed, and tagging is not changed per se. I would completely agree with you otherwise, but here I do not touch anything in the models :/

As for bugs, there is a reason I carefully checked and provided tests with it :/

emanjavacas commented 4 years ago

It's nice that you wrote some tests, but unit testing definitely isn't ever 100% bulletproof. Still, my issue with these kinds of small-scale patches is the mental overhead they create. For example, if we need to update the label encoder, we will now have to make sure to also update this functionality so that it still works. That's an overhead I would accept if the gains were substantial, but I don't think they are in this case.


PonteIneptique commented 4 years ago

I agree that unit testing is not bulletproof. I simply disagree that the changes will have an impact down the road, as they are not at the level of the label encoder. The things I added at the label-encoder level are helpers, but all the changes could have been kept to the single code snippet I showed you; the rest is just there to help users use this feature.


PonteIneptique commented 3 years ago

I am closing this issue. I understand your argument, and I really do not think it's worth debating further :) Worst case, my code and notes stay around and we can come back to them. It is definitely less impactful than #73.