emanjavacas / pie

A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.
MIT License
22 stars · 10 forks

Optimization of time on dev set #48

Closed PonteIneptique closed 4 years ago

PonteIneptique commented 4 years ago

Hey @emanjavacas, I think I might need advice here. I am running training on a lot of different corpora, including a relatively "big" corpus (1.5M training tokens). It's related to #29 but not entirely. On this corpus, one epoch takes 600 seconds.

The dev corpus is ~80k tokens.

I believe that evaluation currently takes more time than training, which in itself seems weird, but I can't see where to optimize further.

Code optimization aside, what settings would you recommend to avoid this?

PonteIneptique commented 4 years ago

I can actually confirm that going from the loss to the full details on dev takes at least 7 minutes. I was on the phone, so I had time to measure it :)

emanjavacas commented 4 years ago

There are two issues. One is that all the computation up to the decoder is being done twice: once for the loss (inside Trainer.evaluate: https://github.com/emanjavacas/pie/blob/master/pie/trainer.py#L240) and once for the predictions (inside BaseModel.evaluate: https://github.com/emanjavacas/pie/blob/3db90001592308317f2737de06193ab6c6e3716f/pie/models/base_model.py#L53). The second is that decoding is actually slower than just computing the loss (which is the only thing we do during training). The first one could be optimized, but it'd involve some proper refactoring, which I wasn't willing to do when I first wrote that code. The second one, not so much.
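
For illustration, a minimal sketch of what avoiding that double computation could look like: run the shared part (embeddings + encoder) once per batch and reuse it for both the loss and the decoded predictions. The method names encode, loss_from_encoded and predict_from_encoded are hypothetical placeholders, not pie's actual API.

    import torch

    def evaluate_shared_forward(model, batch_generator):
        """Sketch only: do the expensive shared computation once per batch,
        then reuse it for the loss and for the decoded predictions."""
        total_loss, predictions = 0.0, []
        with torch.no_grad():
            for inp, tasks in batch_generator:
                encoded = model.encode(inp)  # shared forward pass up to the decoder
                total_loss += model.loss_from_encoded(encoded, tasks)  # loss only
                predictions.append(model.predict_from_encoded(encoded))  # decoding only
        return total_loss, predictions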

A suggestion would be to reduce the dev set size.

Another one is to reduce the number of checks per epoch. For such a big dataset you could just do it once per epoch.

And the final suggestion is to just be patient :-p, let the thing run overnight.

PonteIneptique commented 4 years ago

So, my feeling is that refactoring both .evaluate() methods could indeed be easier and quicker than expected. My second feeling: are you sure we need to decode for evaluation (except maybe for lemma)?

The report is already once per epoch, and yes, I can definitely be patient, but still, training 40 models might take 10 days instead of potentially ~5.

For the dev set, it felt like 5% of the whole dataset was already quite small...

PonteIneptique commented 4 years ago

I'll take a shot at some optimizations to avoid recomputing things for the decoder.

PonteIneptique commented 4 years ago

I have actually looked at things a little more, and I am hesitating about what can be done.

  1. I feel like predict() could potentially return only the logits, i.e. without decoding, using a _reverse=True default so the function keeps its current behaviour (see the sketch after this list).
  2. Sharing logits between trainer.evaluate(), which gets the loss, and model.evaluate(), which also does a forward pass, feels like something that should be done, but how is the question, given that model.evaluate() uses .predict() to feed the output...

For 2., my first feeling: have model.evaluate() called as part of trainer.evaluate(), where the model's loss function could somehow be reused inside predict()?
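
A minimal sketch of what point 1 could look like: an optional flag on predict() that skips the label-encoder decoding and returns the raw logits, so a caller that only needs the loss (or wants to reuse the forward pass) can avoid the decoding cost. The _forward/_decode helpers are hypothetical and only show where the flag would branch; this is not pie's current signature.

    def predict(self, inp, _reverse=True, **kwargs):
        """Sketch only: decoded predictions by default (current behaviour),
        raw logits when _reverse=False."""
        logits = self._forward(inp, **kwargs)   # hypothetical: shared forward pass
        if not _reverse:
            return logits                       # let the caller reuse the logits
        return self._decode(logits)             # hypothetical: expensive decoding step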

PonteIneptique commented 4 years ago

As for the numbers, I was a little off, but here they are:

For 1/20th of the data :(

emanjavacas commented 4 years ago

I'd be a bit unwilling to do major refactorings on that side. Are you using the attentional decoder with beam search? (You might want to deactivate beam search during dev.) Or just use a smaller dev set. For a corpus of 1.5M tokens you don't have to stick to 5%. You'll get a very robust dev performance estimate on something like 10k tokens.
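
For what it's worth, one generic way to carve a ~10k-token dev sample out of a larger file (a sketch, not a pie utility; it assumes the usual one-token-per-line format with blank lines between sentences, and keeps whole sentences):

    import random

    def subsample_dev(in_path, out_path, target_tokens=10000, seed=42):
        """Keep randomly chosen whole sentences until ~target_tokens is reached."""
        with open(in_path, encoding="utf-8") as f:
            sentences, current = [], []
            for line in f:
                if line.strip():
                    current.append(line)
                elif current:
                    sentences.append(current)
                    current = []
            if current:
                sentences.append(current)

        random.Random(seed).shuffle(sentences)
        kept, n_tokens = [], 0
        for sentence in sentences:
            if n_tokens >= target_tokens:
                break
            kept.append(sentence)
            n_tokens += len(sentence)

        with open(out_path, "w", encoding="utf-8") as f:
            for sentence in kept:
                f.writelines(sentence)
                f.write("\n")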

PonteIneptique commented 4 years ago

Thanks @emanjavacas. Actually, the forward pass is not the issue: the loss computation apparently takes 7 seconds... I take note of your point about the dev set size, but I'd be happy to look into why model.evaluate takes this much time. What do you think? I'd guess the issue lies mostly with decoding...

PonteIneptique commented 4 years ago

I confirm that model.evaluate() is responsible here: 376.516760349273s :)

emanjavacas commented 4 years ago

It's clearly decoding. Depending on the method, it's just expensive. Beam search slows it down quite a bit, for instance. It will happen when you use the model for tagging as well.

PonteIneptique commented 4 years ago

I don't think I am using beam search, would you recommend it?

emanjavacas commented 4 years ago

I think it's not being used during dev, indeed. It usually gives a small boost in performance (94.2 to 94.5 or so). But it's a bit expensive.

PonteIneptique commented 4 years ago

I am still looking into improving the results, but just so you know, with the AttentionalDecoder's predict_max, eval time remains the same going from 80k to 10k dev tokens on Latin...

I'll see whether the issue really lies only there, which it probably does.

PonteIneptique commented 4 years ago

Ok, I just found the weirdest thing. Dev evaluation time is highly proportional to the train set size; the dev set has nearly no impact on it.

PonteIneptique commented 4 years ago

Here is the code used to get the prints below; they lead me to think the time most probably goes into instantiating the scorers...

    # timing instrumentation added for debugging; assumes time, tqdm and torch
    # are imported at module level
    def evaluate(self, dataset, trainset=None, **kwargs):
        """
        Get scores per task

        dataset: pie.data.Dataset, dataset to evaluate on (your dev or test set)
        trainset: pie.data.Dataset (optional), if passed scores for unknown and ambiguous
            tokens can be computed
        **kwargs: any other arguments to Model.predict
        """
        assert not self.training, "Ooops! Inference in training mode. Call model.eval()"

        scorers = {}
        for task, le in self.label_encoder.tasks.items():
            scorers[task] = Scorer(le, trainset)

        with torch.no_grad():
            for (inp, tasks), (rinp, rtasks) in tqdm.tqdm(
                    dataset.batch_generator(return_raw=True)):
                # (inp, tasks) == encoded
                start = time.time()
                preds = self.predict(inp, **kwargs)
                now = time.time()
                print("Predict {}".format(now - start))
                start = now
                # - get input tokens
                tokens = [w for line in rinp for w in line]

                # - get trues
                trues = {}
                for task in preds:
                    #....

                now = time.time()
                print("Trues {}".format(now - start))
                start = now

                # accumulate
                for task, scorer in scorers.items():
                    scorer.register_batch(preds[task], trues[task], tokens)

                now = time.time()
                print("Accumulate {}".format(now - start))
                start = now

        return scorers

With 99k train data:

Check Dev Loss Duration|| 1.9269027709960938 || 0 w/s
0it [00:00, ?it/s]Predict 1.4107513427734375
Trues 0.0044536590576171875
Accumulate 0.00015997886657714844
1it [00:01,  1.71s/it]Predict 1.3279740810394287
Trues 0.004570484161376953
Accumulate 0.00021266937255859375
2it [00:03,  1.60s/it]Predict 0.1679234504699707
Trues 0.0009925365447998047
Accumulate 4.458427429199219e-05
3it [00:03,  1.13s/it]Predict 1.389880657196045
Trues 0.004476070404052734
Accumulate 0.00025916099548339844
4it [00:04,  1.24s/it]Predict 1.381545066833496
Trues 0.004101753234863281
Accumulate 0.0004475116729736328
5it [00:06,  1.31s/it]Predict 1.3811256885528564
Trues 0.0044710636138916016
Accumulate 0.0005216598510742188
6it [00:08,  1.35s/it]
Just Evaluate || 29.148979663848877

With 1.6M train data:

Check Dev Loss Duration|| 1.8826487064361572 || 0 w/s
0it [00:00, ?it/s]Predict 1.4889159202575684
Trues 0.005564212799072266
Accumulate 0.00016260147094726562
1it [00:01,  1.79s/it]Predict 1.5353443622589111
Trues 0.005633115768432617
Accumulate 0.0002377033233642578
2it [00:03,  1.75s/it]Predict 1.3615729808807373
Trues 0.004605531692504883
Accumulate 0.0002830028533935547
3it [00:05,  1.67s/it]Predict 0.19774889945983887
Trues 0.001180887222290039
Accumulate 5.555152893066406e-05
4it [00:05,  1.31s/it]Predict 1.4601054191589355
Trues 0.0048258304595947266
Accumulate 0.0003323554992675781
5it [00:06,  1.38s/it]Predict 1.3555328845977783
Trues 0.005044221878051758
Accumulate 0.00034236907958984375
6it [00:08,  1.40s/it]
Just Evaluate || 349.12779903411865

PonteIneptique commented 4 years ago

If this really is it, the fix is SOOO simple...

PonteIneptique commented 4 years ago

I confirm it then: here is the timing for each Scorer initialization.

        start = time.time()
        scorers = {}
        for task, le in self.label_encoder.tasks.items():
            scorers[task] = Scorer(le, trainset)
            now = time.time()
            print("Initiating scorer for {}: {}".format(task, now - start))
            start = now

Initiating scorer for lemma: 30.46245050430298
Initiating scorer for pos: 30.20334005355835
Initiating scorer for Dis: 30.176021099090576
Initiating scorer for Gend: 30.083760023117065
Initiating scorer for Numb: 30.384766340255737
Initiating scorer for Case: 30.72568941116333
Initiating scorer for Deg: 32.61731815338135
Initiating scorer for Mood: 33.193106174468994
Initiating scorer for Tense: 30.71157741546631
Initiating scorer for Voice: 31.059696197509766
Initiating scorer for Person: 30.477245807647705
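
For illustration, a minimal sketch of the kind of "simple fix" this points to: walk the training set once, build the set of known training tokens, and share it across all the scorers, instead of letting every Scorer re-process the full trainset. This assumes Scorer only needs trainset to know which tokens were seen in training; the known_tokens parameter is hypothetical, not Scorer's current signature.

    def build_scorers(label_encoder, trainset=None):
        """Sketch only: compute the trainset-derived information once and
        reuse it for every task's scorer."""
        known_tokens = None
        if trainset is not None:
            known_tokens = set()
            # assumes trainset exposes the same batch_generator interface
            # as the dev dataset used in evaluate() above
            for _, (rinp, _) in trainset.batch_generator(return_raw=True):
                for line in rinp:
                    known_tokens.update(line)

        return {task: Scorer(le, known_tokens=known_tokens)  # hypothetical parameter
                for task, le in label_encoder.tasks.items()}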