Closed PonteIneptique closed 4 years ago
Nice catch! I was checking the code, and that was indeed the only bit where the trainset was being used for dev. I will merge this on Monday.
On Sat, 15 Feb 2020 at 14:50, Thibault Clérice notifications@github.com wrote:
…tasks
Fixed #48 https://github.com/emanjavacas/pie/issues/48
BaseModel now keeps in memory (not GPU memory) the unknown and ambiguous tokens per task. This fixes a bug where, on each run_check, the train set would be read, preprocessed, and analyzed n_task x 2 times, where 2 stands for the ambiguous and known readings. E.g., on a 1.7M-token trainset with 11 tasks and 10k tokens in dev, we have:
Before fix: ~360 sec / run_check / epoch
After fix:
- 1st epoch: 35 sec
- subsequent: ~10 sec
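The caching idea behind the fix can be sketched as follows. This is a minimal, hypothetical illustration (class and parameter names are mine, not pie's actual API): the known and ambiguous token sets for a task are computed on the first request and then reused across epochs and tasks, so the train set is scanned only once per task instead of on every run_check.

```python
class TokenStatsCache:
    """Hypothetical sketch of the fix: scan the train set once per task and
    keep the known and ambiguous token sets in (CPU) memory, instead of
    re-reading and re-analyzing the train set for every epoch."""

    def __init__(self, read_trainset):
        # read_trainset: callable returning an iterable of
        # (token, {task: label}) pairs
        self.read_trainset = read_trainset
        self._cache = {}  # task -> (known_tokens, ambiguous_tokens)

    def get(self, task):
        if task not in self._cache:
            labels_seen = {}  # token -> set of labels observed for `task`
            for token, labels in self.read_trainset():
                labels_seen.setdefault(token, set()).add(labels[task])
            known = set(labels_seen)
            # a token is ambiguous if it was seen with more than one label
            ambiguous = {tok for tok, labs in labels_seen.items()
                         if len(labs) > 1}
            self._cache[task] = (known, ambiguous)
        return self._cache[task]
```

Subsequent calls with the same task hit the cache, which matches the reported timings: one slower first epoch while the cache is populated, then fast epochs afterwards.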
Added logging of evaluation time to keep track of potential improvements down the line
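The evaluation-time logging could look something like the sketch below (a hypothetical wrapper; the actual PR instruments pie's trainer directly, and the function name here is made up):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pie.trainer")


def timed_evaluation(evaluate, *args, **kwargs):
    """Hypothetical wrapper: run an evaluation pass and log how long
    it took, so regressions or improvements show up in the logs."""
    start = time.time()
    result = evaluate(*args, **kwargs)
    logger.info("Evaluation took %.2f sec", time.time() - start)
    return result
```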
I'll probably ask for a release bump after this bug fix :D
You can view, comment on, or merge this pull request online at:
https://github.com/emanjavacas/pie/pull/49 Commit Summary
- Scorer/Trainer now use scanning of trainset only once for all epochs/tasks
File Changes
- M .gitignore https://github.com/emanjavacas/pie/pull/49/files#diff-0 (4)
- M pie/models/base_model.py https://github.com/emanjavacas/pie/pull/49/files#diff-1 (21)
- M pie/models/scorer.py https://github.com/emanjavacas/pie/pull/49/files#diff-2 (44)
- M pie/trainer.py https://github.com/emanjavacas/pie/pull/49/files#diff-3 (3)
-- Enrique Manjavacas
Once it was clear the trainset was involved, the fix was indeed SOOO easy :D Plus, the fix should actually lighten what we keep in memory: given the way Python handles memory, I believe the known tokens are now stored only once, vs. once per task :)
And thanks ;)
And have a nice week-end :D
Thanks for the fix!