Closed PonteIneptique closed 4 years ago
Nice catch! I was checking the code, and that was indeed the only bit where the trainset was being used for dev. I will merge this on Monday.
On Sat, 15 Feb 2020 at 14:50, Thibault Clérice notifications@github.com wrote:
…tasks
Fixed #48 https://github.com/emanjavacas/pie/issues/48
BaseModel now keeps in memory (not GPU memory) the unknown and ambiguous tokens per task. This fixes a bug where, on each run_check, the train set would be read, preprocessed, and analyzed n_task x 2 times, where 2 stands for the ambiguous and known readings. E.g., on a 1.7M-token trainset with 11 tasks and 10k tokens in dev, we have:
Before fix: ~360 sec / run_check / epoch
After fix:
- 1st epoch: 35 sec
- subsequent: ~10 sec
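The caching idea behind the fix can be sketched as follows. This is a minimal, hypothetical illustration (class and parameter names are mine, not pie's actual API): the known and ambiguous token sets for a task are computed on the first request and then reused across epochs and tasks, so the train set is scanned only once per task instead of on every run_check.

```python
class TokenStatsCache:
    """Hypothetical sketch of the fix: scan the train set once per task and
    keep the known and ambiguous token sets in (CPU) memory, instead of
    re-reading and re-analyzing the train set for every epoch."""

    def __init__(self, read_trainset):
        # read_trainset: callable returning an iterable of
        # (token, {task: label}) pairs
        self.read_trainset = read_trainset
        self._cache = {}  # task -> (known_tokens, ambiguous_tokens)

    def get(self, task):
        if task not in self._cache:
            labels_seen = {}  # token -> set of labels observed for `task`
            for token, labels in self.read_trainset():
                labels_seen.setdefault(token, set()).add(labels[task])
            known = set(labels_seen)
            # a token is ambiguous if it was seen with more than one label
            ambiguous = {tok for tok, labs in labels_seen.items()
                         if len(labs) > 1}
            self._cache[task] = (known, ambiguous)
        return self._cache[task]
```

Subsequent calls with the same task hit the cache, which matches the reported timings: one slower first epoch while the cache is populated, then fast epochs afterwards.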
Added logging of evaluation time to keep track of potential improvements down the line
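The evaluation-time logging could look something like the sketch below (a hypothetical wrapper; the actual PR instruments pie's trainer directly, and the function name here is made up):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pie.trainer")


def timed_evaluation(evaluate, *args, **kwargs):
    """Hypothetical wrapper: run an evaluation pass and log how long
    it took, so regressions or improvements show up in the logs."""
    start = time.time()
    result = evaluate(*args, **kwargs)
    logger.info("Evaluation took %.2f sec", time.time() - start)
    return result
```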
I'll probably ask for a release bump after this bug fix :D
You can view, comment on, or merge this pull request online at:
https://github.com/emanjavacas/pie/pull/49 Commit Summary
- Scorer/Trainer now use scanning of trainset only once for all epochs/tasks
File Changes
- M .gitignore https://github.com/emanjavacas/pie/pull/49/files#diff-0 (4)
- M pie/models/base_model.py https://github.com/emanjavacas/pie/pull/49/files#diff-1 (21)
- M pie/models/scorer.py https://github.com/emanjavacas/pie/pull/49/files#diff-2 (44)
- M pie/trainer.py https://github.com/emanjavacas/pie/pull/49/files#diff-3 (3)
-- Enrique Manjavacas
Once it was clear the trainset was involved, the fix was indeed SOOO easy :D Plus, the fix should actually lighten what we keep in memory: given the way Python handles memory, I believe the known tokens are now stored only once, vs. once per task :)
And thanks ;)
And have a nice week-end :D
Thanks for the fix!