Closed PonteIneptique closed 3 years ago
Hey, I don't get the difference between this and #72
The set-up for the test is the same, but this applies to different features:
I think that'd make the code much harder to maintain and I am not convinced the gains are worth it. Even when considering your estimate, we are talking about 20 seconds for tagging a whole dataset. This encoding is never going to be the bottleneck, because it's just the conversion from strings to integer form. And even if you were referring to sharing the input embedding across models trained independently, the refactoring needed to make that work isn't going to be worth the small speed improvement you'd get (getting word-level embeddings is very cheap).
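To make concrete what "encoding" means here, below is a minimal, hypothetical sketch (the class and method names are illustrative, not pie's actual API) of the string-to-integer conversion being discussed. It is a plain dictionary lookup per token, which is why it is cheap relative to the neural forward pass:

```python
# Minimal sketch (hypothetical, not pie's actual API) of the encoding
# step under discussion: converting token strings to integer ids.
class LabelEncoder:
    def __init__(self, vocab):
        # reserve id 0 for out-of-vocabulary tokens
        self.index = {tok: i + 1 for i, tok in enumerate(vocab)}

    def encode(self, tokens):
        # one dict lookup per token; unknown tokens map to 0
        return [self.index.get(tok, 0) for tok in tokens]

encoder = LabelEncoder(["arma", "virumque", "cano"])
print(encoder.encode(["arma", "cano", "troiae"]))  # → [1, 3, 0]
```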
You can actually look at the implementation in #71: the refactor is really, really small... I know this is not a bottleneck, but the cost of the change was also quite small...
The only changes are the following:
https://github.com/emanjavacas/pie/pull/71/files#diff-df0f6f3f56e1cc328ee783bb55949adeR129-R143
All other changes are new features to accompany this change, including the test. I still feel like this is worth it :/
It's just a high cost to maintain: making sure older models still work, that new models train exactly as old ones did, etc., but especially it's another edge case to keep in mind every time you want to tweak anything related to input processing (for example the transformer input). And really, speed is not an issue here. Look at the speed-up in terms of relative improvement and you'll see it's really not that important. I am sorry, because I know you have already worked on the fix, but I have run into a couple of issues in the past where a small change introduced totally unexpected bugs. It's just a headache that I'd rather not have.
Hey :)
Technically, I do not touch models or label encoders. The only thing I add (and I really mean it) is a way to optionally provide shared encoders at tagging time. Models are not changed, and tagging is not changed per se. I would completely agree with you otherwise, but here I do not touch anything in the models :/
As for bugs, that is exactly why I carefully checked the change and provided tests with it :/
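The change being described above, sketched in hypothetical form (class names, the `encoded` parameter, and the task labels are all illustrative, not pie's actual API): each tagger keeps its own encoder, but a caller may pass pre-encoded input so that several models can share a single encoding pass.

```python
# Hypothetical sketch of the proposal: taggers optionally accept
# pre-encoded input at tagging time instead of encoding it themselves.
class InputEncoder:
    """Maps token strings to integer ids; 0 is reserved for unknowns."""
    def __init__(self, vocab):
        self.index = {tok: i + 1 for i, tok in enumerate(vocab)}

    def encode(self, tokens):
        return [self.index.get(tok, 0) for tok in tokens]

class Tagger:
    """Stand-in for one per-task model (lemma, pos, ...)."""
    def __init__(self, task, vocab):
        self.task = task
        self.encoder = InputEncoder(vocab)

    def tag(self, tokens, encoded=None):
        # The optional part: accept pre-encoded ids when the caller
        # shares an encoder; otherwise fall back to encoding locally,
        # so existing behaviour is untouched.
        if encoded is None:
            encoded = self.encoder.encode(tokens)
        return [f"{self.task}:{i}" for i in encoded]

vocab = ["arma", "virumque", "cano"]
taggers = [Tagger(task, vocab) for task in ("lemma", "pos")]
tokens = ["arma", "cano"]

# Shared path: encode once, reuse the ids for every model.
shared = InputEncoder(vocab).encode(tokens)
results = {t.task: t.tag(tokens, encoded=shared) for t in taggers}
print(results)  # → {'lemma': ['lemma:1', 'lemma:3'], 'pos': ['pos:1', 'pos:3']}
```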
It's nice that you wrote some tests, but unit testing is never 100% bulletproof. Still, my issue with this kind of small-scale patch is the mental overhead it creates. For example, right now if we need to update the label encoder, we will also have to make sure this functionality still works. That's an overhead I would accept if the gains were substantial, but I don't think they are in this case.
I agree unit testing is not bulletproof. I just disagree that the changes will have an impact down the road, since they are not at the level of the label encoder. The things I added at the label encoder level are helpers, but all changes could have been kept to the single code snippet I showed you. The rest is just there to help users use this feature.
I am closing this issue. I understand your argument, and I really do not think it's worth debating further :) Worst case, my code and reasoning stay around and we can come back to them. This is definitely less impactful than #73
I wanted to evaluate this for the future Latin model I am training (with 8 different models, one for the lemma and one for each morpho-syntactic feature).
Here is a small snippet I created to evaluate this, which shows that we could gain by sharing things across models when tagging large corpora. It does not require a big change to the code.
Current situation
Sharing encoders
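The original timing snippet for the two scenarios above is not reproduced in this thread. As a rough, hypothetical sketch of the comparison (the encoder is a stand-in, not pie's actual code): in the "current situation" each of the 8 models encodes the corpus itself, while with shared encoders the corpus is encoded once and the ids are reused.

```python
# Rough, hypothetical timing sketch (not the original snippet):
# re-encoding the corpus once per model vs. encoding it once for all.
import time

def make_encoder(vocab):
    index = {tok: i + 1 for i, tok in enumerate(vocab)}
    return lambda tokens: [index.get(t, 0) for t in tokens]

vocab = [f"tok{i}" for i in range(50_000)]
corpus = [f"tok{i % 60_000}" for i in range(200_000)]
encoders = [make_encoder(vocab) for _ in range(8)]  # 8 models

# Current situation: each model encodes the whole corpus itself.
start = time.perf_counter()
per_model = [enc(corpus) for enc in encoders]
t_separate = time.perf_counter() - start

# Sharing encoders: encode once, reuse the ids for all 8 models.
start = time.perf_counter()
shared = encoders[0](corpus)
per_model_shared = [shared for _ in encoders]
t_shared = time.perf_counter() - start

print(f"separate: {t_separate:.3f}s  shared: {t_shared:.3f}s")
```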