Closed bpopeters closed 6 years ago
Hey @bpopeters, Thanks this is helpful and fair criticisms. Also cc'ing @vince62s .
I think the training code has changed recently to support new features. One of the conflicts is trying to support advanced features like multi-gpu. However, I am all for simplifying the code and enforcing better standards. Maybe we can think of some direct interventions to simplify and document the main pathways of the code.
I agree that it's important to support multi-gpu, but the issues as far as I can see them are mostly orthogonal to that... I think, anyway. With the changes I've made in my own branch, I don't think I've broken anything with multi-gpu (although I can't tell for sure, because there's only been one GPU free for me this week).
Hi @bpopeters , cc'ing @pltrdy too. As Sasha says, this is helpful and we would love to see better code. There are many different things in what you raise. First, don't forget the current code carries a lot of history from many people, sometimes from times when some features were not yet implemented in pytorch / torchtext. So yes, we may have created classes, objects, and functions that now exist in pytorch directly. Many contributors means it's not easy to keep the code very clean.
My main contribution was a refactor trying to make the code clearer and support multi-gpu, which was a big leap forward in my view. @pltrdy also contributed and added some new stuff that better abstracts stats, reporting, etc. I understand it's a bit cumbersome to read when you come after the fact, but from a "software" design standpoint I really think it's a plus.
Having said that, I fully understand your point of view. I'm more of an enterprise user, Paul is a researcher, and we were trying to serve both. If you have specific contributions that make things clearer, feel free; I think we all want this project to get better every day.
BTW, I have been trying to clean hundreds of issues, most of them now are old features request that nobody is really willing to code PR for, so unsure what to do with all of these.
Cheers.
Ben, I have not looked at your trainer.py code, but multi-gpu is VERY tricky. The main point is that each process has to load batches spaced by n GPUs, and you have to make sure that each process ends at the same time, i.e. processes the exact same number of batches (padding if needed); otherwise one process may still be hanging, waiting forever.
I had a very hard time debugging it at the beginning, so you would really need to make sure it works with several GPUs.
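The batch-spacing point above can be sketched in a few lines. This is a hypothetical helper (`shard_batches` is not OpenNMT-py's actual code): each of the `world_size` processes takes every n-th batch, and the shorter shards are padded so that all processes perform the same number of synchronized steps instead of one rank blocking forever.

```python
def shard_batches(batches, rank, world_size, pad_batch=None):
    """Round-robin shard: process `rank` takes every `world_size`-th batch.

    All shards are padded to the same length so every process runs the
    same number of synchronized steps (e.g. gradient all_reduce calls);
    otherwise the ranks with fewer batches would hang waiting on the rest.
    """
    shard = batches[rank::world_size]
    max_len = -(-len(batches) // world_size)  # ceil division: longest shard
    while len(shard) < max_len:
        shard.append(pad_batch)  # dummy batch, skipped or zero-grad in training
    return shard

# 10 batches over 3 GPUs: every shard ends up with 4 entries.
batches = list(range(10))
shards = [shard_batches(batches, r, 3) for r in range(3)]
```

The padding entries are the subtle part: each rank must still enter the collective communication call for those steps, just with no real gradient contribution.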
Interesting discussion. I personally find it pretty hard to keep something "simple" at each level of abstraction. Implementing a feature as simply as possible may result in over-assuming how people will use it, and/or restricting possibility to extend it. To solve it, one may add an abstraction level, and/or parameters, which eventually makes the code "less simple".
For that matter, I tend to use classes e.g. to remove model saving / trainer reporting from the trainer itself and use instances of new classes. As you said, it's probably a question of working habits. The objective was similar: making it clearer etc., but that's a bit subjective. My point is that it may seem "over-engineered" before it's used: it's intended to be easy to extend, but there's no other implementation than the default one. If you argue that it, in fact, does not even help, then it should indeed be removed.
An example I faced was during the translation server implementation. At that time the translation process was expecting a filename (inputs), and was writing to an output file. That's typically what I consider "over-assuming". I tried to make it possible to use any text iterator as input and to return a list of predictions. I added stuff that probably does not make the code clearer/simpler to read, but it made it easier to use as a library. It's hard to tell how "good" this change really is. It's both "helpful" and "unnecessarily complicated" depending on your point of view.
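The file-versus-iterator point can be made concrete with a small sketch. The function below is illustrative only (its name and signature are not the translation server's actual API): by accepting any iterable of sentences and returning a list, the same entry point serves files, lists, and generators, instead of over-assuming a filename in and an output file out.

```python
from typing import Iterable, List

def translate(src: Iterable[str]) -> List[str]:
    """Accept any iterator of source sentences and return predictions
    as a list, rather than reading a file and writing a file.
    The uppercasing below is a stand-in for real model decoding.
    """
    return [sent.strip().upper() for sent in src]

# The same function now serves in-memory lists, open file handles,
# and generators alike -- no temp files required:
preds = translate(["hello world", "guten tag"])
```

This is the "library-friendly" shape: a caller with a file can still do `translate(open(path))`, but a caller with sentences in memory no longer has to touch the filesystem.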
@vince62s Regarding multigpu, I don't doubt that it's tough and I still need to check it.
@pltrdy This is true: there's always a tension between simplicity and flexibility. But I am often skeptical when the proposed solution is to define a separate class that follows the NounVerber template. There are other (and I would argue better) ways to make code easier to extend. I've been trying to extend the code and have found that there are certain dimensions along which it really does not want to be extended.
Makes sense.
I had a very quick look at both your master and the "new trainer" branch. As long as it gives the same results and works multi-gpu, I am very fine with it. Maybe, as @srush asked me to split things into several PRs to make them clearer (it was a pain but necessary), it might help others understand the logic if you do it in steps.
Closing this. Whenever you're ready, just submit your PR. Thanks.
The issue wasn't really about a particular PR, although thank you for reminding me of the trainer work I had sitting around. I finally have enough GPUs to test it.
Sorry, I didn't mean one particular PR, but if we want to move in the direction you suggest (and I agree with all of it), we need to discuss it PR by PR. BTW, not a big one, but I will merge #938
The other issue that I've run into a lot during my own research is the disconnection between the copy attention code and the rest of the modelling code. Trying to harmonise the two so that features work more easily across both would be great, particularly since new stuff like multi-gpu training isn't available with copy networks.
I agree. One step is described here #862 (see my last comment) But there is more to do I think.
I'm an NLP researcher who does a lot of work with sequence-to-sequence models. Naturally then, my preference is for code that is written in a way that minimizes the time and effort required to implement new features: new models, new loss functions, new types of training data, and so on. In other words, I'm basically the target audience of OpenNMT-py, according to the `README.md`. I think it's a good idea for a person to periodically look back and see whether they are living consistently with the goals they have set. A person and a project for neural machine translation are not exactly the same thing, but grant me this metaphor and let me ask:
Is OpenNMT-py research-friendly for trying out new ideas at the current moment?
Here's a recent example: I'm trying out a new architecture for which it would be interesting to log some additional stats besides perplexity, accuracy, and so on. So I plunged into the training section of the code to find what I needed to alter. I followed `train.py` to `train_multi.py` to `train_single.py` to `trainer.py`. This is already four levels of depth, but the files are well-named and clear in their purpose, and eventually I end up in a method that is concisely named `train`. Good so far. I understand how to train by backpropagation, so it should be easy to figure out where and how to log a couple of additional numbers, right?

Well, actually, not really. I understand the training algorithm, I know exactly when the values I need should be computed, and I know what will get logged when. But the code in `Trainer` is not at all transparent about what actually happens in it. This is the case for a variety of reasons:

- The actual training work is buried in a method called `_gradient_accumulation`, and saving and reporting happen in methods that unhelpfully start with "maybe".
- The loops in `train` are careless, overly complicated, and asking for future bugs. Thanks to the break in `train` immediately after `step` is incremented, the condition in the outer while loop never evaluates to `False`. Having a condition that looks like it determines when training is done is misleading. The total number of iterations through the while loop and the first for loop inside it is num_steps * batches_per_optimizer_update. This is not a difficult loop to get right in a much more parsimonious way: see here for my in-progress solution to some stylistic problems with `Trainer`.
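The loop shape being argued for can be sketched in a few lines. The names here (`num_steps`, `batches_per_update`) are illustrative, not OpenNMT-py's actual signatures: the outer loop header alone determines when training stops, and the total number of batches consumed is exactly num_steps * batches_per_update.

```python
def train(batches, num_steps, batches_per_update):
    """Run exactly `num_steps` optimizer updates, each accumulating
    gradients over `batches_per_update` batches. Termination lives in
    the loop header itself -- no while-True / break dance needed.
    """
    batch_iter = iter(batches)
    consumed = 0
    for step in range(1, num_steps + 1):
        for _ in range(batches_per_update):
            batch = next(batch_iter)  # forward/backward would happen here
            consumed += 1
        # optimizer.step(); optimizer.zero_grad()  # one update per outer step
    return consumed

# 4 steps * 2 batches per update = 8 batches consumed in total.
consumed = train(range(100), num_steps=4, batches_per_update=2)
```

Because stopping is expressed directly by `range(1, num_steps + 1)`, a reader can verify the iteration count from the header alone instead of tracing a break buried in the body.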
- To understand what `Trainer` actually does, I need to check not just `trainer.py` but the whole codebase, because its attributes could be accessed and altered anywhere. In the particular case of `Trainer`, the rest of the codebase doesn't manipulate these attributes, so they could be made private (well, pseudo-private, since python doesn't have real private attributes).
- `NMTLossCompute` sometimes additionally backpropagates gradients and sometimes does not. Also, the loss modules inherit from `nn.Module` but do not use the interface that pytorch modules (including the pytorch loss modules) do. As a person who spends a lot of time writing pytorch code, this is... adversarial.
- Reporting is scattered: some of it happens in `train`, more happens in a pair of `Statistics` objects, and more happens in a `ReportMgr` object. The `Statistics` object is mutated not just in training but also by the code that computes losses. This mixture of levels of abstraction and distribution of responsibility across several files makes it a pain to hunt down the particular things I need to change to get the code to work.

I'm a researcher, not an OOP software engineer -- my ideal level of abstraction is training loops, not classes whose names end in "manager". I understand that the purpose of `ReportMgr` is that you can subclass/overwrite it to manage reporting in a different way. But you could just as easily put a `report` method in `Trainer` and overwrite it in a subclass that you're using to train a different kind of model that requires different things to be reported. That's simpler, flatter, shorter, requires less cognitive load, and doesn't require a separate class that has no intuitive purpose related to NMT. I would argue that this is much preferable for the project's audience. If you'd like to visualize this, I have a branch where I'm working on it.

I don't mean to pick on `Trainer` in particular (there are similar issues elsewhere), and I hope I don't sound too critical of anyone else who has spent long hours working on this project. It's an extremely useful piece of code, and I don't know how I'd get anything done without it. But I think there's always room for improvement.
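For concreteness, the subclass-a-`report`-method alternative mentioned above could look like the sketch below. All names here are hypothetical (this is not OpenNMT-py's API): reporting stays a plain method on the trainer, and an experiment that needs extra stats just overrides it.

```python
class Trainer:
    """Flat design sketch: reporting is just a method on the trainer,
    overridden by subclasses that need to log different statistics.
    No separate manager class is required."""

    def report(self, step, stats):
        return f"step {step}: ppl {stats['ppl']:.2f}"

class MyExperimentTrainer(Trainer):
    """A research variant that logs one extra number alongside the defaults."""

    def report(self, step, stats):
        base = super().report(step, stats)
        return base + f", gate-sparsity {stats['sparsity']:.2f}"

line = MyExperimentTrainer().report(10, {"ppl": 5.1234, "sparsity": 0.5})
```

Whether this beats a `ReportMgr` object is the judgment call debated in this thread: the method version is shorter and flatter, while the separate-object version lets you swap reporting behavior without subclassing the trainer.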