Closed bpopeters closed 6 years ago
Hey @bpopeters, Thanks this is helpful and fair criticisms. Also cc'ing @vince62s .
I think the training code has changed recently to support new features. One of the conflicts is trying to support advanced features like multi-gpu. However, I am all for simplifying the code and enforcing better standards. Maybe we can think of some direct interventions to simplify and document the main pathways of the code.
I agree that it's important to support multi-gpu, but the issues as far as I can see them are mostly orthogonal to that... I think, anyway. With the changes I've made in my own branch, I don't think I've broken anything with multi-gpu (although I can't tell for sure, because there's only been one GPU free for me this week).
Hi @bpopeters , cc'ing @pltrdy too. As Sasha says, this is helpful and we would love to see better code. There are many different things in what you raise. First, don't forget the current code carries a lot of history from many people, sometimes from times when some features were not yet implemented in pytorch / torchtext. So yes, we may have created classes, objects, and functions that now exist in pytorch directly. Many contributors means it's not easy to keep the code very clean.
My main contribution was a refactor trying to make the code clearer and support multi-gpu, which was a big leap forward in my view. @pltrdy also contributed and added some new stuff that better abstracts stats, reporting, etc. I understand it's a bit cumbersome to read when you come after the fact, but from a "software" design standpoint I really think it's a plus.
Having said that, I fully understand your point of view. I'm more of an enterprise user, Paul is a researcher, and we were trying to serve both. If you have specific contributions that make things clearer, feel free; I think we all want this project to get better every day.
BTW, I have been trying to clean hundreds of issues, most of them now are old features request that nobody is really willing to code PR for, so unsure what to do with all of these.
Cheers.
Ben, I have not looked at your trainer.py code, but multi-gpu is VERY tricky. The main point is that each process has to load batches spaced by n GPUs, and you have to make sure that each process ends at the same time, i.e. processes the exact same number of batches (padding if needed); otherwise one process may still be hanging, waiting forever.
I had a very hard time debugging it at the beginning, so you would really need to make sure it works with several GPUs.
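The batch-spacing point above can be sketched in a few lines. This is a hypothetical helper (`shard_batches` is not OpenNMT-py's actual code): each of the `world_size` processes takes every n-th batch, and the shorter shards are padded so that all processes perform the same number of synchronized steps instead of one rank blocking forever.

```python
def shard_batches(batches, rank, world_size, pad_batch=None):
    """Round-robin shard: process `rank` takes every `world_size`-th batch.

    All shards are padded to the same length so every process runs the
    same number of synchronized steps (e.g. gradient all_reduce calls);
    otherwise the ranks with fewer batches would hang waiting on the rest.
    """
    shard = batches[rank::world_size]
    max_len = -(-len(batches) // world_size)  # ceil division: longest shard
    while len(shard) < max_len:
        shard.append(pad_batch)  # dummy batch, skipped or zero-grad in training
    return shard

# 10 batches over 3 GPUs: every shard ends up with 4 entries.
batches = list(range(10))
shards = [shard_batches(batches, r, 3) for r in range(3)]
```

The padding entries are the subtle part: each rank must still enter the collective communication call for those steps, just with no real gradient contribution.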
Interesting discussion. I personally find it pretty hard to keep something "simple" at each level of abstraction. Implementing a feature as simply as possible may result in over-assuming how people will use it, and/or restricting possibility to extend it. To solve it, one may add an abstraction level, and/or parameters, which eventually makes the code "less simple".
For that matter, I tend to use classes e.g. to remove model saving / trainer reporting from the trainer itself and use instances of new classes. As you said, it's probably a question of working habits. The objective was similar: making it clearer etc., but that's a bit subjective. My point is that it may seem "over-engineered" before it's used: it's intended to be easy to extend, but there's no other implementation than the default one. If you argue that it, in fact, does not even help, then it should indeed be removed.
An example I faced was during the translation server implementation. At that time the translation process was expecting a filename (inputs), and was writing to an output file. That's typically what I consider "over-assuming". I tried to make it possible to use any text iterator as input and to return a list of predictions. I added stuff that probably does not make the code clearer/simpler to read, but it made it easier to use as a library. It's hard to tell how "good" this change really is. It's both "helpful" and "unnecessarily complicated" depending on your point of view.
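The file-versus-iterator point can be made concrete with a small sketch. The function below is illustrative only (its name and signature are not the translation server's actual API): by accepting any iterable of sentences and returning a list, the same entry point serves files, lists, and generators, instead of over-assuming a filename in and an output file out.

```python
from typing import Iterable, List

def translate(src: Iterable[str]) -> List[str]:
    """Accept any iterator of source sentences and return predictions
    as a list, rather than reading a file and writing a file.
    The uppercasing below is a stand-in for real model decoding.
    """
    return [sent.strip().upper() for sent in src]

# The same function now serves in-memory lists, open file handles,
# and generators alike -- no temp files required:
preds = translate(["hello world", "guten tag"])
```

This is the "library-friendly" shape: a caller with a file can still do `translate(open(path))`, but a caller with sentences in memory no longer has to touch the filesystem.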
@vince62s Regarding multigpu, I don't doubt that it's tough and I still need to check it.
@pltrdy This is true: there's always a tension between simplicity and flexibility. But I am often skeptical when the proposed solution is to define a separate class that follows the NounVerber template. There are other (and I would argue better) ways to make code easier to extend. I've been trying to extend the code and have found that there are certain dimensions along which it really does not want to be extended.
Makes sense.
I had a very quick look at both your master and the "new trainer" branch. As long as it gives the same results and works multi-gpu, I am very fine with it. Maybe, as @srush asked me to split things into several PRs to make them clearer (it was a pain but necessary), it might help others understand the logic if you do it in steps.
Closing this. Whenever you're ready, just submit your PR. Thanks.
The issue wasn't really about a particular PR, although thank you for reminding me of the trainer work I had sitting around. I finally have enough GPUs to test it.
Sorry, I didn't mean one particular PR, but if we want to move in the direction you suggest (and I agree with all of it), we need to discuss it PR by PR. BTW, not a big one, but I will merge #938
The other issue that I've run into a lot during my own research is the disconnection between the copy attention code and the rest of the modelling code. Trying to harmonise the two so that features work more easily across both would be great, particularly since new stuff like multi-gpu training isn't available with copy networks.
I agree. One step is described here #862 (see my last comment) But there is more to do I think.
I'm an NLP researcher who does a lot of work with sequence-to-sequence models. Naturally then, my preference is for code that is written in a way that minimizes the time and effort required to implement new features: new models, new loss functions, new types of training data, and so on. In other words, I'm basically the target audience of OpenNMT-py, according to the `README.md`. I think it's a good idea for a person to periodically look back and see whether they are living consistently with the goals they have set. A person and a project for neural machine translation are not exactly the same thing, but grant me this metaphor and let me ask:
Is OpenNMT-py research-friendly for trying out new ideas at the current moment?
Here's a recent example: I'm trying out a new architecture for which it would be interesting to log some additional stats besides perplexity, accuracy, and so on. So I plunged into the training section of the code to find what I needed to alter. I followed `train.py` to `train_multi.py` to `train_single.py` to `trainer.py`. This is already four levels of depth, but the files are well-named and clear in their purpose, and eventually I end up in a method that is concisely named `train`. Good so far. I understand how to train by backpropagation, so it should be easy to figure out where and how to log a couple of additional numbers, right?

Well, actually, not really. I understand the training algorithm, I know exactly when the values I need should be computed, and I know what will get logged when. But the code in `Trainer` is not at all transparent about what actually happens in it. This is the case for a variety of reasons:

- The actual training work is buried in a method called `_gradient_accumulation`, and saving and reporting happen in methods that unhelpfully start with "maybe".
- The loops in `train` are careless, overly complicated, and asking for future bugs. Thanks to the break in `train` immediately after `step` is incremented, the condition in the outer while loop never evaluates to `False`. Having a condition that looks like it determines when training is done is misleading. The total number of iterations through the while loop and the first for loop inside it is num_steps * batches_per_optimizer_update. This is not a difficult loop to get right in a much more parsimonious way: see here for my in-progress solution to some stylistic problems with `Trainer`.
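The loop shape being argued for can be sketched in a few lines. The names here (`num_steps`, `batches_per_update`) are illustrative, not OpenNMT-py's actual signatures: the outer loop header alone determines when training stops, and the total number of batches consumed is exactly num_steps * batches_per_update.

```python
def train(batches, num_steps, batches_per_update):
    """Run exactly `num_steps` optimizer updates, each accumulating
    gradients over `batches_per_update` batches. Termination lives in
    the loop header itself -- no while-True / break dance needed.
    """
    batch_iter = iter(batches)
    consumed = 0
    for step in range(1, num_steps + 1):
        for _ in range(batches_per_update):
            batch = next(batch_iter)  # forward/backward would happen here
            consumed += 1
        # optimizer.step(); optimizer.zero_grad()  # one update per outer step
    return consumed

# 4 steps * 2 batches per update = 8 batches consumed in total.
consumed = train(range(100), num_steps=4, batches_per_update=2)
```

Because stopping is expressed directly by `range(1, num_steps + 1)`, a reader can verify the iteration count from the header alone instead of tracing a break buried in the body.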
- To understand what `Trainer` actually does, I need to check not just `trainer.py` but the whole codebase, because its attributes could be accessed and altered anywhere. In the particular case of `Trainer`, the rest of the codebase doesn't manipulate these attributes, so they could be made private (well, pseudo-private, since python doesn't have real private attributes).
- `NMTLossCompute` sometimes additionally backpropagates gradients and sometimes does not. Also, the loss modules inherit from `nn.Module` but do not use the interface that pytorch modules (including the pytorch loss modules) do. As a person who spends a lot of time writing pytorch code, this is... adversarial.
- Reporting is scattered: some of it happens in `train`, more happens in a pair of `Statistics` objects, and more happens in a `ReportMgr` object. The `Statistics` object is mutated not just in training but also by the code that computes losses. This mixture of levels of abstraction and distribution of responsibility across several files makes it a pain to hunt down the particular things I need to change to get the code to work.

I'm a researcher, not an OOP software engineer -- my ideal level of abstraction is training loops, not classes whose names end in "manager". I understand that the purpose of `ReportMgr` is that you can subclass/overwrite it to manage reporting in a different way. But you could just as easily put a `report` method in `Trainer` and overwrite it in a subclass that you're using to train a different kind of model that requires different things to be reported. That's simpler, flatter, shorter, requires less cognitive load, and doesn't require a separate class that has no intuitive purpose related to NMT. I would argue that this is much preferable for the project's audience. If you'd like to visualize this, I have a branch where I'm working on it.

I don't mean to pick on `Trainer` in particular (there are similar issues elsewhere), and I hope I don't sound too critical of anyone else who has spent long hours working on this project. It's an extremely useful piece of code, and I don't know how I'd get anything done without it. But I think there's always room for improvement.
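For concreteness, the subclass-a-`report`-method alternative mentioned above could look like the sketch below. All names here are hypothetical (this is not OpenNMT-py's API): reporting stays a plain method on the trainer, and an experiment that needs extra stats just overrides it.

```python
class Trainer:
    """Flat design sketch: reporting is just a method on the trainer,
    overridden by subclasses that need to log different statistics.
    No separate manager class is required."""

    def report(self, step, stats):
        return f"step {step}: ppl {stats['ppl']:.2f}"

class MyExperimentTrainer(Trainer):
    """A research variant that logs one extra number alongside the defaults."""

    def report(self, step, stats):
        base = super().report(step, stats)
        return base + f", gate-sparsity {stats['sparsity']:.2f}"

line = MyExperimentTrainer().report(10, {"ppl": 5.1234, "sparsity": 0.5})
```

Whether this beats a `ReportMgr` object is the judgment call debated in this thread: the method version is shorter and flatter, while the separate-object version lets you swap reporting behavior without subclassing the trainer.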