OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

On the Suitability of OpenNMT-py for research and extension #883

Closed · bpopeters closed this issue 6 years ago

bpopeters commented 6 years ago

I'm an NLP researcher who does a lot of work with sequence-to-sequence models. Naturally then, my preference is for code that is written in a way that minimizes the time and effort required to implement new features: new models, new loss functions, new types of training data, and so on. In other words, I'm basically the target audience of OpenNMT-py, according to the README.md:

It is designed to be research friendly to try out new ideas in translation, summary, image-to-text, morphology, and many other domains.

I think it's a good idea for a person to periodically look back and see whether they are living consistently with the goals they have set. A person and a project for neural machine translation are not exactly the same thing, but grant me this metaphor and let me ask:

Is OpenNMT-py research friendly to try out new ideas at the current moment?

Here's a recent example: I'm trying out a new architecture for which it would be interesting to log some additional stats besides perplexity, accuracy, and so on. So I plunged into the training section of the code to find what I needed to alter. I followed train.py to train_multi.py to train_single.py to Trainer.py. This is already four levels of depth, but the files are well-named and clear in their purpose, and eventually I end up in a method that is concisely named train. So far so good. I understand how to train by backpropagation, so it should be easy to figure out where and how to log a couple of additional numbers, right?

Well, actually, not really. I understand the training algorithm, I know exactly when the values I need should be computed, and I know what will get logged when. But the code in Trainer is not at all transparent about what actually happens inside it, for a variety of reasons.
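For illustration, here is a minimal sketch of the kind of extension point being asked for: a training loop that accepts user-supplied statistics hooks, so logging a new metric doesn't require editing the trainer. All names here (`train_epoch`, `extra_stats_fns`, `entropy_stat`) are invented for this sketch and are not OpenNMT-py's actual API.

```python
# Hypothetical sketch, not OpenNMT-py code: a training loop with
# pluggable statistics hooks.
import torch

def train_epoch(model, batches, optimizer, criterion, extra_stats_fns=()):
    """Run one epoch, logging loss plus any user-supplied statistics."""
    for step, (src, tgt) in enumerate(batches):
        optimizer.zero_grad()
        output = model(src)                      # (batch, len, vocab)
        loss = criterion(output.view(-1, output.size(-1)), tgt.view(-1))
        loss.backward()
        optimizer.step()

        stats = {"loss": loss.item()}
        # Each hook sees the detached outputs and targets and may return
        # any extra name -> value pairs to log.
        for fn in extra_stats_fns:
            stats.update(fn(output.detach(), tgt))
        print(f"step {step}: {stats}")

# Example hook: log average output entropy alongside the loss,
# without touching the trainer itself.
def entropy_stat(output, tgt):
    probs = torch.softmax(output, dim=-1)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    return {"entropy": ent.item()}
```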

I don't mean to pick on Trainer in particular (there are similar issues elsewhere), and I hope I don't sound too critical of anyone else who has spent long hours working on this project. It's an extremely useful piece of code and I don't know how I'd get anything done without it. But I think there's always room for improvement.

srush commented 6 years ago

Hey @bpopeters, thanks, these are helpful and fair criticisms. Also cc'ing @vince62s.

I think the training code has changed recently to support new features. One source of tension is supporting advanced features like multi-gpu. However, I am all for simplifying the code and enforcing better standards. Maybe we can think of some direct interventions to simplify and document the main pathways of the code.

bpopeters commented 6 years ago

I agree that it's important to support multi-gpu, but the issues, as far as I can see, are sort of orthogonal to that... I think, anyway. With the changes I've made in my own branch, I don't think I've broken anything with multi-gpu (although I can't tell for sure, because there's only been one GPU free for me this week).

vince62s commented 6 years ago

Hi @bpopeters, cc'ing @pltrdy too. As Sasha says, this is helpful and we would love to see better code. There are many different things in what you raise. First, don't forget that the current code carries a lot of history from many people, sometimes from times when some features were not yet implemented in pytorch / torchtext. So yes, we may have created classes, objects, and functions that now exist in pytorch directly. Many contributors also means it's not easy to keep the code very clean.

My main contribution was a refactor trying to make the code clearer and to support multi-gpu, which was a big leap forward in my view. @pltrdy also contributed and added some new things that better abstract stats, reporting, and so on. I understand it's a bit cumbersome to read when you come to it after the fact, but from a software-design standpoint I really think it's a plus.

Having said that, I fully understand your point of view. I'm more of an enterprise user, Paul is a researcher, and we were trying to bridge both. If you have specific contributions that make things clearer, feel free; I think we all want this project to get better every day.

BTW, I have been trying to clean up hundreds of issues; most of them are now old feature requests that nobody is really willing to write a PR for, so I'm unsure what to do with all of these.

Cheers.

vince62s commented 6 years ago

Ben, I have not looked at your trainer.py code, but multi-gpu is VERY tricky. The main point is that each process has to load batches strided by the number of GPUs, and you have to make sure that every process ends at the same time, i.e. processes the exact same number of batches (or pads if needed); otherwise one process may still be hanging and wait forever.

I had a very hard time debugging it at the beginning, so you would really need to make sure it works with several GPUs.
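A minimal sketch of the striding and the hang risk described above, in plain Python (the names are illustrative, not OpenNMT-py's):

```python
# Illustrative sketch: each of n_gpu processes takes every n_gpu-th batch.
import itertools

def batches_for_rank(batch_iter, rank, n_gpu):
    """Each process takes every n_gpu-th batch, offset by its rank."""
    return itertools.islice(batch_iter, rank, None, n_gpu)

# With 10 batches and 3 GPUs the ranks own 4 / 3 / 3 batches.  Rank 0
# would reach a fourth gradient all-reduce that ranks 1 and 2 never join,
# and hang forever; so every rank must run the same number of steps.
n_batches, n_gpu = 10, 3
counts = [len(range(rank, n_batches, n_gpu)) for rank in range(n_gpu)]
steps = min(counts)   # either truncate to this, or pad the short ranks
```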

pltrdy commented 6 years ago

Interesting discussion. I personally find it pretty hard to keep something "simple" at each level of abstraction. Implementing a feature as simply as possible may mean over-assuming how people will use it, and/or restricting the possibility of extending it. To solve that, one may add an abstraction level and/or parameters, which eventually makes the code "less simple".

For that matter, I tend to use classes, e.g. to move model saving and trainer reporting out of the trainer itself and into instances of new classes. As you said, it's probably a question of working habits. The objective was the same: making it clearer, etc., but that's a bit subjective. My point is that it may seem "over-engineered" before it's used: it's intended to be easy to extend, but there's no implementation other than the default one. If you argue that it does not, in fact, even help, then it should indeed be removed.
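As an illustration of the trade-off (the class names below are invented, not OpenNMT-py's actual classes): pulling reporting out of the trainer adds a level of indirection, but lets a second implementation slot in without touching the training loop.

```python
# Invented names sketching the "pull reporting out of the trainer" idea.
class ConsoleReporter:
    def report(self, step, stats):
        print(f"step {step}: {stats}")

class TensorboardReporter:
    def __init__(self, writer):  # e.g. torch.utils.tensorboard.SummaryWriter
        self.writer = writer

    def report(self, step, stats):
        for name, value in stats.items():
            self.writer.add_scalar(name, value, step)

# The trainer depends only on the one-method interface, so a second
# reporter slots in with no change to the training loop itself.
```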

An example I faced was during the translation server implementation. At that time the translation process expected a filename as input and wrote to an output file. That's typically what I consider "over-assuming". I made it possible to use any text iterator as input and to return a list of predictions. I added things that probably don't make the code clearer or simpler to read, but they made it easier to use as a library. It's hard to tell how "good" this change really is; it's both "helpful" and "unnecessarily complicated" depending on your point of view.
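A hedged sketch of that refactor, assuming a stand-in `translate_line` function for the real model (neither name is an OpenNMT-py function):

```python
# Sketch: accept any iterable of strings instead of a hard-wired filename.
from typing import Callable, Iterable, List

def translate(texts: Iterable[str],
              translate_line: Callable[[str], str]) -> List[str]:
    """Accept anything that yields strings; return predictions as a list."""
    return [translate_line(line.rstrip("\n")) for line in texts]

# File-based usage becomes just one caller among many:
#   with open("input.txt") as f:
#       predictions = translate(f, model_fn)
# while a server can pass request lines directly:
#   predictions = translate(request_lines, model_fn)
```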

bpopeters commented 6 years ago

@vince62s Regarding multigpu, I don't doubt that it's tough and I still need to check it.

@pltrdy This is true: there's always a tension between simplicity and flexibility. But I am often skeptical when the presented solution is to define a separate class that follows the NounVerber template. There are other (and I would argue better) ways to make code easier to extend. I've been trying to extend the code and have found that there are certain dimensions along which it really does not want to be extended.

pltrdy commented 6 years ago

Makes sense.

vince62s commented 6 years ago

I had a very quick look at both your master and the "new trainer" branch. As long as it gives the same results and works in multi-gpu, I am very fine with it. Maybe, as @srush asked me to do when I split my work into several PRs to make things clearer (it was a pain but necessary), it might help understanding of the logic if you do it in steps.

vince62s commented 6 years ago

Closing this. Whenever you're ready, just submit your PR. Thanks.

bpopeters commented 6 years ago

The issue wasn't really about a particular PR, although thank you for reminding me of the trainer work I had sitting around. I finally have enough GPUs to test it.

vince62s commented 6 years ago

Sorry, I didn't mean one particular PR, but if we want to move in the direction you suggest (and I agree with all of it), we need to discuss it PR by PR. BTW, not a big one, but I will merge #938.

Henry-E commented 6 years ago

The other issue that I've run into a lot in my own research is the disconnect between the copy attention code and the rest of the modelling code. Harmonising the two so that features work more easily across both would be great, particularly since new features like multi-gpu training aren't available with copy networks.

vince62s commented 6 years ago

I agree. One step is described in #862 (see my last comment), but there is more to do, I think.