bartvm / nmt

Neural machine translation
MIT License

Model save with iteration numbers #52

Closed: JinseokNam closed this issue 8 years ago

JinseokNam commented 8 years ago

For model selection, we need to keep multiple saved models from different iteration numbers.

bartvm commented 8 years ago

Why do we need this for model selection? The model is only saved if the validation cost is a new best, so the model saved is already the best one.

In general I'd be a bit worried about saving all these models; we'll quickly end up filling the file system with hundreds of them.

JinseokNam commented 8 years ago

Although the negative log-likelihood on the validation set is a good indicator for model selection, the model we actually want to keep is the one that performs best in terms of evaluation metrics such as BLEU.

Due to the computational cost of translating the source sentences (3000 of them) into the target language with beam search, we can't compute such metrics during training; otherwise training would have to be paused until the evaluation completes, which may take more than 30 minutes on a single GPU.

This change might be less useful if we are willing to dedicate additional GPUs to translating the validation sentences, but it is still necessary if one only has a single GPU.

As you pointed out, this will eat up disk space quite quickly: we would need several GB of free space per experimental configuration, depending on the save interval. I guess this shouldn't exceed 1 TB if we save models every 1-2 hours.

bartvm commented 8 years ago

1TB is way too much I'm afraid, we actually just got an e-mail again about the file system being full. We get these on a weekly basis. :confused:

I believe the way we used to do this is by saving the parameters to disk and then computing the BLEU score on CPU, and using that for early stopping. It's slightly harder to implement, but it saves a lot of disk space, doesn't cost valuable GPU cycles, and lets us do early stopping properly (otherwise we might stop too early because we are overfitting according to NLL while the BLEU score is still improving).
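
A minimal sketch of that workflow, assuming a stand-alone `translate.py` beam-search script and the Moses `multi-bleu.perl` scorer; the script name, its flags, and the paths here are placeholders, not actual code from this repo:

```python
import os
import subprocess

import numpy


def validate_bleu(params, iteration, valid_src, valid_ref,
                  work_dir='/Tmp/nmt_validation'):
    """Dump the current parameters, translate the validation set on CPU,
    and return the BLEU score to drive early stopping."""
    if not os.path.exists(work_dir):
        os.makedirs(work_dir)
    param_path = os.path.join(work_dir, 'params_%d.npz' % iteration)
    numpy.savez(param_path, **params)

    hyp_path = os.path.join(work_dir, 'hyp_%d.txt' % iteration)
    # Beam search runs on CPU so the GPU keeps training; in practice this
    # call would be launched asynchronously rather than blocking here.
    subprocess.check_call(
        ['python', 'translate.py', '--beam-size', '12',
         param_path, valid_src, hyp_path],
        env=dict(os.environ, THEANO_FLAGS='device=cpu'))

    # multi-bleu.perl prints a line like "BLEU = 23.45, ..." on stdout.
    with open(hyp_path) as hyp:
        out = subprocess.check_output(
            ['perl', 'multi-bleu.perl', valid_ref], stdin=hyp).decode()
    return float(out.split()[2].rstrip(','))
```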

JinseokNam commented 8 years ago

Resource shortage is always a problem. :disappointed:

The size of a single parameter set with the current model configuration is around 400 MB. If an NMT model runs for 5 days and exports its parameters every 2 hours, we end up with 60 snapshots totalling about 24 GB.

I've added an option that lets users choose whether to overwrite previously saved models on disk.
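
For illustration, a minimal sketch of what such an option could look like; the function and argument names are hypothetical, not the ones in the actual patch:

```python
import numpy


def save_params(params, iteration, path='model.npz', overwrite=True):
    """Save parameters either to a single, repeatedly overwritten file,
    or to a fresh iteration-numbered file per save."""
    if overwrite:
        target = path
    else:
        target = path.replace('.npz', '.iter%d.npz' % iteration)
    numpy.savez(target, **params)
    return target
```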

Please keep in mind that model evaluation takes several hours even with multi-threaded translation code (I'll check the timing with 8 threads), which means we have to wait several hours just to determine whether the current model beats the previous best in terms of BLEU.

lamblin commented 8 years ago

Temporarily storing data on the local filesystem (/Tmp/... for instance) should be OK, even if the shared filesystems are limited.

bartvm commented 8 years ago

So the Blocks implementation of NMT was developed while I was away, but it performs early stopping on BLEU. To limit the number of evaluations done it seems to use a "burn-in" period but otherwise it seems to simply calculate the BLEU score on the GPU. It seems strange that they would do this if the cost is so prohibitive, so I asked @orhanf about it, but he hasn't gotten back to me yet.

Is there perhaps a difference in settings that would result in different computation times? The Blocks code seems to use maxlen=3 * len(target_sentence) with a beam size of 12 every 5000 iterations. Is that similar to what you have tried, @JinseokNam?

Instead of saving every single model we could at least save only the best K models according to NLL and evaluate just those for BLEU at the end (a rough sketch below), but that doesn't solve the issue of stopping too early, I guess.
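
Roughly something along these lines; this is a hypothetical sketch, not existing code:

```python
import heapq
import os


class BestKCheckpoints(object):
    """Keep only the K checkpoints with the lowest validation NLL on disk."""

    def __init__(self, k=5):
        self.k = k
        self.heap = []  # entries are (-nll, path); the worst NLL sits at heap[0]

    def update(self, nll, path):
        entry = (-nll, path)
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, entry)
        elif entry > self.heap[0]:  # strictly lower NLL than the current worst
            _, evicted = heapq.heapreplace(self.heap, entry)
            os.remove(evicted)  # drop the displaced checkpoint from disk
        else:
            os.remove(path)  # not good enough, discard it immediately

    def paths(self):
        # Best (lowest NLL) first; these are the candidates for BLEU scoring.
        return [p for _, p in sorted(self.heap, reverse=True)]
```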

bartvm commented 8 years ago

@orhanf just got back to me and told me that although the code in blocks-examples does early stopping on BLEU, he didn't actually use it himself because it's too slow. Instead he used to save the model to disk and use a second GPU to calculate the BLEU scores in parallel, and then do early stopping based on that.

In his current codebase he uses the CPU to calculate the BLEU scores, though. He says that with 40-50 cores (he's at IBM right now) it doesn't take more than 10 minutes. The Keplers only have 16 cores, but they're pretty decent Haswell ones if I remember correctly, so it would be good to know how long it takes when using (almost) all of them, to see whether this is feasible for us as well.

JinseokNam commented 8 years ago

I ran the translation code taken from Cho on 3000 sentences with 4 cores (hyper-threaded i7-4790, my personal machine). It took 40 minutes with a beam size of 5, and with a beam size of 20 it takes longer, but not proportionally so: 90 minutes. Although I didn't compute BLEU on the translations, the bigger the beam size, the better the translations we would expect.

If the validation interval is set to 120 minutes or longer, we don't need to save iteration-numbered models just to compute BLEU.

bartvm commented 8 years ago

I'm assuming you mean MKL used 4 threads bound to the 4 physical cores; hyper-threading doesn't make much sense in this case, and MKL's default behaviour is to limit the number of threads to the number of physical cores in order to avoid unnecessarily switching between logical cores that use the same underlying physical core.
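
For reference, the usual way to pin MKL/OpenMP to the physical cores is to set the thread-count environment variables before NumPy/Theano load MKL; the value 4 here just matches the i7-4790 mentioned above:

```python
import os

# Must be set before numpy/Theano are imported, otherwise MKL has already
# picked its own thread count.
os.environ.setdefault('MKL_NUM_THREADS', '4')
os.environ.setdefault('OMP_NUM_THREADS', '4')

import numpy  # noqa: E402  (imported after the thread settings on purpose)
```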

Either way, that's not too bad! The Keplers are probably slightly slower, but they have twice as many cores, and MKL scales quite nicely. Moreover, 20 is a pretty big beam; I've mostly seen 10-15 used (e.g. in Sebastien's paper it's 12), so I think we might get away with validating once an hour, which is pretty okay!

JinseokNam commented 8 years ago

Instead of saving intermediate models into the disk, we can evaluate the generalization performance of intermediate models on the fly via #58.