NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License
4.12k stars 1.38k forks

The n best-performing epochs should be displayed somewhere during the training session #71

Open VrUnRealEngine4 opened 9 years ago

VrUnRealEngine4 commented 9 years ago

After a long run of several hundred epochs, it is often difficult to eyeball the epoch performance graph to determine when and where the best epoch occurred.

I believe the graph should at least mark the epoch with the best accuracy. Additionally, it would be nice for the epoch list box to show each epoch's associated performance.

VrUnRealEngine4 commented 9 years ago

Here is a great reason why we all need access to the accuracy of every epoch somewhere on the TESTING page. Today during my training I was able to eyeball an epoch, 210, which had an accuracy of 99.1%. At the end of the training, after 400 epochs, I wanted to eyeball the graphs for an epoch with hopefully better accuracy; I discovered to my shock that epoch 210 was no longer part of the performance plot being displayed; the epochs being displayed were 204, 208, 212, 216, and so on.

Granted, at the end of the training there were many epochs with accuracies of 98.98 being displayed. However, we all know that training is not deterministic, no matter how sophisticated and robust the gradient descent algorithm being used.

Most people are looking for those outlier paths that are often lost during training. At a later time we can go back and fine-tune our training parameters to focus on the region where the outlier was found, in hopes of finding better solutions.

It is entirely by luck that I happened to check the performance of the system when I did, allowing me to see epoch 210. Which makes me wonder whether there were any better-performing epochs between 210 and 400 that are not being displayed by the performance graph. After doing a grep (see below) on the log files, I could not find any, based on my current understanding of the log files.

Just for the record, I went into the .digits job directory for the particular job and ran

 > grep accuracy caffe_output.log | grep 99

and found

  I0416 08:02:26.351547 14825 solver.cpp:315]     Test net output #0: accuracy = 0.990986

However, I have no idea how to associate this with any of the epochs I see listed...
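To automate that log search, a rough sketch (a hypothetical helper, not part of DIGITS) could pair each test-accuracy line with the preceding "Testing net" iteration line and convert iterations to epochs. The iterations-per-epoch value is an assumption you would compute from your own dataset size and batch size:

```python
import re

# Assumption: the solver logs "Iteration N, Testing net" just before
# each "Test net output #0: accuracy = X" line.
ITERS_PER_EPOCH = 100  # assumption: train_size / batch_size for your job

iter_re = re.compile(r"Iteration (\d+), Testing net")
acc_re = re.compile(r"Test net output #0: accuracy = ([\d.]+)")

def best_epochs(log_lines, top_n=5):
    """Return the top_n (epoch, accuracy) pairs found in the log."""
    results = []
    current_iter = None
    for line in log_lines:
        m = iter_re.search(line)
        if m:
            current_iter = int(m.group(1))
            continue
        m = acc_re.search(line)
        if m and current_iter is not None:
            epoch = current_iter / ITERS_PER_EPOCH
            results.append((epoch, float(m.group(1))))
    # Sort by accuracy, best first
    return sorted(results, key=lambda t: t[1], reverse=True)[:top_n]
```

Run over the lines of caffe_output.log, this would turn the grep result above into a ranked list of epochs instead of raw log lines.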

lukeyeager commented 9 years ago

> I discovered to my shock that epoch 210 was no longer a part of that performance plot being displayed

Good point. I'm intentionally [clipping the graph data](https://github.com/NVIDIA/DIGITS/blob/7918631c7946a9dc637d27af1aa32ddda281a6af/digits/model/tasks/train.py#L435-L448) for performance reasons (see the "stride" variable).
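The effect of that stride can be illustrated with a toy sketch (this is not DIGITS's actual code): subsampling keeps the plot cheap to render, but any epoch not on the stride, including the best one, silently disappears from the graph:

```python
def subsample(points, max_points=100):
    """Keep roughly max_points entries by taking every stride-th point."""
    stride = max(1, len(points) // max_points)
    return points[::stride]

epochs = list(range(1, 401))            # 400 recorded epochs
kept = subsample(epochs, max_points=100)
print(210 in kept)                       # False: stride is 4, only 1, 5, 9, ... survive
```

This is exactly the situation described above, where epoch 210 vanished while 204, 208, 212, ... remained visible.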

You can still see all of the snapshots in the dropdown list. It would be trivial to add some text there that says "Epoch #30 (99% accuracy)" instead of just "Epoch #30". The problem is that the snapshot interval and the validation interval don't have to be the same. I guess I could interpolate between the nearest accuracies in that case, but that's not technically correct.
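The interpolation idea mentioned above could look like the following sketch (a hypothetical helper, with the caveat already stated: the result is an estimate, not the snapshot's true accuracy):

```python
def interpolate_accuracy(snapshot_epoch, val_points):
    """Estimate accuracy at snapshot_epoch by linear interpolation.

    val_points: sorted list of (epoch, accuracy) pairs from validation runs.
    Returns None if the snapshot falls outside the validated range.
    """
    for (e0, a0), (e1, a1) in zip(val_points, val_points[1:]):
        if e0 <= snapshot_epoch <= e1:
            t = (snapshot_epoch - e0) / (e1 - e0)
            return a0 + t * (a1 - a0)
    return None

# A snapshot at epoch 210 between validation runs at 208 and 212:
print(interpolate_accuracy(210, [(208, 0.96), (212, 0.98)]))  # ≈ 0.97
```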

There's a larger issue here, though. Just because the model at epoch #210 has the highest accuracy doesn't mean it's actually the best model. If you look at the per-category accuracy instead of the overall accuracy, you can see that the model [in some cases] continues to find a better fit to the data as training goes on, even if the overall accuracy goes down a bit. It's really hard to figure out what the "best model" is, and just looking at the validation accuracy isn't really enough to tell you.

That being said, we still do need a way to show the whole set of accuracy data. Maybe just a "View big graph" button for now.

VrUnRealEngine4 commented 9 years ago

Yes, I would tend to agree with you in general, so I just went back and checked the numbers and the graphs. It would seem that for this particular case everything was optimal: better than or equal to everything that followed, assuming I am reading the logs and the graphs correctly. That is a fundamental flaw with these learning algorithms: you can get kicked out of a local minimum before fully exploring the surrounding space. That is part of why I like combining evolutionary algorithms with gradient descent.

VrUnRealEngine4 commented 9 years ago

That is also what I am thinking. I still need time to understand how DIGITS/Caffe works; I will spend my weekend doing that. Hopefully I will understand enough to add these new and desirable features if you do not have the bandwidth to do so in the relatively near future.

DIGITS/Caffe has turned out to be an incredibly useful tool for me. It would be a shame if I did not try to help make it a production/enterprise-quality system.


lukeyeager commented 9 years ago

> That being said, we still do need a way to show the whole set of accuracy data. Maybe just a "View big graph" button for now.

This is done now.

caffeTao commented 9 years ago

I think DIGITS should show the max accuracy in the figure, and also the train accuracy.

lukeyeager commented 9 years ago

@caffeTao I agree, it would be nice to see the best-performing snapshot posted somewhere. I just haven't been able to make it a high enough priority to get around to it yet.

> also the train accuracy

Training accuracy is not a particularly helpful metric. But you can always add it to your model by removing these lines from your Accuracy layer:

  include {
    phase: TEST
  }

[screenshot: training-accuracy]