USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems

comparing performance local/cloud/Tallgrass #10

Closed jsadler2 closed 4 years ago

jsadler2 commented 4 years ago

I want to compare the performance of model training on my local machine, the cloud (Pangeo), and Tallgrass.
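
The elapsed times below presumably come from wrapping model.fit() in a simple wall-clock timer. A minimal, self-contained sketch of that kind of measurement (the data shapes, layer sizes, and model here are placeholders, not the actual river-dl configuration):

```python
from datetime import datetime

import numpy as np
import tensorflow as tf

# Dummy stand-ins for the real river-dl inputs; shapes and sizes are illustrative only.
x_trn = np.random.rand(1008, 365, 10).astype("float32")
y_trn = np.random.rand(1008, 365, 1).astype("float32")

# A small sequence model, just so there is something to time.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(365, 10)),
    tf.keras.layers.LSTM(20, return_sequences=True),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Wall-clock timing around fit(); the "elapsed time" lines in the logs below
# come from this kind of measurement.
start = datetime.now()
model.fit(x_trn, y_trn, epochs=10)
print("elapsed time: ", datetime.now() - start)
```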

jsadler2 commented 4 years ago

For cloud (10 epochs): 4:18

Epoch 1/10
1008/1008 [==============================] - 142s 141ms/sample - loss: 0.9307
Epoch 2/10
1008/1008 [==============================] - 6s 5ms/sample - loss: 0.5909
Epoch 3/10
1008/1008 [==============================] - 5s 5ms/sample - loss: 0.5165
Epoch 4/10
1008/1008 [==============================] - 6s 5ms/sample - loss: 0.4578
Epoch 5/10
1008/1008 [==============================] - 6s 5ms/sample - loss: 0.3343
Epoch 6/10
1008/1008 [==============================] - 5s 5ms/sample - loss: 0.1991
Epoch 7/10
1008/1008 [==============================] - 6s 5ms/sample - loss: 0.1476
Epoch 8/10
1008/1008 [==============================] - 6s 5ms/sample - loss: 0.1223
Epoch 9/10
1008/1008 [==============================] - 6s 5ms/sample - loss: 0.1075
Epoch 10/10
1008/1008 [==============================] - 6s 5ms/sample - loss: 0.0972
elapsed time:  0:04:18.985201
jsadler2 commented 4 years ago

Local: 13:04

Epoch 1/10
1008/1008 [==============================] - 113s 112ms/sample - loss: 0.9144
Epoch 2/10
1008/1008 [==============================] - 21s 21ms/sample - loss: 0.5850
Epoch 3/10
1008/1008 [==============================] - 29s 29ms/sample - loss: 0.5144
Epoch 4/10
1008/1008 [==============================] - 39s 39ms/sample - loss: 0.4204
Epoch 5/10
1008/1008 [==============================] - 52s 52ms/sample - loss: 0.2748
Epoch 6/10
1008/1008 [==============================] - 72s 71ms/sample - loss: 0.1854
Epoch 7/10
1008/1008 [==============================] - 81s 81ms/sample - loss: 0.1470
Epoch 8/10
1008/1008 [==============================] - 90s 89ms/sample - loss: 0.1249
Epoch 9/10
1008/1008 [==============================] - 104s 103ms/sample - loss: 0.1102
Epoch 10/10
1008/1008 [==============================] - 120s 119ms/sample - loss: 0.1001
elapsed time: 0:13:04.635053
jsadler2 commented 4 years ago

A couple of things to note:

  1. The first epoch of each run is a lot longer than the later epochs (the last local epoch is the one exception). I think this is because the software first has to build the computational graph.
  2. The first local epoch is actually a bit faster than the first cloud epoch. Not sure why that is. Maybe the graph is built on the CPU, and my local CPU is beefier than the cloud one?
  3. Locally, the epochs get gradually slower, until epoch 10 takes about 5x as long as epoch 2 (the timing-callback sketch after this list would make that easy to log directly).
  4. In contrast, on the cloud, epochs 2-10 all take pretty much the same time. I bet this has something to do with the GPU.
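
To put numbers on (3) without reading them off the progress bar, a small Keras callback could log each epoch's wall-clock time. A sketch (this is not the code that produced the logs above):

```python
import time

import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Print the wall-clock duration of each epoch so slowdowns are easy to spot."""

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch + 1}: {time.time() - self._start:.1f} s")

# Usage with the same model/data as any of the runs above:
# model.fit(x_trn, y_trn, epochs=10, callbacks=[EpochTimer()])
```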
aappling-usgs commented 4 years ago
  1. Yeah, or memory runs out locally?
jsadler2 commented 4 years ago

> Yeah, or memory runs out locally?

Mmm. I should monitor this. My local machine has more total memory, but a lot of that is eaten up by Windows and my browser.

But what is being stored? Why would it accumulate like that?
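
One way to monitor it would be a callback that prints the training process's resident memory after each epoch, e.g. with psutil (a hypothetical sketch, not code from this repo):

```python
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Print the training process's resident memory after every epoch."""

    def on_epoch_end(self, epoch, logs=None):
        rss_mb = psutil.Process().memory_info().rss / 1e6
        print(f"epoch {epoch + 1}: {rss_mb:.0f} MB resident")

# model.fit(x_trn, y_trn, epochs=10, callbacks=[MemoryLogger()])
```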

aappling-usgs commented 4 years ago

Good point - if it is accumulating, it's probably because the code isn't as it should be. I shared my experience last week of realizing I'd been creating new tensors within the training loop (each batch), which caused me to see similar problems. Are you doing anything like that, or even just appending predictions or similar to an array, in the training loop? It does seem that monitoring memory use locally could give you a clue about whether this is even a candidate explanation.
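
For concreteness, the kind of hand-written batch loop being described might look like this (a hypothetical sketch, not river-dl code): the first loop keeps a reference to every per-batch tensor, the second reduces to a running metric so those tensors can be freed.

```python
import numpy as np
import tensorflow as tf

# Anti-pattern: holding on to a new tensor every batch, e.g. by appending
# per-batch losses/predictions to a Python list. Each reference keeps that
# batch's tensor alive, so memory use grows over the course of training.
losses = []
for step in range(1000):
    x = tf.constant(np.random.rand(32, 10), dtype=tf.float32)
    losses.append(tf.reduce_mean(x))  # 1000 live tensors by the end

# Lighter-weight alternative: reduce to a running Keras metric (or plain
# Python floats) so the per-batch tensors can be garbage collected.
mean_loss = tf.keras.metrics.Mean()
for step in range(1000):
    x = tf.constant(np.random.rand(32, 10), dtype=tf.float32)
    mean_loss.update_state(tf.reduce_mean(x))
print(float(mean_loss.result()))
```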

aappling-usgs commented 4 years ago

(And yeah, I just looked at your code again and have no idea how you could be making that mistake given the constraints of the tf2 interface - it's just .compile() and .fit()!)

jsadler2 commented 4 years ago

Yeah. TF2 makes it pretty slick, and hard to mess up!

jsadler2 commented 4 years ago

I believe the above times were for just the subset of the DRB. Here are some numbers from Tallgrass for the entire DRB:

Train on 10944 samples
Epoch 1/200
10944/10944 [==============================] - 140s 13ms/sample - loss: 0.9918
Epoch 2/200
10944/10944 [==============================] - 5s 477us/sample - loss: 0.9241
Epoch 3/200
10944/10944 [==============================] - 5s 470us/sample - loss: 0.9068
Epoch 4/200
10944/10944 [==============================] - 5s 473us/sample - loss: 0.8874
Epoch 5/200
10944/10944 [==============================] - 5s 470us/sample - loss: 0.8827
Epoch 6/200
10944/10944 [==============================] - 5s 471us/sample - loss: 0.8784
Epoch 7/200
10944/10944 [==============================] - 5s 470us/sample - loss: 0.8817
Epoch 8/200
10944/10944 [==============================] - 5s 468us/sample - loss: 0.9051
Epoch 9/200
10944/10944 [==============================] - 5s 470us/sample - loss: 0.9173
Epoch 10/200
10944/10944 [==============================] - 5s 469us/sample - loss: 0.8868

So per sample this is much, much faster than local, and even quite a bit faster than the Pangeo instance. Though I don't remember seeing that big a difference before, so I'd like to go back and do a direct comparison.
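
Reading the steady-state (post-first-epoch) times off the logs above, the rough per-sample comparison is (back-of-the-envelope, not a controlled benchmark):

```python
# Rough steady-state per-sample times, read off the logs above.
runs = {
    "local, DRB subset (epoch 2)": 21 / 1008,   # ~21 ms/sample, and growing each epoch
    "cloud (Pangeo), DRB subset":   6 / 1008,   # ~5-6 ms/sample
    "Tallgrass, full DRB":          5 / 10944,  # ~0.5 ms/sample
}
for name, sec_per_sample in runs.items():
    print(f"{name}: {sec_per_sample * 1e3:.2f} ms/sample")
```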