PPPLDeepLearning / plasma-python

PPPL deep learning disruption prediction package
http://tigress-web.princeton.edu/~alexeys/docs-web/html/

Directly transmit predictions in parallel instead of implicitly pickling them #71

Closed · iMurfyD closed 2 years ago

iMurfyD commented 2 years ago

mpi4py's lowercase methods implicitly pickle all data before communicating over MPI. This patch uses the uppercase mpi4py methods, which pass buffers to MPI directly with no modification to the data, to transmit the predictions and overcome a memory problem when creating multiple predictions of size >1. The buffer-based methods expect contiguous 1D buffers, so the multidimensional prediction numpy arrays need to be flat-packed before shipping.

If it fits it ships. One low rate, anywhere in the nation.
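For reference, a minimal sketch of the difference (not the FRNN code itself; the shape and dtype here are hypothetical stand-ins for the prediction arrays). Lowercase `send` pickles the whole object, while uppercase `Send` hands a flattened buffer straight to MPI:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical prediction shape, standing in for FRNN's per-shot predictions.
shape = (128, 2)

if rank != 0:
    predictions = np.random.rand(*shape)

    # Lowercase send: mpi4py pickles the object first, duplicating
    # it in memory before transmission.
    # comm.send(predictions, dest=0, tag=rank)

    # Uppercase Send: no pickling. Flat-pack the 2D array into a
    # contiguous 1D buffer and transmit it directly.
    flat = np.ascontiguousarray(predictions, dtype=np.float64).ravel()
    comm.Send([flat, MPI.DOUBLE], dest=0, tag=rank)
else:
    buf = np.empty(int(np.prod(shape)), dtype=np.float64)
    for source in range(1, comm.Get_size()):
        comm.Recv([buf, MPI.DOUBLE], source=source, tag=source)
        predictions = buf.reshape(shape)  # unpack on the receiving side
```

Run with e.g. `mpirun -n 4 python sketch.py`; the receiver reshapes each flat buffer back to the original dimensions.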

iMurfyD commented 2 years ago

Tested the latest commit with 4 cores for 4 hours on Traverse. Nothing obviously breaking.

iMurfyD commented 2 years ago

A previous run of FRNN before this patch yielded the following ROC curve:

epoch 60, test_roc_30 = 0.6952110289965381 test_roc_70 = 0.6622922242137053 test_roc_200 = 0.6426808907657934 test_roc_500 = 0.6200066291831373 test_roc_1000 = 0.5812320459623364
=========Summary======== for epoch 59.64
Training Loss numpy: 3.015e-01
Validation Loss: 4.357e-01
Validation ROC: 0.8893
========================

Using the same conf.yaml and training for about the same amount of time, with the patch, FRNN yields this ROC information:

epoch 74, test_roc_30 = 0.688759851702718 test_roc_70 = 0.6834196763976528 test_roc_200 = 0.6697439171106583 test_roc_500 = 0.6425826806452404 test_roc_1000 = 0.6132853740577966
=========Summary======== for epoch 73.84
Training Loss numpy: 3.113e-01
Validation Loss: 4.549e-01
Validation ROC: 0.8949
========================
No improvement, saving model weights anyways
Finished evaluation of epoch 73.84/1000
Begin training from epoch 73.84/1000
Compilation finished in 0.57s

Further up in the log, at an epoch closer to that of the pre-patch run:

epoch 55, test_roc_30 = 0.7432142944830465 test_roc_70 = 0.6997747305359818 test_roc_200 = 0.6629397971961011 test_roc_500 = 0.6289897861474625 test_roc_1000 = 0.5856177416582778
=========Summary======== for epoch 54.91
Training Loss numpy: 3.155e-01
Validation Loss: 4.694e-01
Validation ROC: 0.9001
========================
Finished evaluation of epoch 54.91/1000
Begin training from epoch 54.91/1000
Compilation finished in 0.57s

These ROC characteristics match closely, which suggests that this patch did not affect training.

felker commented 2 years ago

💯 🔥 🚀 LGTM