carbonscott / maxie

Masked Autoencoder for X-ray Image Encoding (MAXIE)

Support monitoring the training with debugging codes #16

Open carbonscott opened 1 week ago

carbonscott commented 1 week ago

Monitor the following:

carbonscott commented 1 week ago

Added maxie/utils/debug.py
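
The contents of debug.py aren't shown in this thread; a minimal sketch of the kind of metric it could provide is below (the per-parameter update ratio in log10 scale, plotted per block in the next comment). The function name and the use of `lr * grad` as the update size are assumptions, and the latter is only an approximation of the true step taken by adaptive optimizers such as AdamW.

```python
import math
import torch

@torch.no_grad()
def log10_param_update_ratio(model, lr):
    """Return log10(|lr * grad| / |param|) for each named parameter.

    Hypothetical helper illustrating the monitored metric; the real
    maxie/utils/debug.py may differ. Call it after loss.backward() and
    before the optimizer clears gradients. Note lr * grad approximates
    the update size and ignores Adam-style preconditioning.
    """
    ratios = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        update_norm = lr * p.grad.norm().item()
        param_norm = p.norm().item()
        ratios[name] = math.log10(update_norm / (param_norm + 1e-12) + 1e-12)
    return ratios
```

Grouping the returned names by transformer block index (e.g. the integer in a `blocks.<i>.` prefix, if that is the naming scheme) would give the per-block curves shown in the plot below.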

carbonscott commented 1 week ago

Check out an example of the training dynamics in maxie/train/analysis/parse_log.ipynb.

The percentage parameter update (log10 scale) for every transformer block is shown below. It indicates I should bump up the base learning rate, and the warm-up should probably take about 200-300 iterations. In addition, some layers have a much faster parameter update rate than others (they learn faster).

[Screenshot (2024-06-28): percentage parameter update per transformer block, log10 scale, over training iterations]
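
For reference, one common way to express the warm-up suggested above is a linear ramp followed by cosine decay; the sketch below uses a ~250-step warm-up to reflect the 200-300 iteration estimate, but the function name, decay shape, and all values are assumptions, not maxie's actual scheduler.

```python
import math

def lr_with_warmup(step, base_lr, warmup_steps=250, total_steps=100_000, min_lr_ratio=0.1):
    """Linear warm-up followed by cosine decay to min_lr_ratio * base_lr.

    Hypothetical values: warmup_steps=250 reflects the 200-300 iteration
    estimate in the comment above; maxie's scheduler may differ.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return base_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)
```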
carbonscott commented 1 week ago

Monitoring training dynamics will significantly impact the MFU (it should otherwise reach about 0.14 at the high end).

rank=0 | logevent=LOSS:TRAIN | iteration=2001 | segment=0-48 | learning_rate=0.00014982407391299707 | grad_norm=0.027345 | mean_train_loss=0.118111 | tokens_per_sec=4.4e+05 | mfu_per_iteration=0.092 | grad_nosync_counter=2
rank=0 | logevent=LOSS:TRAIN | iteration=2002 | segment=0-48 | learning_rate=0.00014982407391299707 | grad_norm=0.026358 | mean_train_loss=0.132663 | tokens_per_sec=5.6e+05 | mfu_per_iteration=0.116 | grad_nosync_counter=2
rank=0 | logevent=LOSS:TRAIN | iteration=2003 | segment=48-96 | learning_rate=0.00014982407391299707 | grad_norm=0.039368 | mean_train_loss=0.122480 | tokens_per_sec=4.5e+05 | mfu_per_iteration=0.093 | grad_nosync_counter=2
rank=0 | logevent=LOSS:TRAIN | iteration=2004 | segment=48-96 | learning_rate=0.00014982407391299707 | grad_norm=0.022546 | mean_train_loss=0.126299 | tokens_per_sec=5.8e+05 | mfu_per_iteration=0.120 | grad_nosync_counter=2
rank=0 | logevent=LOSS:TRAIN | iteration=2005 | segment=96-144 | learning_rate=0.00014982407391299707 | grad_norm=0.080875 | mean_train_loss=0.141968 | tokens_per_sec=4.1e+05 | mfu_per_iteration=0.085 | grad_nosync_counter=2
rank=0 | logevent=LOSS:TRAIN | iteration=2006 | segment=96-144 | learning_rate=0.00014982407391299707 | grad_norm=0.046236 | mean_train_loss=0.144224 | tokens_per_sec=4.9e+05 | mfu_per_iteration=0.103 | grad_nosync_counter=2
rank=0 | logevent=LOSS:TRAIN | iteration=2007 | segment=144-192 | learning_rate=0.00014982407391299707 | grad_norm=0.036416 | mean_train_loss=0.152639 | tokens_per_sec=3.3e+05 | mfu_per_iteration=0.068 | grad_nosync_counter=2
rank=0 | logevent=LOSS:TRAIN | iteration=2008 | segment=144-192 | learning_rate=0.00014982265285473525 | grad_norm=0.049307 | mean_train_loss=0.136679 | tokens_per_sec=5.9e+05 | mfu_per_iteration=0.123 | grad_nosync_counter=2
rank=0 | logevent=LOSS:TRAIN | iteration=2009 | segment=192-240 | learning_rate=0.00014982265285473525 | grad_norm=0.031383 | mean_train_loss=0.128409 | tokens_per_sec=4.1e+05 | mfu_per_iteration=0.086 | grad_nosync_counter=2
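
A minimal sketch of how log lines in this format could be parsed for analysis (roughly what maxie/train/analysis/parse_log.ipynb presumably does; the function name, file path, and exact columns are assumptions):

```python
import re
import pandas as pd

# Each "key=value" pair in a LOSS:TRAIN line becomes a column.
PAIR_RE = re.compile(r"(\w+)=([^|\s]+)")

def parse_train_log(path):
    """Parse LOSS:TRAIN lines like the excerpt above into a DataFrame."""
    rows = []
    with open(path) as f:
        for line in f:
            if "logevent=LOSS:TRAIN" not in line:
                continue
            rows.append(dict(PAIR_RE.findall(line)))
    df = pd.DataFrame(rows)
    numeric_cols = ["iteration", "learning_rate", "grad_norm",
                    "mean_train_loss", "tokens_per_sec", "mfu_per_iteration"]
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col])
    return df

# Example usage (hypothetical log file name):
# df = parse_train_log("train.log")
# df.plot(x="iteration", y="mean_train_loss")
```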