When train_model (from allennlp.commands.train import train_model) recovers, does it resume from the step the process died on, or from the first step in the epoch?
The reason I ask:
I relaunch train_model with recover=true every 2 epochs because I run into out-of-memory issues. Also keep in mind that my DataLoader config looks like the below, and my dataset has ~1M samples in total.
My loss curves show clear overfitting, where blue is validation and orange is training.
As a result, I suspect that train_model recovers from the first step in the epoch, so the model only ever sees the same ~120k samples out of the ~1M in the dataset, which would explain the overfitting. I would love it if someone could confirm or give input on this. Thanks!
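To make the hypothesis concrete, here is a toy simulation. It is not AllenNLP code and says nothing about what train_model actually does; it only compares two hypothetical recovery policies for a sequential data stream. The function name, the `resume_position` flag, and the figure of 60,000 samples per epoch (chosen so that 2 epochs give ~120k) are assumptions made for illustration.

```python
def distinct_samples_seen(total, per_epoch, epochs_per_launch, launches, resume_position):
    """Count distinct sample indices visited across several relaunches.

    Hypothetical model: samples are read sequentially from a stream of
    `total` items. If `resume_position` is False, every relaunch starts
    reading from index 0 again; if True, it continues where it stopped.
    """
    seen = set()
    position = 0
    for _ in range(launches):
        if not resume_position:
            position = 0  # stream restarts from the beginning on relaunch
        for _ in range(epochs_per_launch * per_epoch):
            seen.add(position % total)  # wrap around at the end of the dataset
            position += 1
    return len(seen)


if __name__ == "__main__":
    # Assumed numbers: ~1M samples, 60k samples per epoch, relaunch every 2 epochs.
    restart = distinct_samples_seen(1_000_000, 60_000, 2, 5, resume_position=False)
    resume = distinct_samples_seen(1_000_000, 60_000, 2, 5, resume_position=True)
    print(restart)  # 120000
    print(resume)   # 600000
```

Under these assumptions, a stream that restarts on every relaunch never gets past the first 120k samples no matter how many relaunches happen, while a stream that resumes keeps covering new data. Whether train_model behaves like either policy is exactly the open question.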
ccing @zmykevin