allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

train_model recover protocol #5653

Closed: g-luo closed this issue 2 years ago

g-luo commented 2 years ago

I was wondering whether train_model (from allennlp.commands.train import train_model) recovers from the step that the process died on, or from the first step of the epoch.
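For reference, this is roughly the resume call I mean (a minimal sketch; config.jsonnet and runs/my_experiment are placeholder paths, and I'm assuming the usual recover flag on train_model):

 from allennlp.common.params import Params
 from allennlp.commands.train import train_model

 # Placeholder paths; the serialization directory already contains the
 # checkpoints from the interrupted run.
 params = Params.from_file("config.jsonnet")
 train_model(params, "runs/my_experiment", recover=True)

I believe the equivalent CLI invocation is allennlp train config.jsonnet -s runs/my_experiment --recover.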

The reason I ask is the following:

As a result, I suspect that train_model is recovering from the first step of the epoch, so the model only ever sees the same ~120k samples out of the ~1M in the dataset, which results in overfitting. I would love it if someone could confirm or give input on this. Thanks!

 batch_size: 16
 max_instances_in_memory: 8192
 biggest_batch_first: false
 instances_per_epoch: 65536
 maximum_samples_per_batch: ["num_tokens", 16384]
[screenshot attached: Screen Shot 2022-06-02 at 10 42 37 AM]
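
To make the concern concrete, here is a toy sketch (just an illustration of my assumption above, not AllenNLP's actual internals) of how resuming at an epoch boundary over a lazy, unshuffled stream with a fixed instances_per_epoch cap would keep replaying the same leading slice of the data:

 import itertools

 def lazy_instances(path):
     # Hypothetical reader that streams instances in file order.
     with open(path) as f:
         yield from f

 def one_epoch(path, instances_per_epoch=65536):
     # Fixed-size slice taken from the top of the stream, mimicking an
     # instances_per_epoch cap when the stream is rebuilt on every restart.
     return itertools.islice(lazy_instances(path), instances_per_epoch)

 # If recovery restarts at the epoch boundary and the stream is recreated
 # from the top, every resumed epoch consumes the same first 65,536
 # instances; the rest of the ~1M are never reached unless the reader
 # shuffles or remembers its position.
 for restart in range(3):
     n = sum(1 for _ in one_epoch("data.jsonl"))
     print(f"restart {restart}: saw the first {n} instances again")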

ccing @zmykevin

github-actions[bot] commented 2 years ago

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇