@CoCoNuTeK Assuming the callback is from HF transformers, this question is better suited for https://huggingface.co/docs/transformers/en/main_classes/callback
I guess you're right about checkpoint-1000; I'm not sure about checkpoint-final, but my guess is that it contains the model from the step where training stopped.
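If you want checkpoint-final to hold the best model rather than just the last weights, the usual Trainer knobs are load_best_model_at_end together with a matching eval/save schedule. A sketch with standard TrainingArguments options; the values are illustrative:

```python
from transformers import TrainingArguments

# Sketch: standard HF Trainer options for keeping the best checkpoint
# rather than only the last one (adjust paths/steps to your setup).
training_args = TrainingArguments(
    output_dir="./output",
    evaluation_strategy="steps",       # evaluate periodically...
    eval_steps=500,
    save_strategy="steps",             # ...and checkpoint on the same schedule
    save_steps=500,
    load_best_model_at_end=True,       # reload the best checkpoint at the end
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```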
Yeah, when I set log_level="info" I saw both the eval step and the train step logged, so that's not the issue. However, I'm pretty sure it's not training on all the data I put in, because the .arrow file contains around 100K sequences (time series). In other words, this code:
```python
# Snippet from train.py (imports added here for context):
from functools import partial
from pathlib import Path

from gluonts.dataset.common import FileDataset
from gluonts.itertools import Filter

train_datasets = [
    Filter(
        partial(
            has_enough_observations,  # defined in train.py
            min_length=min_past + prediction_length,
            max_missing_prop=max_missing_prop,
        ),
        FileDataset(path=Path(data_path), freq="h"),
    )
    for data_path in training_data_paths
]
```
loads around 130K time series (the FileDataset(path=Path(data_path), freq="h") part; I only have one data_path). I've looked through the train.py code multiple times, but I can't find any reduction being applied anywhere.
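A quick way to sanity-check how many series GluonTS actually reads from the file (a sketch; the path is illustrative and freq matches the snippet above):

```python
from pathlib import Path
from gluonts.dataset.common import FileDataset

# Sketch: count the series visible to GluonTS in the .arrow file.
ds = FileDataset(path=Path("data/train.arrow"), freq="h")
print(sum(1 for _ in ds))  # should print ~100K if everything is loaded
```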
This is the console print:

```
***** Running training *****
  Num examples = 640
  Num Epochs = 9,223,372,036,854,775,807
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 10
  Number of trainable parameters = 8,394,496
```
Or is it because I need to set the total batch size somewhere to be higher?
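Looking at the log again: with an infinite IterableDataset, the HF Trainer reports Num examples as max_steps × total batch size (10 × 64 = 640 here) and Num Epochs as sys.maxsize, so neither number reflects how many series are in the file. Raising the step budget is the standard fix; a sketch with stock TrainingArguments fields, values illustrative:

```python
from transformers import TrainingArguments

# Sketch: with an IterableDataset, training length is bounded by max_steps,
# not epochs; 10 steps of batch 64 is what produced "Num examples = 640".
training_args = TrainingArguments(
    output_dir="./output",
    max_steps=200_000,               # step budget (was effectively 10 in the log)
    per_device_train_batch_size=64,  # matches the run above
    gradient_accumulation_steps=1,
)
```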
Never mind, I was using steps instead of epochs. I'll need to tinker a bit more with train.py. Just to help me out a bit: does __iter__(self) -> Iterator inside ChronosDataset need to return all the data at once, in the following format?
```python
return {
    "input_ids": input_ids.squeeze(0),
    "attention_mask": attention_mask.squeeze(0),
    "labels": labels.squeeze(0),
}
```
I won't have missing data, so the attention_mask should be all 1s, right? And then what are the input_ids/labels?
If I have 10 sequences of length, let's say, 500, what shape should the input_ids be? And the same for the labels?
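For what it's worth, the pattern the HF Trainer expects from an iterable dataset is one example per yield, not the whole dataset at once. A minimal sketch (illustrative class and names, not the actual ChronosDataset, which also tokenizes the values):

```python
import torch
from torch.utils.data import IterableDataset

class ToySeq2SeqDataset(IterableDataset):
    """Illustrative only; the real ChronosDataset also quantizes values into ids."""

    def __init__(self, sequences, context_length, prediction_length):
        self.sequences = sequences              # e.g. 10 series of length 500
        self.context_length = context_length
        self.prediction_length = prediction_length

    def __iter__(self):
        # One dict per example: the Trainer's DataLoader pulls these lazily and
        # stacks them into (batch_size, context_length) / (batch_size,
        # prediction_length) tensors, so nothing is returned "all at once".
        for seq in self.sequences:
            input_ids = torch.as_tensor(seq[: self.context_length], dtype=torch.long)
            labels = torch.as_tensor(
                seq[self.context_length : self.context_length + self.prediction_length],
                dtype=torch.long,
            )
            yield {
                "input_ids": input_ids,                        # shape (context_length,)
                "attention_mask": torch.ones_like(input_ids),  # all 1s: no missing values
                "labels": labels,                              # shape (prediction_length,)
            }
```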
Thank you very much.
Okay, I solved it.
Hello there, so I made these adjustments to train.py, added to my config and to the script's main():
inside the training_args I put
and inside the Trainer I put
I tried fine-tuning the 'small' model with 1K epochs; however, I am not quite sure it had any effect, as the training ran for the full 1K epochs, and from serialization I got checkpoint-1000 and also checkpoint-final as outputs of that run.
Here I would like to ask: if I set the number of epochs to 1K, I assume checkpoint-final contains the best model over the whole training session and checkpoint-1000 holds the weights after all 1K epochs, right?
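A checkpoint saved this way can then be loaded like a released Chronos model (a sketch; the path is illustrative):

```python
import torch
from chronos import ChronosPipeline

# Sketch: load the fine-tuned weights from the run's output directory.
pipeline = ChronosPipeline.from_pretrained(
    "output/checkpoint-final",  # illustrative path from the run above
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
```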