@CoCoNuTeK Assuming the callback is from HF transformers, this question is better suited for https://huggingface.co/docs/transformers/en/main_classes/callback
I guess you're right about checkpoint-1000; I'm not sure about checkpoint-final, but my guess is that it contains the model from the step where training stopped.
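If you want checkpoint-final to hold the best model rather than just the last weights, the usual Trainer knobs are load_best_model_at_end together with a matching eval/save schedule. A sketch with standard TrainingArguments options; the values are illustrative:

```python
from transformers import TrainingArguments

# Sketch: standard HF Trainer options for keeping the best checkpoint
# rather than only the last one (adjust paths/steps to your setup).
training_args = TrainingArguments(
    output_dir="./output",
    evaluation_strategy="steps",       # evaluate periodically...
    eval_steps=500,
    save_strategy="steps",             # ...and checkpoint on the same schedule
    save_steps=500,
    load_best_model_at_end=True,       # reload the best checkpoint at the end
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```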
Yeah, when I set log_level="info" I saw both the eval step and the train step logged, so that's not the issue. However, I'm pretty sure it's not training on all the data I put in, because the .arrow file contains around 100K sequences (time series). In other words, this code:
```python
# Snippet from train.py (imports added here for context):
from functools import partial
from pathlib import Path

from gluonts.dataset.common import FileDataset
from gluonts.itertools import Filter

train_datasets = [
    Filter(
        partial(
            has_enough_observations,  # defined in train.py
            min_length=min_past + prediction_length,
            max_missing_prop=max_missing_prop,
        ),
        FileDataset(path=Path(data_path), freq="h"),
    )
    for data_path in training_data_paths
]
```
loads around 130K time series (the FileDataset(path=Path(data_path), freq="h") part; I only have one data_path). I've looked through the train.py code multiple times, but I can't find any reduction being applied anywhere.
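A quick way to sanity-check how many series GluonTS actually reads from the file (a sketch; the path is illustrative and freq matches the snippet above):

```python
from pathlib import Path
from gluonts.dataset.common import FileDataset

# Sketch: count the series visible to GluonTS in the .arrow file.
ds = FileDataset(path=Path("data/train.arrow"), freq="h")
print(sum(1 for _ in ds))  # should print ~100K if everything is loaded
```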
This is the console print:

```
***** Running training *****
  Num examples = 640
  Num Epochs = 9,223,372,036,854,775,807
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 10
  Number of trainable parameters = 8,394,496
```
Or is it because I need to set the total batch size somewhere to be higher?
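Looking at the log again: with an infinite IterableDataset, the HF Trainer reports Num examples as max_steps × total batch size (10 × 64 = 640 here) and Num Epochs as sys.maxsize, so neither number reflects how many series are in the file. Raising the step budget is the standard fix; a sketch with stock TrainingArguments fields, values illustrative:

```python
from transformers import TrainingArguments

# Sketch: with an IterableDataset, training length is bounded by max_steps,
# not epochs; 10 steps of batch 64 is what produced "Num examples = 640".
training_args = TrainingArguments(
    output_dir="./output",
    max_steps=200_000,               # step budget (was effectively 10 in the log)
    per_device_train_batch_size=64,  # matches the run above
    gradient_accumulation_steps=1,
)
```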
Never mind, I was using steps instead of epochs. I'll need to tinker a bit more with train.py. Just to help me out a bit: does __iter__(self) -> Iterator inside ChronosDataset need to return all the data at once, in the following format?
```python
return {
    "input_ids": input_ids.squeeze(0),
    "attention_mask": attention_mask.squeeze(0),
    "labels": labels.squeeze(0),
}
```
I won't have missing data, so the attention_mask should be all 1s, right? And then what are the input_ids/labels?
If I have 10 sequences of length, let's say, 500, what shape should the input_ids be? And the same for the labels?
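For what it's worth, the pattern the HF Trainer expects from an iterable dataset is one example per yield, not the whole dataset at once. A minimal sketch (illustrative class and names, not the actual ChronosDataset, which also tokenizes the values):

```python
import torch
from torch.utils.data import IterableDataset

class ToySeq2SeqDataset(IterableDataset):
    """Illustrative only; the real ChronosDataset also quantizes values into ids."""

    def __init__(self, sequences, context_length, prediction_length):
        self.sequences = sequences              # e.g. 10 series of length 500
        self.context_length = context_length
        self.prediction_length = prediction_length

    def __iter__(self):
        # One dict per example: the Trainer's DataLoader pulls these lazily and
        # stacks them into (batch_size, context_length) / (batch_size,
        # prediction_length) tensors, so nothing is returned "all at once".
        for seq in self.sequences:
            input_ids = torch.as_tensor(seq[: self.context_length], dtype=torch.long)
            labels = torch.as_tensor(
                seq[self.context_length : self.context_length + self.prediction_length],
                dtype=torch.long,
            )
            yield {
                "input_ids": input_ids,                        # shape (context_length,)
                "attention_mask": torch.ones_like(input_ids),  # all 1s: no missing values
                "labels": labels,                              # shape (prediction_length,)
            }
```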
Thank you very much.
Okay, I solved it.
Hello there, so I made these adjustments to train.py, added to my config and to the script's main():
inside the training_args I put
and inside the Trainer I put
I tried fine-tuning the 'small' model with 1K epochs; however, I am not quite sure it had any effect, as the training ran for the full 1K epochs, and from serialization I got checkpoint-1000 and also checkpoint-final as outputs of that run.
Here I would like to ask: if I set the number of epochs to 1K, I assume checkpoint-final contains the best model over the whole training session and checkpoint-1000 holds the weights after all 1K epochs, right?
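A checkpoint saved this way can then be loaded like a released Chronos model (a sketch; the path is illustrative):

```python
import torch
from chronos import ChronosPipeline

# Sketch: load the fine-tuned weights from the run's output directory.
pipeline = ChronosPipeline.from_pretrained(
    "output/checkpoint-final",  # illustrative path from the run above
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
```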