Modalities: a framework for training multimodal foundation models.

Fix/sequence length power of 2 #158

Closed: le1nux closed this 1 week ago

le1nux commented 3 weeks ago

What does this PR do?

Previously, the block_size in the dataset was set to a power of two, which made the effective sequence length block_size - 1 (e.g., 1023 instead of 1024). A non-power-of-two sequence length is not best practice and can impact model training, e.g., throughput-wise.

As a fix, we now specify the sequence_length in the config instead of the block_size. During dataset instantiation, we choose block_size = sequence_length + 1, so that the input sequence and the shifted target sequence each contain exactly sequence_length tokens.
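A minimal sketch of this relationship, assuming the usual shift-by-one language-modeling setup (variable names are illustrative, not the actual Modalities API):

```python
# One block holds sequence_length + 1 tokens so that the shifted
# input/target views each contain exactly sequence_length tokens.

sequence_length = 1024             # value configured by the user
block_size = sequence_length + 1   # chosen at dataset instantiation

tokens = list(range(block_size))   # one block of token ids
inputs = tokens[:-1]               # model input, length == 1024
targets = tokens[1:]               # next-token targets, length == 1024
assert len(inputs) == len(targets) == sequence_length
```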

Previously, we also chunked the dataset into non-overlapping chunks of block_size tokens, and each chunk was used for training individually. As a result, the last token of a block was only ever used as a target, never as an input. We changed this so that the last token of each block is reused as the first token of the subsequent block, as sketched below.
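A sketch of the overlapped chunking, assuming a flat stream of token ids (the helper name chunk_with_overlap is hypothetical):

```python
# Consecutive blocks share one boundary token, so the token that ends one
# block (used as a target there) also starts the next block (used as an input).
def chunk_with_overlap(token_ids: list[int], block_size: int) -> list[list[int]]:
    stride = block_size - 1  # advance by sequence_length, reusing the boundary token
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, stride)
    ]

# Example: with block_size=4, token 3 ends the first block and starts the second.
print(chunk_with_overlap(list(range(10)), block_size=4))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```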

General changes

Breaking Changes

Checklist before submitting final PR