I think that the class inheritance structure, PackedMemMapDatasetBase -> PackedMemMapDatasetBase, PackedMemMapDatasetContinuous is suboptimal now that the parent class PackedMemMapDatasetBase is not an abstract base class anymore, but actually can be instantiated and used by itself. It might be cleaner to restore an abstract base class with an abstract method _generate_packing_index, and inherit three classes from it with the three different implementations of the method.
The names of the classes are hard to understand and maybe a bit misleading. For instance, the class PackedMemMapDatasetBase does not actually do packing, does it? Also, why Continuous? Maybe we could try to find better names.
Besided the current inheritance issue, we also use the term "packing" in the code, e.g., for the class names such as PackedDataGenerator or PackedMemMapDatasetBase, although no packing is actually happening (we pack on the fly and only tokenize offline). As part of fixing the inheritance structure, we should also enforce the correct terminology.
change the name of the CLI command modalities data pack_encoded_data (see here), and
create a common CLI command for the data preprocessing that comprises both create_raw_index and pack_encoded_data in a single step, e.g. modalities data tokenize
As pointed out in the review by @flxst in the PR https://github.com/Modalities/modalities/pull/164, the class inheritance of the dataset classes can be improved:
Besided the current inheritance issue, we also use the term "packing" in the code, e.g., for the class names such as
PackedDataGenerator
orPackedMemMapDatasetBase
, although no packing is actually happening (we pack on the fly and only tokenize offline). As part of fixing the inheritance structure, we should also enforce the correct terminology.