Modalities / modalities

A framework for training multimodal foundation models.
MIT License
38 stars 2 forks source link

Improve the dataset inheritance and class naming #167

Open le1nux opened 1 week ago

le1nux commented 1 week ago

As pointed out in the review by @flxst in the PR https://github.com/Modalities/modalities/pull/164, the class inheritance of the dataset classes can be improved:

  1. I think that the class inheritance structure, PackedMemMapDatasetBase -> PackedMemMapDatasetBase, PackedMemMapDatasetContinuous is suboptimal now that the parent class PackedMemMapDatasetBase is not an abstract base class anymore, but actually can be instantiated and used by itself. It might be cleaner to restore an abstract base class with an abstract method _generate_packing_index, and inherit three classes from it with the three different implementations of the method.
  2. The names of the classes are hard to understand and maybe a bit misleading. For instance, the class PackedMemMapDatasetBase does not actually do packing, does it? Also, why Continuous? Maybe we could try to find better names.

Besided the current inheritance issue, we also use the term "packing" in the code, e.g., for the class names such as PackedDataGenerator or PackedMemMapDatasetBase, although no packing is actually happening (we pack on the fly and only tokenize offline). As part of fixing the inheritance structure, we should also enforce the correct terminology.

flxst commented 1 week ago

And while we're at it, we could