Modalities / modalities

A framework for training multimodal foundation models.
MIT License

Fix failing test_e2e_training_run_wout_ckpt #98

Closed. le1nux closed this issue 3 months ago.

le1nux commented 4 months ago

It looks like there is an issue with loading the dataset.

/workspaces/modalities/tests/test_main.py::test_e2e_training_run_wout_ckpt failed: monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f31501a07d0>
indexed_dummy_data_path = DataPathCollection(raw_data_path=PosixPath('/tmp/pytest-of-root/pytest-10/test_e2e_training_run_wout_ckp0/lorem_ipsum.jsonl'), index_path=PosixPath('/tmp/pytest-of-root/pytest-10/test_e2e_training_run_wout_ckp0/lorem_ipsum.idx'))
dummy_config = ({'batch_progress_subscriber': {'component_key': 'progress_subscriber', 'config': {'eval_dataloaders': {'instance_key'...loader', 'pass_type': 'BY_REFERENCE'}], ...}, PosixPath('/workspaces/modalities/config_files/config_lorem_ipsum.yaml'))

    @pytest.mark.skipif(torch.cuda.device_count() < 1, reason="This e2e test requires 1 GPU.")
    def test_e2e_training_run_wout_ckpt(monkeypatch, indexed_dummy_data_path, dummy_config):
        # patch in env variables
        monkeypatch.setenv("MASTER_ADDR", "localhost")
        monkeypatch.setenv("MASTER_PORT", "9948")
        config_dict, config_path = dummy_config
        print(indexed_dummy_data_path.raw_data_path)
        config_dict["train_dataset"]["config"]["raw_data_path"] = indexed_dummy_data_path.raw_data_path
        main = Main(config_dict, config_path)
>       main.run()

tests/test_main.py:16: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/modalities/__main__.py:198: in run
    components: TrainingComponentsModel = self.component_factory.build_components(
src/modalities/config/component_factory.py:16: in build_components
    component_dict = self._build_config(config_dict=config_dict, component_names=component_names)
src/modalities/config/component_factory.py:23: in _build_config
    components, _ = self._build_component(
src/modalities/config/component_factory.py:49: in _build_component
    materialized_component_config[sub_entity_key], top_level_components = self._build_component(
src/modalities/config/component_factory.py:49: in _build_component
    materialized_component_config[sub_entity_key], top_level_components = self._build_component(
src/modalities/config/component_factory.py:49: in _build_component
    materialized_component_config[sub_entity_key], top_level_components = self._build_component(
src/modalities/config/component_factory.py:84: in _build_component
    materialized_referenced_component, top_level_components = self._build_component(
src/modalities/config/component_factory.py:67: in _build_component
    component = self._instantiate_component(
src/modalities/config/component_factory.py:134: in _instantiate_component
    component = component_type(**component_config_dict)
src/modalities/dataloader/dataset_factory.py:63: in get_packed_mem_map_dataset_continuous
    dataset = PackedMemMapDatasetContinuous(
src/modalities/dataloader/dataset.py:143: in __init__
    self._embedded_stream_data = EmbeddedStreamData(raw_data_path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <modalities.dataloader.create_packed_data.EmbeddedStreamData object at 0x7f3150410310>
data_path = PosixPath('/tmp/pytest-of-root/pytest-10/test_e2e_training_run_wout_ckp0/lorem_ipsum.jsonl')

    def __init__(self, data_path: Path):
        self._data_path = data_path
        if not self._data_path.is_file():
            raise FileNotFoundError(
                f"Packed Data was not found at {self._data_path}."
                f"Create on in advance by using `modalities data pack_encoded_data`."
            )

        with self._data_path.open("rb") as f:
            # get number of bytes in data section
            data_section_length_in_bytes = f.read(self.DATA_SECTION_LENGTH_IN_BYTES)
            self.data_len = int.from_bytes(data_section_length_in_bytes, byteorder="little")

            # get number of bytes for encoding a single token
            f.seek(self.DATA_SECTION_LENGTH_IN_BYTES)
            token_size_as_bytes = f.read(self.TOKEN_SIZE_DESCRIPTOR_LENGTH_IN_BYTES)
            self.token_size_in_bytes = int.from_bytes(token_size_as_bytes, byteorder="little", signed=False)

            # get index
>           f.seek(self.HEADER_SIZE_IN_BYTES + self.data_len)
E           OSError: [Errno 22] Invalid argument

src/modalities/dataloader/create_packed_data.py:212: OSError
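For context, here is a minimal standalone sketch of why the traceback ends in `OSError: [Errno 22]`: the test hands the packed-data reader a raw `.jsonl` file, so the first bytes of JSON text are interpreted as the binary data-section length. This sketch assumes an 8-byte little-endian length field; the actual field width in `EmbeddedStreamData` may differ.

```python
# Hypothetical reproduction of the failure mode, outside the modalities
# code base. Assumption: the packed-data header starts with a little-endian
# data-section length (8 bytes assumed here; the real constant may differ).
import tempfile

# The test passed a raw .jsonl file where a packed data file was expected.
with tempfile.NamedTemporaryFile("w+b", suffix=".jsonl") as f:
    f.write(b'{"text": "lorem ipsum dolor sit amet"}\n')
    f.seek(0)
    # Reading JSON text as a binary length field yields a nonsense value.
    bogus_data_len = int.from_bytes(f.read(8), byteorder="little")

# b'{"text":' decodes to roughly 4.2e18, so the subsequent
# f.seek(HEADER_SIZE_IN_BYTES + data_len) asks the OS for an offset it
# rejects, surfacing as OSError: [Errno 22] Invalid argument.
print(bogus_data_len)
```

So the error is a symptom of the wrong file path (`.jsonl` instead of the packed file), not of the seek logic itself.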
luzian-hahn commented 4 months ago

The current state of main (4d9218f51e867331d92de0b068db2b1b7a3da726) contains dgx-specific paths, probably from dgx2. I want to emphasize again that this is a bad idea, especially when it means modifying test resources like the lorem_ipsum config (e.g. here).

Git blame points to @mali-git. Please don't do this: it makes the tool unusable on other machines, undermines the stability of the tests, and invites people to ignore them even more.
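As a sketch of the alternative (illustrative names only, not the actual fixture code): machine-specific paths stay out of the committed config, and each test injects its own path at runtime, as `test_e2e_training_run_wout_ckpt` already does for `raw_data_path`.

```python
# Illustrative sketch only: inject per-machine paths at test time instead of
# committing them. The config keys mirror the test above; the helper name
# is hypothetical.
from pathlib import Path


def inject_data_path(config_dict: dict, raw_data_path: Path) -> dict:
    """Overwrite the dataset path in a loaded config with a test-local one."""
    config_dict["train_dataset"]["config"]["raw_data_path"] = raw_data_path
    return config_dict


# In a test, raw_data_path would come from a tmp-dir fixture, never from
# a hardcoded dgx path baked into the committed YAML.
cfg = {"train_dataset": {"config": {"raw_data_path": Path("/dgx2/some/path")}}}
cfg = inject_data_path(cfg, Path("/tmp/test/lorem_ipsum.pbin"))
```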

luzian-hahn commented 4 months ago

Seems like this problem was introduced with the tokenizer merge (at least, the previous state at 60feafe29ec882939202be4e88892bbcde2e53f5 did not show it). Are you sure the code is stable for use on Taurus? @le1nux @mali-git

luzian-hahn commented 3 months ago

I was able to fix the tests; they pass in my local setup. Since the CI is still broken due to our non-GPU runners and the flash-attn integration, I verified the fix locally in a dockerized environment. The Docker build utilities are not part of this PR, but you can try reproducing the setup by checking out this tagged version: https://github.com/Modalities/modalities/tree/dockerized-pytester
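An untested outline of what that local reproduction might look like (the tagged branch's actual Docker utilities may differ; image name and test selection here are assumptions):

```shell
# Hypothetical reproduction outline; adapt to the Dockerfile in the
# dockerized-pytester tag.
git clone https://github.com/Modalities/modalities.git
cd modalities
git checkout dockerized-pytester
docker build -t modalities-pytester .
# GPU passthrough is required because the e2e test is skipped without a GPU.
docker run --rm --gpus all modalities-pytester pytest tests/test_main.py
```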