Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Using ijson to load DC.json and E4D.json to save tons of memory. #230

Open Luodian opened 1 year ago

Luodian commented 1 year ago

Hi all, I just found a better way to load large JSON files: streaming them with ijson. In mimicit_dataset.py, you can replace the loading loop with the following code.

        # Requires `import os`, `import orjson`, and `import ijson` at the top of mimicit_dataset.py.
        for cur_mimicit_path, cur_images_path, cur_train_config_path, cur_status in zip(
            self.mimicit_paths, self.images_paths, self.train_config_paths, self.status_list
        ):
            # Load the dataset
            assert os.path.exists(cur_mimicit_path), f"Error: The local mimicit_path {cur_mimicit_path} does not exist!"
            with open(cur_mimicit_path, "rb") as f:
                if self.dataset == {}:
                    self.dataset = orjson.loads(f.read())["data"]
                else:
                    self.dataset.update(orjson.loads(f.read())["data"])

            # Load the images.
            # Use ijson for large files: it streams the top-level key/value pairs
            # one at a time instead of reading and parsing the whole file at once.
            with open(cur_images_path, "rb") as f:
                for key, value in ijson.kvitems(f, ""):
                    self.images[key] = value

            # Previous approach, which read the entire file into memory before parsing:
            #     with open(cur_images_path, "rb") as f:
            #         if not self.images:
            #             self.images = orjson.loads(f.read())
            #         else:
            #             self.images.update(orjson.loads(f.read()))

We will update this in the main branch later, along with requirement.txt.
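For anyone who wants to try the streaming pattern outside the dataset class, here is a minimal standalone sketch. The helper name load_top_level_items, the file name demo_images.json, and the sample keys are made up for illustration and are not part of the Otter code. It also falls back to the standard json module when ijson is not installed, which is an assumption added only so the sketch runs anywhere.

```python
import json
import os
import tempfile

try:
    import ijson  # third-party: pip install ijson
except ImportError:  # assumption: fall back so the sketch still runs without ijson
    ijson = None


def load_top_level_items(path):
    """Return the top-level key/value pairs of a JSON object file as a dict.

    Hypothetical helper for illustration; not part of mimicit_dataset.py.
    """
    items = {}
    with open(path, "rb") as f:
        if ijson is not None:
            # The empty prefix "" selects the top-level object; pairs are
            # yielded one at a time instead of parsing the file in one call.
            for key, value in ijson.kvitems(f, ""):
                items[key] = value
        else:
            # Non-streaming fallback: parses the whole file at once.
            items.update(json.load(f))
    return items


# Tiny stand-in for an images file (key -> base64 string); the data is made up.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_images.json")
    with open(demo_path, "w") as f:
        json.dump({"IMG_0001": "aGVsbG8=", "IMG_0002": "d29ybGQ="}, f)
    images = load_top_level_items(demo_path)

print(sorted(images))
```

Note that the resulting dict still holds every value, so the saving comes from not keeping the raw file contents and a second fully parsed copy in memory at the same time.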

Luodian commented 1 year ago

You will also need to install the YAJL C library, which ijson can use as a faster C backend, with the command: sudo apt install libyajl2