crodriguezo / DORi

Public repository for DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video. Code accompanying the paper.

Training Details #7

Open anilbatra2185 opened 3 months ago

anilbatra2185 commented 3 months ago

hi @crodriguezo,

Can you share some details about training, such as how long it takes and what hardware/GPU was used?

Currently, on an A100-80GB (24 CPUs), training is very slow with a batch size of 4, i.e. it takes 6 hours per epoch. My concern is reading the object features. Any suggestions to speed up the training?

Regards

crodriguezo commented 3 months ago

Hi @anilbatra2185,

I appreciate your interest in our work.

When I did the training, I used an M.2 SSD, which made the process much faster. We used an RTX 8000. The disk's read speed is more critical than the GPU.

I ran into the same issue on a different machine and tried other options. I solved it by using an H5 file with all the features in one array, where the indices refer to offsets in that array rather than to separate files. That could help, but I cannot find the code for that option.
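Roughly, the layout was something like the sketch below (from memory, not the actual code; the dataset and index file names are just placeholders): all per-frame feature arrays go into one large float16 dataset, and a small index maps each video to its row range, so reading a video's features becomes a single slice instead of thousands of pickle loads.

    import json
    import h5py
    import numpy as np

    def pack_features(per_video_feats, h5_path="features.h5", index_path="index.json"):
        # per_video_feats: dict mapping video_id -> (num_rows, 2048) float16 array
        index = {}
        offset = 0
        total_rows = sum(arr.shape[0] for arr in per_video_feats.values())
        with h5py.File(h5_path, "w") as f:
            dset = f.create_dataset("all_features", shape=(total_rows, 2048), dtype=np.float16)
            for vid, arr in per_video_feats.items():
                dset[offset:offset + arr.shape[0]] = arr
                index[vid] = (offset, offset + arr.shape[0])  # row range for this video
                offset += arr.shape[0]
        with open(index_path, "w") as fi:
            json.dump(index, fi)

    # At training time, open the file once and slice by offset:
    #   start, end = index[video_id]
    #   feats = h5_file["all_features"][start:end]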

Best

anilbatra2185 commented 3 months ago

thanks @crodriguezo for your response.

I tried using H5; however, I am unable to save the file because of empty arrays in the object features. Below is my code to save the features in H5. Any suggestions on how to change it?

import os
import pickle

import h5py
import numpy as np
from tqdm import tqdm

def load_obj_feat(self):
    self.object_feats = {}
    # One (video, subset, recipe) triple per video.
    video_ids = list(set([(ann['video'], ann['subset'], ann['recipe']) for _, ann in self.anns.items()]))
    for video_id, subset, recipe in tqdm(video_ids, total=len(video_ids)):
        selected_frames = self.selected_frames[video_id]
        object_features = []
        human_features = []
        for selected in selected_frames:
            file_path = os.path.join(self.obj_feat_path, subset, recipe, video_id,
                                     "{}_{}.pkl".format("image", str(selected).zfill(5)))
            aux_obj = []
            aux_hum = []
            with open(file_path, "rb") as fo:
                obj_feat = pickle.load(fo, encoding='latin1')
                # Split the detections of this frame into human and non-human features.
                for indx, obj_type in enumerate(obj_feat['object_class']):
                    if self.mapping_obj[str(obj_type)]['human']:
                        aux_hum.append(obj_feat['features'][indx].astype(np.float16))
                    else:
                        aux_obj.append(obj_feat['features'][indx].astype(np.float16))
            # Frames with no detections of a given type get a single zero vector.
            if len(aux_obj) == 0:
                aux_obj = np.zeros((1, 2048), dtype=np.float16)
            if len(aux_hum) == 0:
                aux_hum = np.zeros((1, 2048), dtype=np.float16)
            aux_obj = np.array(aux_obj, dtype=np.float16)
            aux_hum = np.array(aux_hum, dtype=np.float16)

            object_features.append(aux_obj)
            human_features.append(aux_hum)

        self.object_feats[video_id] = (object_features, human_features)

    with h5py.File("dori_faster_rcnn_obj_feats.h5", 'w') as f:
        for vid, (obj_feat, human_feat) in tqdm(self.object_feats.items(), total=len(self.object_feats)):
            # The save fails here: obj_feat and human_feat are lists of per-frame
            # arrays with different shapes, so they cannot be stored as one dataset.
            f.create_dataset(f"{vid}", data=(obj_feat, human_feat))
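One variant I am considering (an untested sketch; as far as I understand, h5py cannot store a tuple of ragged lists as a single dataset) writes one dataset per frame under a per-video group instead:

    with h5py.File("dori_faster_rcnn_obj_feats.h5", 'w') as f:
        for vid, (obj_feat, human_feat) in self.object_feats.items():
            grp = f.create_group(vid)
            for i, (o, h) in enumerate(zip(obj_feat, human_feat)):
                grp.create_dataset(f"obj_{i}", data=o)  # (num_objects_i, 2048)
                grp.create_dataset(f"hum_{i}", data=h)  # (num_humans_i, 2048)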

Currently, I load the object features in memory, and the epoch time is reduced to 45 minutes by reducing the precision of the object features to float16. However, I wonder if you have any ablations on the number of object features per frame, i.e. considering only the top 5 objects per frame. Also, does reducing the number of frames to save memory have any effect on performance?

Thanks Anil

crodriguezo commented 3 months ago

Hi @anilbatra2185,

I don't know what dataset you are working on, but the number of objects depends on the type of video. I recall that some frames are just black frames because of transitions to other instructions (YouCookII) or the beginning/ending of an action (ActivityNet). However, I also recall that we set a maximum number of objects per key-frame: "We extract the top 15 objects detected in terms of confidence for each key-frames using Faster-RCNN." (Section 5.1, https://openaccess.thecvf.com/content/WACV2021/papers/Rodriguez-Opazo_DORi_Discovering_Object_Relationships_for_Moment_Localization_of_a_Natural_WACV_2021_paper.pdf)
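In code, that filtering step is roughly the following (a minimal sketch, not the exact code in this repository; the argument names are assumptions):

    import numpy as np

    def top_k_detections(features, scores, k=15):
        # features: (N, 2048) Faster-RCNN region features, scores: (N,) detection confidences
        order = np.argsort(scores)[::-1][:k]  # indices of the k most confident detections
        return features[order]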

I am re-running the LocFormer (SBFS) experiments with DORi before sharing the code with you. They take 15 minutes per epoch on a Quadro P5000, using independent files (not H5) for the objects and a batch size of 6. However, I am using an M.2 SSD. (I will share that code ASAP; I am just cleaning out other ideas that I never finished.)

For training time and inference statistics, see Table 5: https://aclanthology.org/2023.eacl-main.140.pdf.

anilbatra2185 commented 3 months ago

I am working on YouCook2 at the moment. I appreciate your help and efforts!

I will wait for you to share the SBFS code.

Thanks Anil