YouHuang67 / focsam

MIT License
13 stars 2 forks

A question about training the decoder #6

Open angelita277 opened 3 months ago

angelita277 commented 3 months ago

When I try to train by running 'train.py', I get the error "'COCOPanopticDataset' object has no attribute 'ignore_sample_indices'". I found that this happens because the files 'data/meta-info/coco_invalid_image_names.json' and 'data/meta-info/coco.json' are required, but neither exists under my 'data/meta-info'. So are these two files default files shipped with the repo, or are they generated by 'train.py'? Why can't my 'train.py' generate them? I am sure the COCO and LVIS datasets are arranged according to 'DATASET.md'.

YouHuang67 commented 3 months ago

Thank you for reaching out and sorry for the inconvenience. It appears I missed uploading the data/meta-info/coco_invalid_image_names.json file. I have now uploaded it. The other file is a cache file that should be automatically generated during the data loading. The purpose of the coco_invalid_image_names.json file is to identify certain annotations that may be problematic and should be excluded from the training process.
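For illustration, the exclusion step could look like the following sketch (the function name and the assumption that the JSON file holds a flat list of image-name strings are mine, not the actual focsam code):

```python
import json

def load_valid_image_names(all_names, invalid_json_path):
    """Drop image names listed in the invalid-names JSON file.

    Assumes the file (e.g. data/meta-info/coco_invalid_image_names.json)
    contains a flat JSON list of image-name strings to exclude.
    """
    with open(invalid_json_path) as f:
        invalid = set(json.load(f))
    return [name for name in all_names if name not in invalid]
```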

angelita277 commented 3 months ago

I trained the decoder on 4 V100s. Here is the log:

2024/06/14 09:55:51 - mmengine - INFO - Checkpoints will be saved to /data2/code/laq/focsam/work_dirs/sam/coco_lvis/train_colaug_coco_lvis_1024x1024_320k.
2024/06/14 10:08:09 - mmengine - INFO - Iter(train) [ 200/320000] lr: 1.3276e-05 eta: 13 days, 15:44:23 time: 3.6894 data_time: 0.0102 memory: 14065 loss: 2.4008 dec.los.nfl_loss: 2.4008 dec.met.binary_iou: 0.9027
2024/06/14 10:20:15 - mmengine - INFO - Iter(train) [ 400/320000] lr: 2.6618e-05 eta: 13 days, 13:04:21 time: 3.6616 data_time: 0.0102 memory: 7740 loss: 1.4424 dec.los.nfl_loss: 1.4424 dec.met.binary_iou: 0.7158
2024/06/14 10:32:25 - mmengine - INFO - Iter(train) [ 600/320000] lr: 3.9960e-05 eta: 13 days, 12:25:21 time: 3.6566 data_time: 0.0102 memory: 7740 loss: 1.0784 dec.los.nfl_loss: 1.0784 dec.met.binary_iou: 0.9195
2024/06/14 10:44:38 - mmengine - INFO - Iter(train) [ 800/320000] lr: 5.3302e-05 eta: 13 days, 12:28:17 time: 3.6595 data_time: 0.0102 memory: 7740 loss: 0.8912 dec.los.nfl_loss: 0.8912 dec.met.binary_iou: 0.7826
2024/06/14 10:56:56 - mmengine - INFO - Exp name: train_colaug_coco_lvis_1024x1024_320k_20240614_095248
2024/06/14 10:56:56 - mmengine - INFO - Iter(train) [ 1000/320000] lr: 6.6644e-05 eta: 13 days, 12:46:51 time: 3.6652 data_time: 0.0101 memory: 7740 loss: 0.7754 dec.los.nfl_loss: 0.7754 dec.met.binary_iou: 0.7738
2024/06/14 11:09:15 - mmengine - INFO - Iter(train) [ 1200/320000] lr: 7.9987e-05 eta: 13 days, 12:59:27 time: 3.6660 data_time: 0.0102 memory: 7739 loss: 0.3575 dec.los.nfl_loss: 0.3575 dec.met.binary_iou: 0.4644
2024/06/14 11:21:33 - mmengine - INFO - Iter(train) [ 1400/320000] lr: 9.3329e-05 eta: 13 days, 13:04:49 time: 3.6779 data_time: 0.0101 memory: 7740 loss: 0.3251 dec.los.nfl_loss: 0.3251 dec.met.binary_iou: 0.8465
2024/06/14 11:33:54 - mmengine - INFO - Iter(train) [ 1600/320000] lr: 9.9969e-05 eta: 13 days, 13:13:53 time: 3.6896 data_time: 0.0101 memory: 7740 loss: 0.3183 dec.los.nfl_loss: 0.3183 dec.met.binary_iou: 0.3718

So... will this take me 13 days to train?

angelita277 commented 3 months ago

I wonder where the pre-extracted embeddings are stored: 'focsam/data/embeds/colaug_coco_1024x1024_sam_vit_huge' or 'focsam/work_dirs/sam/coco_lvis/train_colaug_coco_lvis_1024x1024_320k'? The former is empty, and the latter only stores config.py and logs.

YouHuang67 commented 3 months ago

It appears each training iteration takes over 3 seconds, which explains the estimated training duration. The pre-extracted embeddings are generated dynamically during training and are typically stored in focsam/data/embeds/colaug_coco_1024x1024_sam_vit_huge. Please use the path under data rather than work_dirs, since the work_dirs directory is intended for logs and model weights. As the total size of the embeddings can reach 300-400 GB, they are not uploaded but generated on the fly with SAM; once cached, they should significantly speed up training. Make sure there is sufficient disk space available to accommodate this.
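Given the 300-400 GB figure, a quick pre-flight check with Python's standard library can catch disk-space problems before training starts (this helper is a suggestion of mine, not part of the repo):

```python
import shutil

def check_embed_cache_space(cache_dir=".", required_gb=400):
    """Warn if free disk space is below the expected embedding cache size.

    400 GB matches the upper bound mentioned for the full COCO+LVIS
    embedding cache; adjust required_gb for your dataset.
    """
    free_gb = shutil.disk_usage(cache_dir).free / 1024 ** 3
    if free_gb < required_gb:
        print(f"Warning: only {free_gb:.0f} GB free in {cache_dir}, "
              f"need ~{required_gb} GB for cached embeddings")
    return free_gb
```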

angelita277 commented 3 months ago

It seems that the pre-extracted embeddings are not being stored. Could you tell me which file implements the embedding storage function, so I can check what's wrong with it?

angelita277 commented 3 months ago

There is a warning:

06/14 14:12:42 - mmengine - WARNING - Failed to search registry with scope "mmseg" in the "embed_loader" registry tree. As a workaround, the current "embed_loader" registry in "focsam" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmseg" is a correct scope, or whether the registry is initialized.

Maybe this is the reason?

YouHuang67 commented 3 months ago

It seems that the pre-extracted embeddings are not being stored. Could you tell me which file implements the embedding storage function, so I can check what's wrong with it?

You'll notice in your training configuration that there's a variable named image_embed_loader with the type BaseEmbedLoader. To find the implementation, you can use the global search function in your IDE, such as PyCharm. Simply select BaseEmbedLoader and double-tap Shift, which should direct you to the corresponding .py file, likely located in focsam/embed_loaders/base.py. Most IDEs offer a similar feature for locating files. This file contains the caching implementation you're looking for. As this repository is developed based on the OpenMMLab codebase, you'll typically need to trace the configuration settings in this manner to find the relevant code.
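For readers unfamiliar with OpenMMLab-style configs, the type-string-to-class indirection works roughly like this simplified stand-in (a sketch of the registry pattern, not the actual mmengine or focsam code):

```python
# Minimal stand-in for the OpenMMLab registry pattern: a config dict's
# 'type' string is looked up in a registry that maps names to classes.
EMBED_LOADERS = {}

def register(cls):
    """Register a class under its own name, as a decorator."""
    EMBED_LOADERS[cls.__name__] = cls
    return cls

@register
class BaseEmbedLoader:
    # Stand-in for the class in focsam/embed_loaders/base.py.
    def __init__(self, embed_dir):
        self.embed_dir = embed_dir

def build_embed_loader(cfg):
    """Resolve cfg['type'] to a class and instantiate it with the rest."""
    cfg = dict(cfg)
    cls = EMBED_LOADERS[cfg.pop("type")]
    return cls(**cfg)

# A config entry like dict(type='BaseEmbedLoader', ...) thus resolves
# to the registered BaseEmbedLoader class.
loader = build_embed_loader(dict(type="BaseEmbedLoader",
                                 embed_dir="data/embeds"))
```

This is why tracing a config's `type` string with a global IDE search leads you to the implementing class.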

YouHuang67 commented 3 months ago

There is a warning:

06/14 14:12:42 - mmengine - WARNING - Failed to search registry with scope "mmseg" in the "embed_loader" registry tree. As a workaround, the current "embed_loader" registry in "focsam" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmseg" is a correct scope, or whether the registry is initialized.

Maybe this is the reason?

It seems likely that the issue stems from the registry problem mentioned in the warning. I will adjust the registry settings.
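The fallback described in the warning can be sketched as follows (a simplified illustration of mmengine's scope-fallback behavior, not its actual code):

```python
def resolve_registry(scope, registries, fallback):
    """Sketch of mmengine's fallback: if the requested scope (here
    'mmseg') has no matching registry in the tree, fall back to the
    current project's registry ('focsam') and emit a warning."""
    if scope in registries:
        return registries[scope]
    print(f'Warning: scope "{scope}" not found; '
          f'falling back to "{fallback}"')
    return registries[fallback]
```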

YouHuang67 commented 3 months ago

There is a warning:

06/14 14:12:42 - mmengine - WARNING - Failed to search registry with scope "mmseg" in the "embed_loader" registry tree. As a workaround, the current "embed_loader" registry in "focsam" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmseg" is a correct scope, or whether the registry is initialized.

Maybe this is the reason?

It seems likely that the issue stems from the registry problem mentioned in the warning. I will adjust the registry settings.

I've just updated the registration for the embed_loader. This should align the scope correctly and hopefully resolve the issues.

angelita277 commented 3 months ago

There is a required file named data/embeds/colaug_coco_1024x1024_sam_vit_huge/meta-info.json. Is this a default file or a generated one?

angelita277 commented 3 months ago

There seems to be no code storing the caches. The "embed_loader" only loads the caches for the decoder, but I need to store the embeddings produced by the encoder.

YouHuang67 commented 3 months ago

You might want to check the implementation starting from line 42 of focsam/embed_loaders/base.py, where the __call__ method handles the primary functionality: it checks for existing cached embeddings and, if none are found, generates and stores them, including constructing the meta-info file.
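The cache-or-compute pattern described above can be sketched like this (the function name and file layout are my assumptions, not the actual BaseEmbedLoader code):

```python
import os
import numpy as np

def load_or_compute_embed(prefix, embed_dir, compute_fn):
    """Reuse a cached .npy embedding if present; otherwise compute it
    (e.g. by running the SAM image encoder) and save it for later
    iterations. Sketch only -- not the real focsam implementation."""
    path = os.path.join(embed_dir, f"{prefix}.npy")
    if os.path.exists(path):
        return np.load(path)
    embed = compute_fn()
    os.makedirs(embed_dir, exist_ok=True)
    np.save(path, embed)
    return embed
```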

angelita277 commented 3 months ago

Well... here is the code that saves the embeddings:

    if update_flag:
        np.save(
            self.prefix_to_embed_file(prefix),
            embed.cpu().numpy())

and here is the value of update_flag:

    update_flag = self.update_prefixes_each_step \
        and (max_num_prefixes < 0 or len(self.prefixes) < max_num_prefixes)

But update_prefixes_each_step is set to False in focsam/configs/sam/coco_lvis/train_colaug_coco_lvis_1024x1024_320k.py, so update_flag is always False, and the embeddings can never be saved. I set update_prefixes_each_step to True, and the embeddings are saved. Is this the right solution?

YouHuang67 commented 3 months ago

Indeed, setting update_prefixes_each_step to True is necessary. Feel free to test this adjustment. I have also updated the configuration file accordingly. I noticed that in a similar configuration, configs/focsam/coco_lvis/train_colaug_coco_lvis_1024x1024_160k.py, this setting was already True. I apologize for any inconvenience.
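In an OpenMMLab-style config, the fix amounts to a one-line change; the surrounding keys below are illustrative, not copied from the real config file:

```python
# Illustrative fragment of an OpenMMLab-style config; only the
# update_prefixes_each_step key reflects the fix discussed above.
image_embed_loader = dict(
    type='BaseEmbedLoader',
    update_prefixes_each_step=True,  # was False; True enables saving embeds
)
```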