angelita277 opened 3 months ago
Thank you for reaching out, and sorry for the inconvenience. It appears I missed uploading the `data/meta-info/coco_invalid_image_names.json` file; I have now uploaded it. The other file is a cache file that should be generated automatically during data loading. The purpose of `coco_invalid_image_names.json` is to identify annotations that may be problematic and should be excluded from training.
I trained the decoder on 4 V100 GPUs. Here is the log:

```
2024/06/14 09:55:51 - mmengine - INFO - Checkpoints will be saved to /data2/code/laq/focsam/work_dirs/sam/coco_lvis/train_colaug_coco_lvis_1024x1024_320k.
2024/06/14 10:08:09 - mmengine - INFO - Iter(train) [ 200/320000] lr: 1.3276e-05 eta: 13 days, 15:44:23 time: 3.6894 data_time: 0.0102 memory: 14065 loss: 2.4008 dec.los.nfl_loss: 2.4008 dec.met.binary_iou: 0.9027
2024/06/14 10:20:15 - mmengine - INFO - Iter(train) [ 400/320000] lr: 2.6618e-05 eta: 13 days, 13:04:21 time: 3.6616 data_time: 0.0102 memory: 7740 loss: 1.4424 dec.los.nfl_loss: 1.4424 dec.met.binary_iou: 0.7158
2024/06/14 10:32:25 - mmengine - INFO - Iter(train) [ 600/320000] lr: 3.9960e-05 eta: 13 days, 12:25:21 time: 3.6566 data_time: 0.0102 memory: 7740 loss: 1.0784 dec.los.nfl_loss: 1.0784 dec.met.binary_iou: 0.9195
2024/06/14 10:44:38 - mmengine - INFO - Iter(train) [ 800/320000] lr: 5.3302e-05 eta: 13 days, 12:28:17 time: 3.6595 data_time: 0.0102 memory: 7740 loss: 0.8912 dec.los.nfl_loss: 0.8912 dec.met.binary_iou: 0.7826
2024/06/14 10:56:56 - mmengine - INFO - Exp name: train_colaug_coco_lvis_1024x1024_320k_20240614_095248
2024/06/14 10:56:56 - mmengine - INFO - Iter(train) [ 1000/320000] lr: 6.6644e-05 eta: 13 days, 12:46:51 time: 3.6652 data_time: 0.0101 memory: 7740 loss: 0.7754 dec.los.nfl_loss: 0.7754 dec.met.binary_iou: 0.7738
2024/06/14 11:09:15 - mmengine - INFO - Iter(train) [ 1200/320000] lr: 7.9987e-05 eta: 13 days, 12:59:27 time: 3.6660 data_time: 0.0102 memory: 7739 loss: 0.3575 dec.los.nfl_loss: 0.3575 dec.met.binary_iou: 0.4644
2024/06/14 11:21:33 - mmengine - INFO - Iter(train) [ 1400/320000] lr: 9.3329e-05 eta: 13 days, 13:04:49 time: 3.6779 data_time: 0.0101 memory: 7740 loss: 0.3251 dec.los.nfl_loss: 0.3251 dec.met.binary_iou: 0.8465
2024/06/14 11:33:54 - mmengine - INFO - Iter(train) [ 1600/320000] lr: 9.9969e-05 eta: 13 days, 13:13:53 time: 3.6896 data_time: 0.0101 memory: 7740 loss: 0.3183 dec.los.nfl_loss: 0.3183 dec.met.binary_iou: 0.3718
```
So... will this take me 13 days to train?
I wonder where the pre-extracted embeddings are stored: `focsam/data/embeds/colaug_coco_1024x1024_sam_vit_huge` or `focsam/work_dirs/sam/coco_lvis/train_colaug_coco_lvis_1024x1024_320k`? The former is empty, and the latter only stores `config.py` and logs.
It appears each training iteration takes over 3 seconds, which leads to the estimated training duration. The pre-extracted embeddings are generated dynamically during training and are typically stored in `focsam/data/embeds/colaug_coco_1024x1024_sam_vit_huge`. Please make sure the path under `data` is used rather than `work_dirs`, since the `work_dirs` directory is intended for logs and model weights. As the total size of the embeddings can reach 300-400 GB, they are not uploaded but generated on the fly using SAM, which should significantly speed up training once cached. Make sure there is sufficient disk space available to accommodate this.
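As a sanity check, the ETA in the log follows directly from the per-iteration time it reports:

```python
# Back-of-the-envelope ETA check using values from the log above.
seconds_per_iter = 3.66      # "time: 3.6616" at iteration 400
total_iters = 320_000        # the 320k schedule
days = seconds_per_iter * total_iters / 86_400  # seconds per day
print(f"{days:.1f} days")    # ~13.6 days, matching the "eta: 13 days" in the log
```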
It seems that the pre-extracted embeddings are not stored. Could you tell me which file implements the embedding storage function, so I can check what's wrong with it?
There is a warning:

```
06/14 14:12:42 - mmengine - WARNING - Failed to search registry with scope "mmseg" in the "embed_loader" registry tree. As a workaround, the current "embed_loader" registry in "focsam" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmseg" is a correct scope, or whether the registry is initialized.
```

Maybe this is the reason?
> It seems that the pre-extracted embeddings are not stored. Could you tell me which file implements the embedding storage function?
You'll notice in your training configuration that there's a variable named `image_embed_loader` with the type `BaseEmbedLoader`. To find the implementation, use your IDE's global search: in PyCharm, for example, select `BaseEmbedLoader` and double-tap Shift, which should take you to the corresponding `.py` file, likely `focsam/embed_loaders/base.py`. Most IDEs offer a similar feature for locating definitions. That file contains the caching implementation you're looking for. As this repository is built on the OpenMMLab codebase, you'll typically need to trace configuration settings in this manner to find the relevant code.
> There is a warning `06/14 14:12:42 - mmengine - WARNING - Failed to search registry with scope "mmseg" in the "embed_loader" registry tree. [...]`, so maybe this is the reason?
It seems likely that the issue stems from the registry problem mentioned in the warning. I will adjust the registry settings.
I've just updated the registration for the embed_loader. This should align the scope correctly and hopefully resolve the issues.
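For context on what the warning means, the scope mechanics can be illustrated with a minimal pure-Python sketch of the registry pattern (mmengine's real `Registry` additionally handles parent registries and cross-scope lookup; `EMBED_LOADERS` and `MyEmbedLoader` here are illustrative names, not the repo's actual identifiers):

```python
# Minimal sketch of the mmengine-style registry pattern (illustrative only).
class Registry:
    def __init__(self, name, scope):
        self.name = name
        self.scope = scope      # should match the package the modules live in
        self._modules = {}      # (e.g. 'focsam'); looking things up under an
                                # unrelated scope like 'mmseg' triggers the warning

    def register_module(self):
        def _register(cls):
            self._modules[cls.__name__] = cls
            return cls
        return _register

    def build(self, cfg):
        # Configs refer to the class by its registered name via the 'type' key.
        cfg = dict(cfg)
        cls = self._modules[cfg.pop('type')]
        return cls(**cfg)

EMBED_LOADERS = Registry('embed_loader', scope='focsam')

@EMBED_LOADERS.register_module()
class MyEmbedLoader:                 # hypothetical loader class
    def __init__(self, embed_dir):
        self.embed_dir = embed_dir

loader = EMBED_LOADERS.build(dict(type='MyEmbedLoader', embed_dir='data/embeds'))
print(type(loader).__name__)  # MyEmbedLoader
```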
There is a required file named `data/embeds/colaug_coco_1024x1024_sam_vit_huge/meta-info.json`. Is this a default file, or is it generated?
There is no code storing the caches. The `embed_loader` only loads the caches into the decoder, but I need to store the embeddings from the encoder.
You might want to check the implementation starting from line 42 in `focsam/embed_loaders/base.py`, where the `__call__` method handles the primary functionality. This method checks for existing cached embeddings and, if none are found, it generates and stores them, including constructing the meta-info file.
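In outline, that cache-or-compute pattern looks roughly like the following (a simplified sketch, not the repo's actual code; `CachedEmbedLoader` and `compute_embedding` are illustrative, with the latter standing in for the SAM encoder call):

```python
import os
import numpy as np

class CachedEmbedLoader:
    """Simplified sketch of a cache-or-compute embedding loader."""

    def __init__(self, embed_dir):
        self.embed_dir = embed_dir
        os.makedirs(embed_dir, exist_ok=True)

    def prefix_to_embed_file(self, prefix):
        # One .npy file per sample prefix.
        return os.path.join(self.embed_dir, f'{prefix}.npy')

    def __call__(self, prefix, compute_embedding):
        path = self.prefix_to_embed_file(prefix)
        if os.path.exists(path):
            return np.load(path)          # cache hit: reuse stored embedding
        embed = compute_embedding()       # cache miss: run the encoder
        np.save(path, embed)              # persist for subsequent epochs
        return embed
```

Once the cache is warm, the expensive encoder is skipped entirely, which is where the speed-up after the first pass comes from.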
well...
Here is the code that saves the embeddings:

```python
if update_flag:
    np.save(
        self.prefix_to_embed_file(prefix),
        embed.cpu().numpy())
```
and here is the value of `update_flag`:

```python
update_flag = self.update_prefixes_each_step \
    and (max_num_prefixes < 0 or len(self.prefixes) < max_num_prefixes)
```
But `update_prefixes_each_step` is set to False in `focsam/configs/sam/coco_lvis/train_colaug_coco_lvis_1024x1024_320k.py`, so `update_flag` is always False and the embeddings can never be saved. I set `update_prefixes_each_step` to True, and the embeddings are saved. Is this the right solution?
Indeed, setting `update_prefixes_each_step` to `True` is necessary. Feel free to test this adjustment. I have also updated the configuration file accordingly. I noticed that in a similar configuration, `configs/focsam/coco_lvis/train_colaug_coco_lvis_1024x1024_160k.py`, this setting was already `True`. I apologize for any inconvenience.
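For anyone applying the fix by hand, the change amounts to flipping one flag in the embed-loader entry of the mmengine-style Python config. Only `update_prefixes_each_step` and the `BaseEmbedLoader` type come from the discussion above; the other fields are placeholders for whatever the real config contains:

```python
# Sketch of the relevant config entry; embed_dir is a placeholder value.
image_embed_loader = dict(
    type='BaseEmbedLoader',
    embed_dir='data/embeds/colaug_coco_1024x1024_sam_vit_huge',
    update_prefixes_each_step=True,  # was False; embeddings are never saved otherwise
)
```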
When I try to train by running `train.py`, I get the error `'COCOPanopticDataset' object has no attribute 'ignore_sample_indices'`. I found that this is because the files `data/meta-info/coco_invalid_image_names.json` and `data/meta-info/coco.json` are needed, but there are no such files under my `data/meta-info`. Are these two files default files, or are they generated by `train.py`? Why can't my `train.py` generate them? I am sure that the COCO and LVIS datasets are arranged according to `DATASET.md`.