Scalsol / mega.pytorch

Memory Enhanced Global-Local Aggregation for Video Object Detection, CVPR2020

which config file is used? #53

Open zhanghaoo opened 4 years ago

zhanghaoo commented 4 years ago

I know that when training with a single GPU these config files are used: BASE_RCNN_1gpu.yaml and vid_R_101_C4_MEGA_1x.yaml.

But where can I modify the DATALOADER parameter NUM_WORKERS? Changing it in defaults.py has no effect, and I can't find any yaml files that contain that parameter.
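For reference, a minimal sketch assuming mega.pytorch keeps maskrcnn-benchmark's yacs-based config system (the `mega_core.config` import path mirrors `maskrcnn_benchmark.config` and is an assumption here). Values merged from a yaml file or an override list take precedence over defaults.py, which would explain why editing defaults.py alone appears useless:

```python
# Sketch: overriding DATALOADER.NUM_WORKERS with yacs, maskrcnn-benchmark style.
from mega_core.config import cfg  # assumed import path

cfg.merge_from_file("configs/MEGA/vid_R_101_C4_MEGA_1x.yaml")
# A flat key/value list overrides both defaults.py and the yaml file.
cfg.merge_from_list(["DATALOADER.NUM_WORKERS", 1])
print(cfg.DATALOADER.NUM_WORKERS)  # -> 1
```

In maskrcnn-benchmark-style training scripts the same override can usually be appended to the command line, e.g. `python tools/train_net.py --config-file configs/MEGA/vid_R_101_C4_MEGA_1x.yaml DATALOADER.NUM_WORKERS 1`, since trailing arguments are passed to `cfg.merge_from_list`.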

Can anyone help?

Thank you very much!

zhanghaoo commented 4 years ago

@zhanghaoo I can't find any yaml files that contain that parameter.

zhanghaoo commented 4 years ago

Besides, I think this parameter, NUM_WORKERS, should be set according to the number of GPUs. In other words, NUM_WORKERS should be 1 when training with a single GPU.

What made me think so? After the message "start training" appears, the program prints nothing further, which made me think the machine had hung!

What can I do?!

launchauto commented 4 years ago

You don't need to change num_workers. I have trained on 1, 4, and 8 Tesla V100 (32 GB) GPUs, keeping num_workers=4 throughout. Just wait: the first run may take some time to build the annotation cache.

BTW, with 4 V100 GPUs you can reach the mAP the author reported in the paper; with 8 V100 GPUs the mAP may be slightly lower (0.002 base learning rate, 60k iterations).

zhanghaoo commented 4 years ago

Hello, I'm sorry: the server has been running other jobs these past few days, so I couldn't reproduce the problem I saw the night you replied to me.

Now the program is running train_net.py, as shown in the attached picture.

Questions are as follows:

  1. It is really time-consuming, but what I don't understand is why GPU utilization is often 0 after "start training" is printed. What is the program doing?
  2. I added the print statement marked by the red arrow. Why doesn't the program print this line, instead jumping straight to the statement at the green arrow and printing "start training"?

Part of the configuration is as follows: 1. NVIDIA GeForce GTX 1080 Ti 2. CUDA 10.1.243 3. cuDNN 7.6.5

Thank you very much for your answers!


launchauto commented 4 years ago


For question 1: check your log.txt to see how many GPUs are in use. If all of them are in use, GPU utilization should be above 80%. The task is time-consuming: with 4 Tesla V100 GPUs I train for nearly a day and test the whole validation set for nearly 2 hours, and I end up with the result the author reported in his paper. You will no doubt need much more time on an NVIDIA 1080 Ti. For question 2: I don't see your attached picture. What does it show? Can you write in Chinese?

launchauto commented 4 years ago


The training set holds 109,815 images: 53,621 for DET and 56,194 for VID. You could train on the VID data alone and accept lower accuracy (around 76.8% mAP). The validation set holds 176,126 images. The whole 4-GPU MEGA training schedule is 120k iterations with batch size 4, so the number of epochs is 120000 * 4 / 109815, approximately 4.37.
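Restating that arithmetic as a quick check (all numbers taken from the comment above):

```python
# Epochs = total images seen during training / dataset size.
det_images = 53621
vid_images = 56194
dataset_size = det_images + vid_images   # 109815
iterations = 120_000                     # 4-GPU MEGA schedule
batch_size = 4
print(iterations * batch_size / dataset_size)  # ~4.37 epochs
```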

zhanghaoo commented 4 years ago

Sorry, I may not have expressed myself clearly. I replied by email and the attachment was in the email; let me restate everything here in the issue.

1. GPU usage. log.txt reports:

2020-09-01 21:52:38,942 mega_core INFO: Using 1 GPUs
2020-09-01 21:52:38,942 mega_core INFO: Namespace(config_file='configs/MEGA/vid_R_101_C4_MEGA_1x.yaml', distributed=False, launcher='pytorch', local_rank=0, master_port='27341', motion_specific=True, opts=['OUTPUT_DIR', 'training_dir/MEGA_R_101_1x'], save_name='', skip_test=False)

2. About the attached picture:

    After training starts and the message "start training" appears, the program stops responding and GPU utilization stays low. I wanted to know where it went wrong and whether training had actually started, so I debugged it.
    While debugging I found a problem: in trainer.py I added print("this command will not be printed.") (red arrow) right before the statement logger.info("Start training") (green arrow), but that line is never printed.
    I don't understand why.

3. About training. I changed the number of images in the dataset and the number of training images listed in the index txt file. Right now I just want the training and testing modules to run end to end, so that I can run your method on my own dataset, but I can't get it to run.

(PS: thank you for answering a beginner's questions so patiently. Thank you very much, and I wish you well!!!)

zhanghaoo commented 4 years ago

I changed the number of images in the dataset and the number of training images listed in the index txt file. For training I only used:

train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 10 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 30 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 50 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 70 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 90 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 110 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 130 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 150 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 170 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 190 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 210 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 230 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 250 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 270 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00000000 1 290 300
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 1 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 4 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 8 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 11 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 14 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 17 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 20 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 24 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 27 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 30 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 33 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 36 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 40 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 43 48
train/ILSVRC2015_VID_train_0000/ILSVRC2015_train_00001000 1 46 48

I only used the 30 images above. Right now I just want the training and testing modules to run end to end, so that I can run your method on my own dataset. But it won't run.

launchauto commented 4 years ago


I'm not the author; I only reproduced the original author's results. You probably haven't actually started training: once training starts, the program prints the iteration count and the loss, and saves a checkpoint every 2500 iterations. Try killing all the idle processes currently occupying the GPU and then rerun the program. As for the missing print, look up why Python's print doesn't flush immediately; I suspect it's output buffering. Try logger.info instead, which should get printed.
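If buffering really were the cause, the usual remedies look like this (a generic Python sketch, not code from this repo; the logger name is assumed, and the follow-up below reports that neither helped here, which points to the process hanging before the statement is reached rather than to buffering):

```python
import logging
import sys

# Force the write out immediately; equivalent to running `python -u`
# or setting PYTHONUNBUFFERED=1 in the environment.
print("this command will not be printed.", flush=True)
sys.stdout.flush()

# Or route the message through the logging setup configured by the
# training script, whose handlers write to the console and to log.txt.
logger = logging.getLogger("mega_core.trainer")  # logger name assumed
logger.info("this command will not be printed.")
```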

zhanghaoo commented 4 years ago

Impressive.

Ah, I don't get it. Following your hints I tried both: 1. print("this command will not be printed.", flush=True) 2. logger.info("this command will not be printed")

Neither gets printed, which is really frustrating; I feel I have nowhere to start. Once training begins and do_train is called, that is, once execution enters trainer.py, it is as if the program weren't reading that .py file at all: whatever I change, it only prints "start training" and then seems stuck in an endless loop, with GPU utilization jumping between 0 and 5.

I believe all my earlier steps were carried out correctly. This is the best and clearest open-source code I have seen, but for some reason I run into many problems when reproducing it myself.

So far: 1. The environment was set up following the instructions, so I can guarantee it is correct. 2. The dataset paths and format are correct; I only changed the number of images. 3. The training command line is correct, using 1 GPU.

But it just won't train.

Reply whenever you have time; I will keep looking into it myself. I think MEGA fits my current work very well and I don't want to give it up. Thank you!

ZhijunHou commented 3 years ago


Bro, did you reproduce it successfully? Could you share your parameter settings and your machine's configuration? My accuracy stays around 79, lower than in the original paper.

liwenjielongren commented 3 years ago

Hello, where do I change the batch size? After changing it, training still needs a lot of memory, and I'm not sure whether I changed the wrong place.

ZhijunHou commented 3 years ago

You can add it in the corresponding config file, or change it in BASE_RCNN_xgpu.yaml; pick the config file that matches your number of GPUs. The parameter name is imgs_per_batch, if I remember correctly.
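For reference, in maskrcnn-benchmark-style configs the exact key is SOLVER.IMS_PER_BATCH for training (and TEST.IMS_PER_BATCH for testing); a minimal sketch, assuming mega_core keeps that layout and import path:

```python
from mega_core.config import cfg  # assumed import path

cfg.merge_from_file("configs/BASE_RCNN_1gpu.yaml")
# SOLVER.IMS_PER_BATCH is the global training batch size summed over
# all GPUs; lowering it is the usual first step against out-of-memory.
cfg.merge_from_list(["SOLVER.IMS_PER_BATCH", 1, "TEST.IMS_PER_BATCH", 1])
```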


liwenjielongren commented 3 years ago

OK, thanks.

ZhijunHou commented 3 years ago

Let's learn together and keep in touch (^^)


zhanghaoo commented 3 years ago

I'm back; now that the busy period is over, I'm going to keep working on this. @ZhijunHou @liwenjielongren

Guys, did you all get it running?

ZhijunHou commented 3 years ago

I got it running; it's just that the accuracy isn't great. Where does your error occur?


zhanghaoo commented 3 years ago

Very strange: mine gets stuck right after printing "start training". I still think something is wrong with how the GPUs are read.

In make_data_loader, num_gpus = get_world_size() doesn't give me the number of GPUs, yet it doesn't raise an error either.
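For what it's worth, in maskrcnn-benchmark-style code get_world_size() usually lives in a comm.py utility shaped roughly like this, which is exactly why it returns silently instead of erroring when torch.distributed was never initialized (a sketch of the common pattern, not verified against this repo):

```python
import torch.distributed as dist

def get_world_size():
    # Fall back to 1 (single process) when the distributed backend is
    # unavailable or uninitialized; no exception is raised, so a wrong
    # launcher setup shows up only as num_gpus == 1.
    if not dist.is_available():
        return 1
    if not dist.is_initialized():
        return 1
    return dist.get_world_size()
```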

Bro, my email is zh_pure@sina.com; I'd like to ask you some questions and could use a bit of guidance.

Something has come up and I have to get back to my dorm; looking forward to hearing from you!

zhanghaoo commented 3 years ago

Solved.

It runs now.

I've run my own dataset too.

What's left is to analyze why the results aren't that good.

launchauto commented 3 years ago


I tried it and it runs. Training it myself or directly loading the author's trained model both reproduce the author's results, for both RDN and MEGA. On my own dataset it does a bit better, but the gain isn't large. ImageNet VID is too easy.

launchauto commented 3 years ago


Did you follow the author's configuration when setting up the environment? Did you change anything?

I just followed INSTALL.md: Ubuntu 16.04, CUDA 9.2, PyTorch 1.3.0+cu92, torchvision 0.4.1+cu92, Python 3.7, on either 4 or 8 Tesla V100 GPUs. Facebook no longer updates the maskrcnn-benchmark framework, so PyTorch 1.4 or higher may cause problems.

launchauto commented 3 years ago


If the accuracy is too low, it might be that the backbone's pretrained weights weren't loaded correctly when loading offline. ./mega_core/config/paths_catalog.py lists the paths of the various pretrained models. R101.pkl is the Detectron 1 MSRA pretrained model, the C4 variant, not FPN. If you load the weights online, this shouldn't be the problem. Unlike mmdetection, maskrcnn-benchmark doesn't report which layers' parameters are missing when the pretrained model is loaded incorrectly; it simply stays silent.

zhanghaoo commented 3 years ago

@launchauto Bro, could you share the network architecture, i.e. a diagram of the network structure? I've run into problems; my understanding of it has gone wrong somewhere. Could you contact me? My email is above.

zhanghaoo commented 3 years ago

@ZhijunHou Could you reach me by email to discuss some questions about MEGA, bro?

ZhijunHou commented 3 years ago

What's the problem, bro? 🤣


zhanghaoo commented 3 years ago

@ZhijunHou Bro, I don't know whether you've looked closely at the Relation Networks part that MEGA cites.

Do you know what the Wv linear-transformation weight in the relation module actually looks like?
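For context, in the Relation Networks formulation that MEGA builds on (Hu et al., "Relation Networks for Object Detection", CVPR 2018), W_V is a learned linear projection (a fully connected layer) applied to each reference proposal's appearance feature before the attention-weighted sum; roughly:

```latex
% Relation module of Hu et al.; f_A^m is the appearance feature of
% proposal m, and W_V is a learned d x d linear projection of it.
f_R(n) = \sum_m \omega^{mn} \left( W_V \, f_A^m \right), \qquad
\omega^{mn} = \frac{\omega_G^{mn} \exp(\omega_A^{mn})}
                   {\sum_k \omega_G^{kn} \exp(\omega_A^{kn})}
% \omega_A^{mn} is scaled dot-product attention between W_K f_A^m and
% W_Q f_A^n; \omega_G^{mn} embeds the relative box geometry.
```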

zhanghaoo commented 3 years ago

@ZhijunHou Also, bro: suppose the test video has only 20 frames, but the number of aggregated frames given by the formula, Tm*Nl+Tg, is 40. How are the extra 20 handled?? @launchauto Bro, how did you handle this when reproducing? Could you tell me?

zhanghaoo commented 3 years ago

@launchauto T_T T_T T_T T_T T_T

zhanghaoo commented 3 years ago

Guys, bro! @launchauto @ZhijunHou @liwenjielongren @joe660

zhanghaoo commented 3 years ago

@joe660 ???

Well... good grief, what did you ask me? I never saw it at all.

zhanghaoo commented 3 years ago

You really are a little genius.

Though I doubt anyone will join...

QQ group number: 728816033. The QR code is attached.

ZhijunHou commented 3 years ago

Why does it feel like this thing has become popular again recently? People are asking me questions on Zhihu too…

(In reply to joe660: "Get those experts above to join the group, haha. Weren't you all discussing problems up there before?")

joe660 commented 3 years ago


Are you at Zhejiang University?

asmallcat commented 3 years ago


Did you figure out what caused it? I'm also reproducing on a single GPU, using the official config files and dataset, and with the ResNet-101 backbone I only get a little over 78%, much worse than the original paper. What could the reason be?

Flyingdog-Huang commented 3 years ago

python demo/demo.py mega configs/MEGA/vid_R_101_C4_MEGA_1x.yaml configs/MEGA/MEGA_R_101.pth --video --visualize-path datasets/vid/1.mp4 --output-folder visualization/1_MEGA

1%|▉ | 23/1798 [00:23<30:20, 1.03s/it]
Traceback (most recent call last):
  File "demo/demo.py", line 69, in <module>
    visualization_results = vid_demo.run_on_video(args.visualize_path)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/demo/predictor.py", line 497, in run_on_video
    results = self.run_on_image_folder(tmpdir, suffix='.jpg')
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/demo/predictor.py", line 484, in run_on_image_folder
    image_with_boxes = self.run_on_image(original_image, infos)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/demo/predictor.py", line 511, in run_on_image
    predictions = self.compute_prediction(image, infos)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/demo/predictor.py", line 531, in compute_prediction
    predictions = self.model(infos)
  File "/home/hlx/anaconda3/envs/MEGA/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/detector/generalized_rcnn_mega.py", line 78, in forward
    return self._forward_test(images["cur"], infos)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/detector/generalized_rcnn_mega.py", line 221, in _forward_test
    x, result, detector_losses = self.roi_heads(feats, proposals_list, None)
  File "/home/hlx/anaconda3/envs/MEGA/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/roi_heads/roi_heads.py", line 26, in forward
    x, detections, loss_box = self.box(features, proposals, targets)
  File "/home/hlx/anaconda3/envs/MEGA/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/roi_heads/box_head/box_head.py", line 96, in forward
    x = self.feature_extractor(features, proposals)
  File "/home/hlx/anaconda3/envs/MEGA/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/roi_heads/box_head/roi_box_feature_extractors.py", line 655, in forward
    return self._forward_test(x, proposals)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/roi_heads/box_head/roi_box_feature_extractors.py", line 919, in _forward_test
    feat_cur = self._forward_test_single(i, self.local_cache[i], memory)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/roi_heads/box_head/roi_box_feature_extractors.py", line 817, in _forward_test_single
    position_embedding = self.cal_position_embedding(rois_cur, rois_ref)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/roi_heads/box_head/roi_box_feature_extractors.py", line 244, in cal_position_embedding
    position_embedding = self.extract_position_embedding(position_matrix, feat_dim=64)
  File "/home/hlx/objectDetection/ht/project/mega.pytorch/mega_core/modeling/roi_heads/box_head/roi_box_feature_extractors.py", line 140, in extract_position_embedding
    embedding = torch.cat([sin_mat, cos_mat], dim=3)
RuntimeError: CUDA out of memory. Tried to allocate 594.00 MiB (GPU 0; 4.94 GiB total capacity; 2.43 GiB already allocated; 586.25 MiB free; 936.28 MiB cached)

Has any of you run into this problem? @launchauto @ZhijunHou @asmallcat @zhanghaoo @liwenjielongren

meikorol commented 1 year ago

When building my own dataset, how should the files be laid out, and how is the txt file that records frame numbers generated? I downloaded the VID dataset and found it also contains the corresponding videos; are those needed here?