fpv-iplab / rulstm

Code for the Paper: Antonino Furnari and Giovanni Maria Farinella. What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention. International Conference on Computer Vision, 2019.
http://iplab.dmi.unict.it/rulstm

Can't apply DataParallel #15

Closed zhoumumu closed 3 years ago

zhoumumu commented 3 years ago

Training in parallel mode crashes with a "double free" error like the one below:

free(): double free detected in tcache 2 Aborted (core dumped)

Training on a single GPU is quite slow. What could be causing this bug?
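For context, multi-GPU training in PyTorch is typically enabled by wrapping the model in `nn.DataParallel`, which replicates it across visible GPUs and splits the batch dimension. A minimal sketch (the `ToyModel` class and its dimensions are placeholders, not the repo's actual RU-LSTM model):

```python
import torch
import torch.nn as nn

# Illustrative stand-in model; the real model is built elsewhere in the repo.
class ToyModel(nn.Module):
    def __init__(self, in_dim=1024, num_classes=2513):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, x):
        return self.fc(x)

model = ToyModel()
if torch.cuda.device_count() > 1:
    # Replicates the module on each GPU and scatters the batch across them.
    model = nn.DataParallel(model)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

batch = torch.randn(8, 1024, device=device)
out = model(batch)
print(out.shape)  # torch.Size([8, 2513])
```

On a CPU-only machine `DataParallel` simply falls back to running the wrapped module, so the same code works in both settings.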

antoninofurnari commented 3 years ago

Hello,

Did you modify the code to allow for multi-gpu training? If so, can you provide the modified code?

From a quick search, this issue seems to be related to PyTorch or your overall configuration, rather than to the code hosted in this repository (https://github.com/pytorch/pytorch/issues/33661).

Antonino

zhoumumu commented 3 years ago

Hi, I've switched to another environment (Python 3, PyTorch 1.5, CUDA 10.1), and the "double free" error is gone.

By the way, a small note: you may want to modify dataset.py for better I/O efficiency if you plan to run in DataParallel mode. Specifically, preload all features into memory rather than fetching them during __getitem__(). Otherwise you may end up with 0% GPU utilization.
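The suggested change can be sketched as a `Dataset` that reads every feature once at construction time, so `__getitem__` becomes a pure in-memory lookup. This is only an illustration: the class, `_load_feature` stub, and tensor shapes below are hypothetical, not the repo's actual dataset API.

```python
import torch
from torch.utils.data import Dataset

class PreloadedFeatureDataset(Dataset):
    """Sketch: trade RAM for I/O-free __getitem__ calls during training."""

    def __init__(self, video_ids, preload=True):
        self.video_ids = video_ids
        self.preload = preload
        if preload:
            # Single pass over the data at construction time; afterwards
            # every access is an in-memory lookup.
            self.cache = [self._load_feature(v) for v in video_ids]

    def _load_feature(self, video_id):
        # Placeholder for the real on-disk feature read.
        torch.manual_seed(abs(hash(video_id)) % (2 ** 31))
        return torch.randn(14, 1024)  # e.g. 14 time-steps of 1024-d features

    def __len__(self):
        return len(self.video_ids)

    def __getitem__(self, idx):
        if self.preload:
            return self.cache[idx]  # no disk access on the hot path
        return self._load_feature(self.video_ids[idx])  # lazy read

ds = PreloadedFeatureDataset(['v1', 'v2', 'v3'])
print(len(ds), ds[0].shape)  # 3 torch.Size([14, 1024])
```

With `preload=False` the class keeps the original lazy behavior, which matches the opt-in flag idea discussed below in this thread.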

Really appreciate your reply.

antoninofurnari commented 3 years ago

Hi, I’m glad you solved the issue and thank you for your suggestion about pre-loading features.

The current implementation avoids that in order to minimize the amount of RAM needed for training. That said, it would make sense to add a flag so that people can choose between the two schemes; I believe pre-loading features can be a significant speed-up when they are stored on slow disks.

If you end up modifying the code that way, feel free to send a pull request!

Best, Antonino

zhoumumu commented 3 years ago

Hi, I've sent the pull request. 

I also have a new question about reproducing the results of one of the teams in this year's anticipation challenge: the class-balanced loss and DRW (deferred re-weighting) used by the 2nd-place team. I've read the challenge report, and the method seems easy to implement; it does improve a simple LSTM baseline for me, but I can't get it to work on top of RULSTM. I wonder whether I made a mistake somewhere. Would it be possible for you to give me their email address? I'd like to ask for their advice and check the implementation details; that would help a lot.
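For reference, the combination mentioned here is usually the effective-number class-balanced weighting of Cui et al. (CVPR 2019) plus deferred re-weighting (DRW), where the weights are only switched on after a warm-up phase. A hedged sketch, with hypothetical class counts and a made-up `drw_epoch` value (the 2nd-place team's exact recipe may differ):

```python
import torch
import torch.nn as nn

def class_balanced_weights(counts, beta=0.9999):
    """Effective-number weights: w_c = (1 - beta) / (1 - beta^{n_c}),
    normalized so the weights sum to the number of classes."""
    counts = torch.as_tensor(counts, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(counts)

def make_criterion(epoch, counts, drw_epoch=20):
    """DRW: uniform loss during warm-up, class-balanced loss afterwards."""
    if epoch >= drw_epoch:
        return nn.CrossEntropyLoss(weight=class_balanced_weights(counts))
    return nn.CrossEntropyLoss()

counts = [5000, 500, 50]  # hypothetical per-class sample counts
w = class_balanced_weights(counts)
print(w)  # rarer classes receive larger weights
```

The key property is that rare classes get up-weighted smoothly rather than by raw inverse frequency, and DRW prevents the re-weighting from destabilizing early training.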

Best, Zhoumumu


antoninofurnari commented 3 years ago

Many thanks for the pull request! It might take some time for me to review it and I'll get back in case I have any doubts.

Regarding the contacts of the 2nd-place winners of this year's EK challenge: perhaps you are referring to the method described in the technical report (https://epic-kitchens.github.io/Reports/EPIC-KITCHENS-Challenges-2021-Report.pdf) on page 37? In that case, you can find all the email addresses in the report.

If you end up being able to replicate the results (or even improve the current ones), feel free to get back in touch or create another pull request, so we can update the code for the benefit of others; this is very much appreciated! Just keep in mind that it is best to add new functionalities as "opt-in" parameters/flags, so that the standard (legacy) behavior is preserved by default.

Thanks again for your interest and support!

zhoumumu commented 3 years ago

Oh, I found it! I should have checked the report more carefully.

Thank you for your replies! I'll keep in touch if I make progress.
