CalayZhou / MBNet

Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems (ECCV 2020)

What GPU are you using #7

Open · JayAust opened this issue 4 years ago

JayAust commented 4 years ago

Thank you very much for publishing your source code. When I tried to run it, I found that demo.py runs smoothly following your readme.md, but test.py and train.py do not. Could you please tell me which GPU you are using? I am on an RTX 2060S, and I suspect the training and testing failures are because my GPU memory does not meet the model's requirements.

CalayZhou commented 4 years ago

I have tested the code on a single GTX 1080 Ti and a single RTX 2080 Ti, and the model runs smoothly on both GPUs. demo.py and test.py are almost the same (test.py additionally saves the txt result files), so I suggest trying test.py first. train.py needs more GPU memory.

JayAust commented 4 years ago

Thank you very much for your reply. I found that some of the test set images downloaded through Baidu Cloud were corrupted, which caused test.py to fail. train.py ran successfully after tuning the batch size.
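
In case anyone hits the same problem, a quick way to locate corrupted images before running test.py is a small check like the one below (the directory path is just a placeholder for wherever the KAIST test images were extracted):

```python
import os
from PIL import Image

# Placeholder path: point this at the extracted KAIST test images.
DATA_DIR = "data/kaist_test"

bad_files = []
for root, _, files in os.walk(DATA_DIR):
    for name in files:
        if not name.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        path = os.path.join(root, name)
        try:
            with Image.open(path) as img:
                img.verify()  # raises an exception if the file is truncated or corrupted
        except Exception:
            bad_files.append(path)

print("corrupted files:", len(bad_files))
for path in bad_files:
    print(path)
```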

echoofluoc commented 4 years ago

@JayAust I wonder what performance you got after training MBNet. I get MR results ranging from 9.2% to 9.9% (weights between e6_l224 and e8_l896) without any modification to the code.

JayAust commented 4 years ago

> @JayAust I wonder what performance you got after training MBNet. I get MR results ranging from 9.2% to 9.9% (weights between e6_l224 and e8_l896) without any modification to the code.

I used a single RTX 2080 Ti for 30 epochs of training without changing any code and got an MR of 10.36%, but judging from the output loss, the network does not seem to converge. Could you tell me which GPU you used for training?

echoofluoc commented 4 years ago

@JayAust A single TITAN RTX. The default number of training epochs is set to 8 by the author, and they said 7 epochs are enough to reach the 8.13% MR reported in the paper. I don't know why we get results so far from the paper's.

JayAust commented 4 years ago

> @JayAust A single TITAN RTX. The default number of training epochs is set to 8 by the author, and they said 7 epochs are enough to reach the 8.13% MR reported in the paper. I don't know why we get results so far from the paper's.

Perhaps you can check the records.txt file in the same directory as the output weight files to see whether the loss still tends to converge. I think the reason we cannot reproduce the paper's result at the seventh epoch may be that the initial weights we use differ from the original author's, or that some hyperparameters differ. I got an MR of 11.46% after training for 20 epochs; after another 10 epochs, the MR dropped to 10.36%.
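
For reference, this is roughly how I eyeball the loss trend; the snippet assumes records.txt has one record per line with the total loss as the last numeric field, which may not exactly match the actual format, and the path is only a placeholder:

```python
import matplotlib.pyplot as plt

# Assumed: one training record per line, total loss as the last numeric field.
# Adjust the path to wherever your weight files and records.txt are written.
losses = []
with open("output/records.txt") as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue
        try:
            losses.append(float(fields[-1]))
        except ValueError:
            continue  # skip headers or non-numeric lines

plt.plot(losses)
plt.xlabel("record index")
plt.ylabel("total loss")
plt.title("MBNet training loss from records.txt")
plt.show()
```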

echoofluoc commented 4 years ago

@JayAust I ran two experiments of 30 epochs and found that the total loss did keep declining as training went on, but the MR only converged to 9.3%, compared with results oscillating in the 9.2%-9.9% range in the early epochs.

CalayZhou commented 4 years ago

Sorry for the delayed reply. This is caused by unstable optimization during training, which is also mentioned in the ALFNet issues. I think possible reasons are the cascaded RPN structure of the single-stage detector and the sparse pedestrian distribution in the KAIST dataset. Due to the tight code-cleaning schedule, I neglected to change some hyperparameters, which increases the instability of model training; I have fixed this in the latest released code. For example, a larger variation in the HSV data augmentation has a negative effect on the optimization of the illumination loss, and I find that when the optimization of the illumination loss is hindered, the model tends to perform worse. In addition, a smaller batch size may lead to more unstable training. If you have less GPU memory, I recommend reducing the channels of the modality alignment module in ./keras_MBNet/model/model_AP_IAFA.py, lines 39~49. This has a relatively small impact on the final result and reduces GPU memory usage.
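
As a rough illustration only (the actual layer names and channel counts are the ones defined in model_AP_IAFA.py, lines 39~49, and may differ from this sketch), reducing the channel width simply means using fewer filters in the alignment convolutions:

```python
from keras.layers import Input, Conv2D

# Hypothetical sketch: a placeholder feature map standing in for the
# modality alignment module's input; names and numbers are illustrative.
feat = Input(shape=(None, None, 256))

# original width (placeholder number of filters)
aligned = Conv2D(256, (3, 3), padding='same', activation='relu')(feat)

# reduced width: fewer filters lowers GPU memory use, with only a small
# impact on the final miss rate per the discussion above
aligned_small = Conv2D(128, (3, 3), padding='same', activation='relu')(feat)
```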