I am guessing this has to do with the way Python does "threading". As long as you are executing Python code and do not dispatch into C libraries, there is no parallel processing because of Python's global interpreter lock: https://docs.python.org/3/glossary.html#term-global-interpreter-lock All VPF operations return rather quickly, at which point the lock is re-acquired, so you might want to look into multiprocessing instead of threading. Another option would be to do decode+classify in multiple threads instead of a producer-consumer scheme: PyTorch dispatches into a C library, so the network executes "longer", leaving more time for another thread to proceed. I hope this makes sense and is not too misleading.
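To illustrate that second option, here is a minimal sketch of one-thread-per-video decode+classify; `decode_next_frame` is a hypothetical stand-in for the VPF decode call from the samples, and one inference-only model is shared by all threads:

```python
import threading
import torch
import torchvision

# One model shared by all threads (inference only, no gradients).
model = torchvision.models.resnet50().eval().cuda()

def decode_next_frame(video_path):
    # Hypothetical stand-in for the VPF decode call; returns a CUDA
    # tensor of shape (3, 224, 224), or None at end of stream.
    return None

def worker(video_path: str) -> None:
    while True:
        frame = decode_next_frame(video_path)
        if frame is None:
            break
        with torch.no_grad():
            # The forward pass runs in C/CUDA code and releases the GIL,
            # giving the other threads time to decode.
            _ = model(frame.unsqueeze(0))

threads = [threading.Thread(target=worker, args=(v,)) for v in ("a.mp4", "b.mp4")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```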
Also, in general - have you identified having everything in one thread as a bottleneck? You could check this by looking at GPU utilization or by profiling with Nsight Systems.
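For the GPU utilization check, one option is to poll NVML from Python while the pipeline runs (this assumes the pynvml bindings are installed); `nvidia-smi` or Nsight Systems give more detail:

```python
# Poll compute, memory, and NVDEC utilization once per second on GPU 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    dec_util, _period = pynvml.nvmlDeviceGetDecoderUtilization(handle)
    print(f"compute {util.gpu}%  mem {util.memory}%  nvdec {dec_util}%")
    time.sleep(1.0)
pynvml.nvmlShutdown()
```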
Actually, I tried multiprocessing first, but it seems CUDA doesn't allow multiple processes to decode on the GPU.
For decode + classify in multiple threads, it seems it would create multiple models on the GPU, which consumes a lot of memory.
I tried multiprocessing from both pure Python and PyTorch.
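For reference, a CUDA context created in the parent cannot be reused in a fork()ed child, which often looks like "CUDA doesn't allow multiprocessing". A minimal sketch, assuming the "spawn" start method and a hypothetical `build_decoder` helper that is called inside each worker:

```python
# CUDA state cannot survive fork(), so child processes must use the
# "spawn" start method and create the decoder (and model) inside the worker.
import torch.multiprocessing as mp

def build_decoder(video_path: str, gpu_id: int):
    # Hypothetical placeholder for constructing the VPF decoder; the
    # point is that this happens inside the child process.
    return None

def worker(video_path: str, gpu_id: int) -> None:
    decoder = build_decoder(video_path, gpu_id)
    # ... decode loop would go here ...

if __name__ == "__main__":
    mp.set_start_method("spawn")  # needed before any CUDA use in children
    procs = [mp.Process(target=worker, args=(v, 0)) for v in ("a.mp4", "b.mp4")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```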
have you identified having everything in one thread as a bottleneck?
Actually, no, a single thread is fine for me now. For my application, I can create many decoders, push the results into a fixed-size list, and then run the PyTorch model on that list (see the sketch below).
I just wonder if there is a way to use multiple threads to optimize further (I suppose?).
Thanks.
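A minimal sketch of that fixed-size-buffer pattern, with a single shared model so the per-thread-model memory problem mentioned above does not appear; `decode_next_frame` is again a hypothetical stand-in for the VPF decode call. Note that the consumer thread is started before the producers are joined, which is the usual fix when the classify thread appears to only run after decoding finishes:

```python
# Producer/consumer sketch: N decode threads feed one bounded queue,
# a single classifier thread drains it with one shared model.
import queue
import threading
import torch
import torchvision

frame_queue = queue.Queue(maxsize=32)                # fixed-size buffer
model = torchvision.models.resnet50().eval().cuda()  # one model, shared
SENTINEL = None

def decode_next_frame(video_path):
    # Hypothetical stand-in for the VPF decode call; returns a CUDA
    # tensor of shape (3, 224, 224), or None at end of stream.
    return None

def producer(video_path: str) -> None:
    while True:
        frame = decode_next_frame(video_path)
        if frame is None:
            break
        frame_queue.put(frame)  # blocks while the buffer is full

def consumer(n_producers: int) -> None:
    finished = 0
    while finished < n_producers:
        item = frame_queue.get()
        if item is SENTINEL:
            finished += 1
            continue
        with torch.no_grad():
            _ = model(item.unsqueeze(0))

videos = ["a.mp4", "b.mp4"]
classify = threading.Thread(target=consumer, args=(len(videos),))
classify.start()  # start the consumer BEFORE joining the producers

producers = [threading.Thread(target=producer, args=(v,)) for v in videos]
for t in producers:
    t.start()
for t in producers:
    t.join()
for _ in videos:
    frame_queue.put(SENTINEL)  # one end-of-stream marker per producer
classify.join()
```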
Hi. I want to use multithreading to decode and classify with PyTorch.
I tried the SampleDecode multi-threading and SamplePyTorch ResNet samples separately, successfully.
I tried to push decoded images to a queue and have PyTorch read from that queue. I created several threads to decode and one thread to classify, but the classify thread doesn't start until all of the decode threads have finished.
Can you make an example of how to do that? That would be very helpful.
Below is my code from when I tried the mentioned approach.