First off, great job on this project. I was averaging about 180 seconds to run inference on test12. I noticed some optimizations that could be made in wav2vec.py, wav2vecDS.py, and inference.py. I still have more to do on inference.py, but inference now completes in 64 seconds. I'm using an older GPU (RTX 2060), so a newer GPU may get this closer to real time. I also added a check that bypasses frame extraction when the frames have already been extracted. This suits my use case, where I want to run inference repeatedly on only a few videos. When extraction is skipped, processing takes 36 seconds. If you're interested, I can share the code.
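To give an idea of the frame-extraction bypass, here is a minimal sketch of that kind of check. It is not the actual change to inference.py: the names `frames_already_extracted` and `ensure_frames`, the `*.png` pattern, and the `extract` callback are all hypothetical, and the project's real extraction function and frame naming may differ.

```python
from pathlib import Path
from typing import Callable


def frames_already_extracted(frames_dir: Path, pattern: str = "*.png") -> bool:
    """Return True if frames_dir exists and holds at least one matching file.

    Assumption: frames are stored as image files matching `pattern`.
    """
    return frames_dir.is_dir() and any(frames_dir.glob(pattern))


def ensure_frames(
    video_path: Path,
    frames_dir: Path,
    extract: Callable[[Path, Path], None],
) -> bool:
    """Run `extract` only when no cached frames are found.

    `extract` is a placeholder for whatever function writes the frames of
    `video_path` into `frames_dir`. Returns True if extraction ran, False
    if it was skipped because frames were already present.
    """
    if frames_already_extracted(frames_dir):
        return False
    frames_dir.mkdir(parents=True, exist_ok=True)
    extract(video_path, frames_dir)
    return True
```

Note that this simple check treats any non-empty frames directory as complete, so a partially extracted or outdated directory would also be reused. Comparing the frame count or the video's modification time would be a stricter test if that matters.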