SilvioGiancola / SoccerNetv2-DevKit

Development Kit for the SoccerNet Challenge
MIT License

Questions Regarding Baidu Embeddings #51

Closed by yur1xpp 2 years ago

yur1xpp commented 2 years ago

Hi Silvio, I have some questions about the Baidu embeddings and wonder if you have any information about them. I couldn't find answers to these questions in either their GitHub repo or the published paper:

  1. Have the Baidu embeddings already gone through PCA, or are they still "raw" features? I noticed they have dimension Tx8576, which probably means they are still "raw". Am I right about this?
  2. In your opinion, if I were to reduce the dimension, would it be better to reduce them with PCA, or to pass them through a fully connected layer (FCL) as in the implementation of TemporallyAwarePooling? https://github.com/SilvioGiancola/SoccerNetv2-DevKit/blob/20f2f74007c82b68a73c519dff852188df4a8b5a/Task1-ActionSpotting/TemporallyAwarePooling/src/model.py#L32
  3. If they are still raw, do you have any idea what the initial dimension was before they were flattened to 8576? They used 398x224 video as mentioned in the paper, but it's not possible to reshape the features to that. I was thinking I could use them in a video-transformer-based architecture (MViT etc.) if I'm able to reshape them to the original dimension.
  4. Is any of their fine-tuned feature extraction code publicly available? I think not, but I'm asking anyway in case you know of any, since very little public information is available about their embeddings.

Thanks!

SilvioGiancola commented 2 years ago

Hi @yur1xpp

  1. AFAIK, those are "raw" features, not PCA'ed.
  2. If you have memory to spare to train your architecture, I would go for an FCL, as it drops the orthogonality constraint of PCA. In contrast, PCA is fully unsupervised, so you can fit it offline before using the PCA'ed features in the rest of your network.
  3. The dimension 8576 comes from the concatenation of 5 or 6 different features from different encoders. You will need to analyze that structure before getting back to a 2D map. Also, I believe not all features are extracted right after the flattening layer; some might pass through extra FC layers.
  4. Not that I am aware of. You might want to raise that issue on Baidu's GitHub page and put pressure on whoever extracted the custom features to release the code publicly, both for feature extraction on new videos and for fine-tuning the video encoder on soccer videos.
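
To illustrate point 3, here is a minimal sketch of cutting a concatenated feature matrix back into per-encoder blocks. The helper name `split_concatenated` and the block sizes are hypothetical: the real sizes of the 5 or 6 blocks that make up the 8576 dimensions are not known here, so the example uses a toy 9-dimensional feature instead.

```python
import numpy as np


def split_concatenated(features, block_sizes):
    """Split a (T, D) feature matrix into per-encoder blocks along axis 1.

    block_sizes is an assumption supplied by the caller; it must sum to D.
    """
    features = np.asarray(features)
    if sum(block_sizes) != features.shape[1]:
        raise ValueError("block sizes must sum to the feature dimension")
    # Split points are the cumulative sizes, excluding the final total.
    boundaries = np.cumsum(block_sizes)[:-1]
    return np.split(features, boundaries, axis=1)


# Toy stand-in for a (T, 8576) matrix: 2 frames, 9 features per frame.
toy = np.arange(2 * 9).reshape(2, 9)
# Placeholder sizes only, NOT the actual Baidu layout.
blocks = split_concatenated(toy, [4, 3, 2])
for i, block in enumerate(blocks):
    print(i, block.shape)
```

Each block could then be inspected or reshaped on its own once the true layout is known. Blocks that went through extra FC layers would not map back to a 2D feature map at all.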

I hope that helps!

yur1xpp commented 2 years ago

Thanks very much for those useful details & suggestions Silvio!

1-2. It looks like an FCL does indeed require a lot of memory, so PCA is probably the best strategy for me.

3-4. That's a good idea! I should open an issue on their repo and see how it goes. I was trying to extract my own features using their technique, but the details in their paper are quite vague.
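
For the PCA route from 1-2, a minimal sketch of fitting the projection offline and reusing it could look like the following. It uses plain NumPy SVD, and the function names `fit_pca` and `apply_pca`, the random stand-in data, and the sizes (64 reduced to 16) are all assumptions for illustration. The real features would be Tx8576 and the target dimension is a free choice.

```python
import numpy as np


def fit_pca(features, n_components):
    """Fit PCA on a (T, D) matrix; return the mean and the top components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are orthonormal principal directions, sorted by
    # decreasing singular value (i.e. decreasing explained variance).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(features, mean, components):
    """Project (T, D) features onto the fitted components -> (T, n_components)."""
    return (features - mean) @ components.T


# Random stand-in data; real inputs would be the (T, 8576) embeddings.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 64))
mean, components = fit_pca(train, n_components=16)
reduced = apply_pca(train, mean, components)
print(reduced.shape)
```

Because the fit is unsupervised, `mean` and `components` can be computed once on the training videos, saved, and applied to every other split before training the downstream network.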

Thanks again for the help!

yur1xpp commented 2 years ago

I found a closed issue on their repo that may answer some of these questions. Unfortunately, it looks like they're not planning to release the code publicly (at least according to a comment from a year ago). https://github.com/baidu-research/vidpress-sports/issues/4#issuecomment-941297411

SilvioGiancola commented 2 years ago

That's a pity, but I am sure that fine-tuning even simple encoders such as ResNet or an MViT using a similar trick would lead to a very similar boost in performance. If you have reproducible code that trains a better encoder, I would be happy to advertise it on our SoccerNet GitHub repos and websites, as it could be very useful for future development.