albanie / collaborative-experts

Video embeddings for retrieval with natural language queries
https://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/
Apache License 2.0

About Features #3

Open zouying-sjtu opened 4 years ago

zouying-sjtu commented 4 years ago

Could you release the way you got these features, including ocr, face, audio...

zouying-sjtu commented 4 years ago

Or, could you please give GitHub links related to these feature extraction methods?

albanie commented 4 years ago

Hi @zouying-sjtu, to extract the features we used pretrained models released by the following codebases:

Lastly, some of the remaining pre-extracted features were shared by MoEE. The author also has a public feature extraction pipeline which you may be interested in here. There are some descriptions of the feature extraction methods at the end of the arxiv version of the paper here, in case that's useful.

zouying-sjtu commented 4 years ago

@albanie Hello, I am not quite sure about the speech and ocr features. My guess is that you extract sentences/words from speech or ocr, then use word2vec to extract the features (because I found that the dimension of these two features is 300 in the didemo folder). And for captions, you extract features from openai_gpt (because I found that the dimension is 768)? Am I right?

albanie commented 4 years ago

Hi @zouying-sjtu, yes those are both correct! Sorry for lack of clarity.
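For anyone reading along, the word-level step confirmed above (words from speech or ocr mapped to word2vec vectors) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the released features: `embed_words` and `toy_table` are hypothetical names, the toy table stands in for a real word2vec model, its vectors are 4-dimensional rather than 300, and skipping out-of-vocabulary words is an assumption.

```python
def embed_words(text, table, dim):
    """Map each in-vocabulary word of `text` to its vector.

    Returns a list with one `dim`-sized vector per recognised word.
    """
    vectors = []
    for word in text.lower().split():
        vec = table.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are skipped
        if len(vec) != dim:
            raise ValueError(f"expected {dim} dims for {word!r}, got {len(vec)}")
        vectors.append(vec)
    return vectors


# Hypothetical stand-in for a word2vec lookup table (real vectors would be 300-d).
TOY_DIM = 4
toy_table = {
    "exit": [0.1, 0.0, 0.3, 0.2],
    "sign": [0.0, 0.5, 0.1, 0.4],
}

ocr_text = "EXIT sign ahead"  # hypothetical text read from a frame
word_feats = embed_words(ocr_text, toy_table, TOY_DIM)
```

Here `word_feats` holds one vector per recognised word ("ahead" is not in the toy table, so it is dropped). With a real word2vec model each vector would be 300-dimensional, which matches the dimension noted for the speech and ocr features.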

escorciav commented 4 years ago
  1. Any chance to release code to extract all those features? Maybe along the lines of disclosing details of the feature extraction rather than functional scripts that you have to maintain.

  2. Could you please confirm whether the features provided are timestamped (and if so, at what resolution) or already aggregated?

BTW, kudos for your codebase, it looks very clean and tidy :100:

P.S. Sorry for jumping in here.

albanie commented 4 years ago

Hi @escorciav,

We have a copy of the raw features (i.e. extracted densely from each frame), but in most settings we use aggregated versions (and these were what we uploaded) to avoid the download size getting too huge for the larger datasets. Happy to share the raw features if that's more useful? Features with the word max or avg in the name have been aggregated along the temporal dimension.
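To make the max/avg naming concrete, temporal aggregation of raw per-frame features could look roughly like the sketch below. It is a minimal illustration, not the repo's actual code: `aggregate_temporal` is a hypothetical helper, the toy features are made up, and plain Python lists stand in for arrays so the sketch has no dependencies.

```python
from statistics import fmean


def aggregate_temporal(frame_features, mode):
    """Collapse a (num_frames x dim) list of per-frame vectors into one dim-sized vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    # Transpose so that each tuple holds one feature dimension across all frames.
    columns = list(zip(*frame_features))
    if mode == "avg":
        return [fmean(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical raw features: 3 frames, 4 dimensions each.
raw = [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 1.0, 0.0, 3.0],
    [2.0, 4.0, 1.0, 3.0],
]
avg_feat = aggregate_temporal(raw, "avg")
max_feat = aggregate_temporal(raw, "max")
```

Either way the temporal axis disappears, so an aggregated feature has a fixed size regardless of video length. That is why the aggregated files are much smaller than the raw ones.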

In terms of the feature extraction, the scripts aren't included in the current repo because we have some fairly weird filesystem issues at the moment and a significant fraction of the code I have written is purely devoted to handling the filesystem problems :( I didn't want to maintain it as a public codebase because (understandably) people would find it confusing/painful to use and I wouldn't be able to provide support.

I have config files which describe things like frame-rate, image size etc. which I can share if helpful (but realistically they might also cause more issues without their precise definitions). We are also in the process of releasing a new version of the codebase (in which we hope to include the feature extraction), but realistically that wouldn't be before CVPR. Sorry I can't be more helpful.