-
Hi, when you extract RGBDiff, do you take frame t+1 minus frame t without downsampling and then downsample the difference, or do you first downsample the frames and then take the difference?
And do you f…
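To make the question concrete, here are the two orderings I mean (a minimal sketch using OpenCV; the function names are just for illustration):

```python
import cv2
import numpy as np

def diff_then_downsample(frame_t, frame_t1, size=(224, 224)):
    # Take the RGB difference at full resolution first, then resize.
    diff = frame_t1.astype(np.float32) - frame_t.astype(np.float32)
    return cv2.resize(diff, size)

def downsample_then_diff(frame_t, frame_t1, size=(224, 224)):
    # Resize both frames first, then take the RGB difference.
    return (cv2.resize(frame_t1, size).astype(np.float32)
            - cv2.resize(frame_t, size).astype(np.float32))
```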
-
-
# Project Request
Automatic captioning of cooking videos by understanding the actions in the scene, and providing nutritional information based on the ingredients used.
| Field | D…
-
These are the results I got on MSRVTT, which are far worse than the paper's results:
There must be something wrong in my test process; here is how I got them:
1. I've tried to run the text-…
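For reference, this is how I compute the reported numbers from the text-to-video similarity matrix (a minimal sketch of my own helper, not the repo's code):

```python
import numpy as np

def retrieval_metrics(sim):
    # sim: (num_texts, num_videos); ground-truth pairs sit on the diagonal.
    ranks = np.empty(len(sim))
    for i, row in enumerate(sim):
        order = np.argsort(-row)               # video indices, best score first
        ranks[i] = np.where(order == i)[0][0]  # 0-based rank of the true video
    return {"R@1": np.mean(ranks < 1) * 100,
            "R@5": np.mean(ranks < 5) * 100,
            "R@10": np.mean(ranks < 10) * 100,
            "MedR": np.median(ranks) + 1}
```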
-
Hello InternVideo team,
You guys have done a great job with this project!
In your paper, you use the Stage 2 model for the task of temporal grounding on QVHighlight [Lei et al., 2021] and Charad…
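For context, here is the generic recipe I assume for this kind of grounding (a sketch of my own understanding, not your actual pipeline): embed candidate temporal windows and the query sentence, then rank the windows by similarity.

```python
import torch
import torch.nn.functional as F

def rank_windows(window_embs, query_emb, top_k=5):
    # window_embs: (num_windows, D) features of sliding-window clips
    # query_emb:   (D,) feature of the query sentence
    sims = F.cosine_similarity(window_embs, query_emb.unsqueeze(0), dim=1)
    scores, idx = sims.topk(min(top_k, sims.numel()))
    return idx, scores  # best-matching temporal windows and their scores
```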
-
### Version
1
### DataCap Applicant
Xenogic
### Project ID
Xenogic027
### Data Owner Name
Google DeepMind
### Data Owner Country/Region
United States
### Data Owner Indus…
-
My goal is to build a unique multimodal WooCommerce search experience with Vespa multivectors and a hybrid ranking over BM25 text scores, text vectors, and image vectors.
For instance, e-commerce can use:
…
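Conceptually, the hybrid score I have in mind looks like this (a plain-Python sketch of the fusion idea only; the weights, the max-sim handling of the image multivector, and all names are my own illustration, not Vespa's rank-profile syntax):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_score(bm25_score, text_vec, image_vecs, q_text_vec, q_image_vec,
                 w_bm25=0.4, w_text=0.3, w_image=0.3):
    # Multivector image field: score against the closest image embedding,
    # mirroring how a nearest-neighbor match over multiple vectors behaves.
    image_sim = max(cosine(v, q_image_vec) for v in image_vecs)
    return (w_bm25 * bm25_score
            + w_text * cosine(text_vec, q_text_vec)
            + w_image * image_sim)
```

In Vespa itself this fusion would presumably live in a rank profile's ranking expression; the sketch above is only the scoring logic I want to express.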
-
Hi @linjieli222,
I was looking at the validation phase of the retrieval setup, and I see you have implemented two different variants: `validate` and `full_validate`. From my understanding, `vali…
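To make sure I'm reading it right, here is how I currently picture the difference (an illustrative sketch, not the repo's code): one variant ranks each text only against the videos in its own batch, while the other ranks each text against the entire validation set.

```python
import torch

def recall_at_1(text_emb, video_emb):
    # Ground-truth pairs share the same row index.
    sim = text_emb @ video_emb.T                 # (N, N) similarity matrix
    order = sim.argsort(dim=1, descending=True)  # videos sorted by score per text
    rank = order.argsort(dim=1).diagonal()       # rank of each text's true video
    return (rank == 0).float().mean().item()

# Batch-level: recall_at_1 per batch, then averaged -- each text competes with
# only batch_size - 1 negatives, so the numbers come out inflated.
# Full: embed the whole set first, then one recall_at_1 over all pairs.
```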
-
Hi,
My problem:
remote mp4 url structure:
http://url.hu/folder/folder/folder/file.mp4?token=randomhash
location ~ /remoteupstream/([^/]+)/([^/]+)/(.*) {
    resolver 8.8.8.8;
    internal;
    proxy_pa…
-
### Model description
[CLIP-ViP](https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP) is a video-language model based on the pre-trained image-text model [CLIP](https://openai.com/blog/c…