miguelcarvtalka opened 1 day ago

How do you ensure that the number of tokens doesn't surpass the maximum token length defined for the model? In the case of the Llama 3.2 1B decoder, the maximum token length seems to be 16k, but nowhere in the paper is a maximum number of tokens for the video specified. Everything seems to be threshold based, so it seems entirely possible to exceed the context window even after STC, right? What do you do if, even after STC, the context still exceeds the maximum defined in the config file?
Hi @miguelcarvtalka, if the number of tokens after compression still exceeds the context length, we force-truncate the excess tokens in each sliding window, as implemented here. For our reported results, we set model_max_length to 8k (8192) for a fair comparison with the baselines.
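For illustration, here is a minimal sketch of what this kind of per-window force-truncation could look like; the function name, the proportional split, and the way the token budget is passed in are assumptions for the example, not the actual code linked above:

```python
import torch

# Illustrative sketch (not the repo's actual implementation): given the visual
# tokens of each sliding window and a hard budget (e.g. model_max_length = 8192
# minus the text tokens), drop the overflow proportionally from every window.
def truncate_windows(window_tokens, max_visual_tokens):
    """window_tokens: list of [n_i, hidden] tensors, one per sliding window."""
    total = sum(t.shape[0] for t in window_tokens)
    if total <= max_visual_tokens:
        return window_tokens                    # nothing to truncate
    keep_ratio = max_visual_tokens / total      # keep the same fraction in every window
    truncated = []
    for t in window_tokens:
        keep = max(1, int(t.shape[0] * keep_ratio))
        truncated.append(t[:keep])              # force-truncate the trailing tokens
    return truncated

# toy usage: three windows of 4000 / 3000 / 3000 tokens squeezed under 8192
windows = [torch.randn(n, 1024) for n in (4000, 3000, 3000)]
print([t.shape[0] for t in truncate_windows(windows, 8192)])  # [3276, 2457, 2457]
```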
Thank you for your reply! Another question: is there a way for the model to understand which tokens are low-res images and which tokens are the output of the STC module? In other words, is there a way for the model to distinguish whether a token belongs to a full image or not?
For example, just off the top of my head, you could include extra learned tokens that delimit the full image or the output of the STC module (the tokens that changed relative to the first frame in the window); you could also sum learned embeddings onto those tokens, roughly along the lines of the sketch below...
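Just to make the second idea concrete, something along these lines (the class name, shapes, and type ids are made up; this is only a sketch of what I mean):

```python
import torch
import torch.nn as nn

# Sketch of the "sum learned embeddings" idea: give every visual token a type id
# (0 = token from a full low-res frame, 1 = token kept by the STC module) and add
# a learned per-type embedding before the tokens enter the decoder.
class VisualTokenTypeEmbedding(nn.Module):
    def __init__(self, hidden_size: int, num_types: int = 2):
        super().__init__()
        self.type_embed = nn.Embedding(num_types, hidden_size)

    def forward(self, visual_tokens: torch.Tensor, type_ids: torch.Tensor) -> torch.Tensor:
        # visual_tokens: [batch, num_tokens, hidden]; type_ids: [batch, num_tokens] in {0, 1}
        return visual_tokens + self.type_embed(type_ids)

# toy usage: the first 100 tokens come from a full frame, the remaining 60 from STC output
tokens = torch.randn(1, 160, 1024)
type_ids = torch.cat([torch.zeros(1, 100, dtype=torch.long),
                      torch.ones(1, 60, dtype=torch.long)], dim=1)
marked = VisualTokenTypeEmbedding(hidden_size=1024)(tokens, type_ids)
```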