Curious about the implementation of the SeqAE

Hi,

Thank you for the impressive work; the tokenization results are very impressive. However, I have a question about a specific detail in your paper.

It states in the paper, "In SeqAE, raw segment pixels and masks are flattened into data sequences to maximize the use of context length."

I want to make sure I understand this correctly. Does it mean that each image segment is zero-padded into a rectangle with width * height = 1024, that all of its pixels are then flattened into a sequence, and that any pixels beyond the context length are simply dropped? And is the encoder-decoder then trained on those sequences with an RGB loss on the pixels?
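To make the question concrete, here is a minimal sketch of what I have in mind. It is only my guess at the procedure, not something taken from the paper or the repository: the function names `flatten_segment` and `rgb_loss`, the bounding-box crop, the 4-channel (RGB + mask) layout, and the masked MSE loss are all assumptions on my side.

```python
import numpy as np

CONTEXT_LENGTH = 1024  # assumed context length (from width * height = 1024)


def flatten_segment(image, mask, context_length=CONTEXT_LENGTH):
    """Hypothetical helper: turn one segment into a fixed-length sequence.

    image: (H, W, 3) uint8 array, mask: (H, W) boolean array.
    Returns a (context_length, 4) float32 sequence (RGB + mask flag)
    and a boolean vector marking which positions hold real data.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")

    # Crop to the segment's bounding box (assumption).
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop_mask = mask[y0:y1, x0:x1]
    crop = image[y0:y1, x0:x1].astype(np.float32) / 255.0

    # Zero out pixels inside the box that are outside the segment.
    crop = crop * crop_mask[..., None]

    # Flatten pixels and mask row by row into one sequence.
    seq = np.concatenate(
        [crop.reshape(-1, 3), crop_mask.reshape(-1, 1).astype(np.float32)],
        axis=1,
    )

    # Drop everything beyond the context length, then zero-pad short ones.
    seq = seq[:context_length]
    padded = np.zeros((context_length, 4), dtype=np.float32)
    padded[: len(seq)] = seq
    valid = np.zeros(context_length, dtype=bool)
    valid[: len(seq)] = True
    return padded, valid


def rgb_loss(pred, target, valid):
    """Mean squared error on the RGB channels of non-padding positions."""
    diff = (pred[:, :3] - target[:, :3]) ** 2
    return float(diff[valid].mean())
```

With this sketch, a 2x3 segment would occupy the first 6 positions and the rest would be padding, while a 40x40 segment would lose its last 576 pixels. Is that close to what SeqAE actually does?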

Thank you again for your time.

Sincerely, ZiAng

ChenDelong1999 / subobjects
