dk-liang / CLTR

[ECCV 2022] An End-to-End Transformer Model for Crowd Localization
MIT License

How to understand Object Queries in Crowd Counting task? #23

Closed by congyi-lcy 11 months ago

congyi-lcy commented 1 year ago

I would like to ask the author how object queries should be understood here.

In DETR, the object queries are the inputs to the transformer decoder, which produces N predictions in a single pass. N is a preset integer chosen to be larger than the number of objects expected in an image, and it equals the number of object queries.

In crowd counting, however, an image often contains thousands or even tens of thousands of people, yet the author sets the number of object queries to 700 or 500. How should I understand this? Following the DETR definition, shouldn't it be set to several thousand? That seems strange to me. Could the author share his understanding of object queries? I would appreciate it.

Faisal-Hajari commented 11 months ago

Hi, the way I see it, the code actually crops each image into 12 crops, each of size 256x256, so any image you have is effectively treated as a batch of images. Your assumption about how the queries work should hold within each crop, i.e., in each crop you can predict at most 700/500 people.
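To make the per-crop argument concrete, here is a minimal sketch of the arithmetic. It is not code from the repository: the function names `patch_grid` and `max_predictions` are hypothetical, and it assumes the image is padded up to a multiple of 256 and split into non-overlapping 256x256 tiles, each with its own query budget.

```python
import math


def patch_grid(height, width, patch=256):
    """Rows and columns of patch x patch tiles after padding up to a multiple of `patch`."""
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows, cols


def max_predictions(height, width, num_queries=500, patch=256):
    """Upper bound on predicted points when every tile has its own `num_queries` slots."""
    rows, cols = patch_grid(height, width, patch)
    return rows * cols * num_queries


rows, cols = patch_grid(768, 1024)
print(rows, cols, max_predictions(768, 1024))  # 3 4 6000
```

Under these assumptions a 768x1024 image gives 12 tiles, so the whole image can hold up to 6000 predictions even though each tile is capped at 500.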

dk-liang commented 11 months ago

Thanks for the clarification

xcaizewu commented 3 months ago

I believe you haven't carefully reviewed the code. During training, the author requires the number of GT points in each cropped 256x256 image to be greater than 0 and less than 500; otherwise the image is re-cropped. During testing, although the author pads the sides that are not evenly divisible by 256 and then crops the image into multiple 256x256 patches, each patch is still limited to a maximum of 500 queries. I think this is a problem. @Faisal-Hajari Can you explain that? @dk-liang
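The training-time rule described above amounts to rejection sampling over crops. The following is a rough sketch of that idea only, not the repository's implementation: `sample_valid_crop`, its parameters, and the retry limit are all hypothetical, and it assumes the image is at least 256 pixels on each side and that GT points are `(x, y)` pixel coordinates.

```python
import random


def sample_valid_crop(points, height, width, patch=256,
                      max_points=500, max_tries=1000, seed=None):
    """Draw random patch x patch crops until one holds 0 < n < max_points GT points."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        # Top-left corner of the candidate crop.
        x0 = rng.randint(0, width - patch)
        y0 = rng.randint(0, height - patch)
        # Keep the points inside the crop, shifted to crop-local coordinates.
        inside = [(x - x0, y - y0) for x, y in points
                  if x0 <= x < x0 + patch and y0 <= y < y0 + patch]
        if 0 < len(inside) < max_points:
            return x0, y0, inside
    raise RuntimeError("no crop satisfied the point-count constraint")


x0, y0, kept = sample_valid_crop([(100, 100), (300, 300)], 512, 512, seed=0)
print(len(kept))
```

A filter like this guarantees that training crops never exceed the query budget, but it places no such guarantee on test-time patches, which is the concern raised in this comment.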

dk-liang commented 3 months ago

The number of queries is a hyperparameter. In practice, we find that nearly all cropped patches contain fewer than 500 people, so this setting covers most dense cases. We also think that designing a dynamic number of queries is a promising direction.
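One way to check this claim on your own data is to count how many patches exceed the query budget and how many people would necessarily be missed in them. The sketch below is only an illustration: `saturation_report` is a hypothetical helper, and the per-patch counts in the example are made up, not measurements from any dataset.

```python
def saturation_report(patch_counts, num_queries=500):
    """Summarise how often per-patch GT counts exceed the fixed query budget."""
    over = [c for c in patch_counts if c > num_queries]
    # A patch with c > num_queries people must miss at least c - num_queries of them.
    missed = sum(c - num_queries for c in over)
    fraction = len(over) / len(patch_counts) if patch_counts else 0.0
    return {
        "patches_over_budget": len(over),
        "fraction_over_budget": fraction,
        "min_missed_people": missed,
    }


# Made-up per-patch counts, for illustration only.
print(saturation_report([12, 480, 730, 0]))
# {'patches_over_budget': 1, 'fraction_over_budget': 0.25, 'min_missed_people': 230}
```

If the fraction over budget is close to zero on the target dataset, a fixed budget of 500 loses very little; if not, a larger or dynamic number of queries would be needed.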

xcaizewu commented 3 months ago

Thank you, I understand now.