amazon-science / tubelet-transformer

This is an official implementation of TubeR: Tubelet Transformer for Video Action Detection
https://openaccess.thecvf.com/content/CVPR2022/supplemental/Zhao_TubeR_Tubelet_Transformer_CVPR_2022_supplemental.pdf
Apache License 2.0

Question about the DETR pretraining process #18

Open jinsingsangsung opened 1 year ago

jinsingsangsung commented 1 year ago

Thanks for the impressive work. I have one question about the pretraining process of DETR, which you mention here: https://github.com/amazon-science/tubelet-transformer#training

From here (https://github.com/amazon-science/tubelet-transformer/issues/4#issuecomment-1236167059), I gather that you took the DETR weights trained on the COCO dataset and re-trained them on AVA to detect human instances.

  1. Could you describe this process in more detail? (e.g., how did you modify the DETR architecture to detect only humans, and what exactly were the input, the position embedding, etc.?)
  2. Was the intention of this pretraining to let the queries focus more on classification once TubeR's DETR-style architecture has learned to localize actors well enough?
  3. Have you tried training the whole architecture without the pretrained DETR weights? I've tried several times but could not find a configuration that makes the model actually learn.
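For context on question 1, here is one plausible way the class head could be adapted; this is a hypothetical sketch I put together, not code from this repo. It takes a COCO-pretrained DETR classification head (91 classes + "no object") and builds a person-only head by copying the rows for the person class and the no-object logit, which is a common way to reuse pretrained weights when narrowing the label set:

```python
# Hypothetical sketch (not the authors' code): narrowing a COCO-pretrained
# DETR classification head to a person-vs-no-object detector.
import torch
import torch.nn as nn

hidden_dim, num_coco_classes = 256, 91  # DETR: 91 COCO classes + 1 "no object"
# Stand-in for the pretrained class_embed layer of DETR.
coco_head = nn.Linear(hidden_dim, num_coco_classes + 1)

# New head with only two logits: person and "no object".
person_head = nn.Linear(hidden_dim, 2)
with torch.no_grad():
    # COCO category id 1 is "person"; the last logit is "no object".
    person_head.weight.copy_(coco_head.weight[[1, -1]])
    person_head.bias.copy_(coco_head.bias[[1, -1]])

# Decoder output for 100 object queries -> per-query person/no-object logits.
queries = torch.randn(1, 100, hidden_dim)
logits = person_head(queries)
print(logits.shape)  # torch.Size([1, 100, 2])
```

Whether you did something like this, kept the full 92-way head and only supervised the person class, or fine-tuned everything end-to-end on AVA boxes is exactly what I'm hoping you can clarify.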

Thanks in advance.