GuSangmo opened this issue 1 year ago
Hi Sangmo,
Yes, you can use a tracking algorithm to assign the same `entity_id` to all the face-crops of each person.

I think you can refer to Ego4D's starter code to see how it tracks a face-crop across frames. It uses a short-term tracking algorithm to link face-crops that share the same `entity_id`. There are many real-time algorithms for this type of short-term tracking.
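As a rough illustration (my own minimal sketch, not the actual Ego4D starter code), a greedy IoU matcher is often enough for this kind of short-term linking:

```python
import itertools

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

_new_id = itertools.count()

def link_faces(prev_tracks, boxes, iou_thresh=0.5):
    """Greedily link each detected face box to the previous frame's track
    with the highest overlap; start a new entity_id when nothing matches.

    prev_tracks: dict mapping entity_id -> box from the previous frame.
    Returns the updated entity_id -> box dict for the current frame.
    """
    tracks = {}
    for box in boxes:
        candidates = [(iou(box, pb), eid) for eid, pb in prev_tracks.items()
                      if eid not in tracks]
        best_iou, best_id = max(candidates, default=(0.0, None))
        if best_id is None or best_iou < iou_thresh:
            best_id = next(_new_id)  # start a new entity
        tracks[best_id] = box
    return tracks
```

In practice you would run a face detector on each frame and carry `tracks` forward; SORT-style trackers add a motion model on top of this same idea.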
I hope this helps.
Thank you, Kyle
I really appreciate your tips, Kyle! Could I ask you one more question?
I tried to extract feature maps with the STE (resnet18-tsm-aug) weights you provided, but I couldn't reproduce results as accurate as yours.

I used `AudioVideoDatasetAuxLossesForwardPhase(..., clip_length=11, target_size=(144, 144))` and `resnet18_two_streams_forward(rgb_stack_size=11)` from `models_stage1_tsm.py` and `STE_forward.py`.

I suspect my data loading may differ slightly from your TSM inference code, since using the original `STE_forward.py` produces identical data for the first few samples:
```python
# This way dataset[0] and dataset[2] end up with the same video & audio
# data, because ASC's loading logic requires padding.
dataset = AudioVideoDatasetAuxLossesForwardPhase(...)
audio_data, video_data, .... = dataset[idx]
```
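For completeness, this is roughly how I am running the extraction (a simplified sketch of my own loop; the checkpoint filename, return order of `dataset[idx]`, and the forward signature are my assumptions, not the exact code in `STE_forward.py`):

```python
import torch
from models_stage1_tsm import resnet18_two_streams_forward

# `dataset` is the AudioVideoDatasetAuxLossesForwardPhase instance above.
model = resnet18_two_streams_forward(rgb_stack_size=11)
# Placeholder checkpoint path; assumes the file holds a plain state dict.
model.load_state_dict(torch.load('resnet18-tsm-aug.pth'))
model.eval()

with torch.no_grad():
    for idx in range(len(dataset)):
        # Assumed return order: audio first, video second.
        audio_data, video_data = dataset[idx][:2]
        feats = model(audio_data.unsqueeze(0), video_data.unsqueeze(0))
```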
Was there any slight modification to the data loading, or to `STE_forward.py`?

(The bbox feature concatenation part seems to live in the SPELL logic, I guess.)

It would be a big help to know whether there is some trick involved or a mistake on my end; otherwise I will just work with the given pretrained weights.
Best regards, Sangmo
Hello @GuSangmo, I'm also working on adapting this code for real-time use. Did you manage to do it?
@hugoobui Sorry for the late reply. I couldn't do it, so I chose another model (LightASD).

In the case of GNN-postprocessing models (e.g., EASEE, SPELL), I think bbox tracking has to be done in advance to build the graph for postprocessing, and I couldn't reproduce the feature-encoder part ;-)

My task was due 23 August, so there may have been improvements in this domain since then. LightASD is much faster, but I also found that additional engineering is necessary (i.e., waiting for 5 frames to be buffered before the model can encode them; see the sketch below)! I hope you can tackle this problem.
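For the buffering I mean, something like this minimal sketch is what I had in mind (the names are mine, not from LightASD):

```python
from collections import deque

class ClipBuffer:
    """Hold the most recent frames until a short clip is ready to encode."""

    def __init__(self, clip_len=5):
        self.frames = deque(maxlen=clip_len)

    def push(self, frame):
        """Add a frame; return the last `clip_len` frames once enough
        temporal context has accumulated, else None."""
        self.frames.append(frame)
        if len(self.frames) == self.frames.maxlen:
            return list(self.frames)
        return None
```

Best regards, Sangmo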
Hi Kyle, thank you for such a nice paper! I really learned a lot from your work.
Currently, I am trying to adapt your model for inference on the ASD task (for an arbitrary video, with no annotations for any bbox or entity).

As mentioned in #4, this code assumes that face bboxes and audio-visual features are produced by other models (in this case, your model used a 2D ResNet with TSM for feature extraction).

I understand that SPELL works on pre-built graph data, where each vertex is identified by (video_id, entity_id, timestamp). Could you give me some advice on how to get the entity_id at inference time?

I tried to integrate L186~L223 of data_loader.py into my inference loop as you mentioned, but it seems to require an entity_id for each node. I thought adding a tracker would help, but I am curious whether modifying some of your code could handle this instead.
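For reference, here is the shape of the integration I have in mind (a hypothetical sketch of my own; `tracked_detections` would come from whatever tracker supplies the entity_id, not from data_loader.py):

```python
def build_spell_nodes(video_id, tracked_detections):
    """Group tracked face features into SPELL-style vertices keyed by
    (video_id, entity_id, timestamp).

    tracked_detections: iterable of (timestamp, entity_id, feature)
    tuples, e.g. the per-frame output of a short-term face tracker.
    """
    nodes = {}
    for timestamp, entity_id, feature in tracked_detections:
        nodes[(video_id, entity_id, timestamp)] = feature
    return nodes
```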
Thank you so much, Sangmo