SRA2 / SPELL

Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection (ECCV 2022)
MIT License

Identifying vertices at inference time #11

Open GuSangmo opened 1 year ago

GuSangmo commented 1 year ago

Hi Kyle, thank you for such a nice paper! I really learned a lot from your work.

Currently, I am trying to adapt your model for inference on the ASD task (for an arbitrary video, with no annotations for any bbox or entity).

As mentioned in #4, this code assumes that face bboxes and audio-visual features are produced by other models (in this case, your model uses a 2D ResNet with TSM for feature extraction).
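
For context, my understanding of TSM's core operation is that it just shifts a fraction of channels along the temporal axis, so a 2D ResNet can mix information across frames. Here is a rough sketch of that idea (my own simplified version, not your implementation; fold_div=8 follows the original TSM paper):

import torch

# Temporal Shift Module sketch: x has shape (batch * n_segments, C, H, W).
# A fraction of the channels is shifted one step forward/backward in time;
# the remaining channels stay in place.
def temporal_shift(x, n_segments, fold_div=8):
    nt, c, h, w = x.shape
    n_batch = nt // n_segments
    x = x.view(n_batch, n_segments, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # frame t receives channels from t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # frame t receives channels from t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels are unshifted
    return out.view(nt, c, h, w)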

I understand that SPELL works on pre-built graph data, where each vertex is identified by (video_id, entity_id, timestamp). Could you give me some advice on how to obtain entity_id at inference time?

I tried to integrate L186~L223 of data_loader.py into my inference loop as you mentioned, but it seems to require an entity_id for each node. I thought adding a tracker would help, but I am curious whether modifying some of your code could achieve this instead.
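
For reference, this is the kind of graph I am trying to build at inference time. It is a rough sketch in my own words, not your data_loader.py; the time-window connectivity rule and the field names are my assumptions based on the paper:

import numpy as np

# Sketch: each detected face-crop becomes a node keyed by
# (video_id, entity_id, timestamp); nodes of the same video that are
# close in time get connected (same-entity edges are temporal,
# cross-entity edges are spatial).
def build_graph(detections, time_window=0.9):
    # detections: list of dicts with keys
    # 'video_id', 'entity_id', 'timestamp', 'feature' (np.ndarray)
    detections = sorted(detections, key=lambda d: d['timestamp'])
    node_features = np.stack([d['feature'] for d in detections])
    edges = []
    for i, a in enumerate(detections):
        for j, b in enumerate(detections):
            if i == j or a['video_id'] != b['video_id']:
                continue
            if abs(a['timestamp'] - b['timestamp']) <= time_window:
                edges.append((i, j))
    edge_index = (np.asarray(edges, dtype=np.int64).T
                  if edges else np.empty((2, 0), dtype=np.int64))
    return node_features, edge_index

This is why I need entity_id: without it I cannot distinguish the temporal edges of one person from the spatial edges between different people.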

Thank you so much, Sangmo

kylemin commented 1 year ago

Hi Sangmo,

Yes, you can use a tracking algorithm to assign the same entity_id to all the face-crops of each person. You can refer to Ego4D's starter code to see how it tracks face-crops across frames: it uses a short-term tracking algorithm to link face-crops with the same entity_id. There are many real-time algorithms for this type of short-term tracking.
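
For illustration, a greedy IoU matcher is often enough for this kind of short-term linking. This is only a minimal sketch, not code from Ego4D or this repo, and the 0.5 IoU threshold is an arbitrary choice:

# Greedy IoU-based short-term tracker: face bboxes in consecutive frames
# that overlap enough inherit the same entity_id; unmatched boxes start
# a new entity.
def iou(a, b):
    # boxes are (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_entity_ids(frames, iou_thresh=0.5):
    # frames: list of per-frame bbox lists; returns per-frame entity_id lists
    next_id, prev, all_ids = 0, [], []
    for boxes in frames:
        ids, used = [], set()
        for box in boxes:
            # match to the best unused bbox from the previous frame
            best, best_iou = None, iou_thresh
            for k, (_, pbox) in enumerate(prev):
                if k in used:
                    continue
                v = iou(box, pbox)
                if v > best_iou:
                    best, best_iou = k, v
            if best is None:
                eid = f"entity_{next_id}"  # new person enters the scene
                next_id += 1
            else:
                eid = prev[best][0]  # carry the id over from the last frame
                used.add(best)
            ids.append(eid)
        prev = list(zip(ids, boxes))
        all_ids.append(ids)
    return all_ids

Greedy matching can swap identities when faces cross or occlude each other; the Hungarian algorithm or an appearance cue makes the linking more robust.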

I hope this helps.

Thank you, Kyle

GuSangmo commented 1 year ago

I really appreciate your tips, Kyle! Could you answer one more question?

I tried to extract feature maps with the STE (resnet18-tsm-aug) weights you provided, but I couldn't replicate your results accurately.

I used AudioVideoDatasetAuxLossesForwardPhase(..., clip_length=11, target_size=(144, 144)) and resnet18_two_streams_forward(rgb_stack_size=11) from models_stage1_tsm.py and STE_forward.py.

I suspect the data-loading procedure may differ slightly from your TSM inference code, since using the original STE_forward.py yields identical data for the first few samples.

# This way dataset[0] and dataset[2] have the same video & audio data,
# because the loading logic inherited from ASC requires padding
dataset = AudioVideoDatasetAuxLossesForwardPhase(...)
audio_data, video_data, ... = dataset[idx]
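
To make my suspicion concrete, here is a toy version of the boundary behavior I mean. It is only my own illustration with a symmetric clamped window; the actual ASC loader pads differently, which presumably explains the fully identical samples I see.

# Toy illustration (not the actual ASC loader): out-of-range frame indices
# are clamped to the video boundary, so clips near the start of a video
# share many repeated frames.
def padded_clip(center, clip_length, n_frames):
    half = clip_length // 2
    return [min(max(t, 0), n_frames - 1) for t in range(center - half, center + half + 1)]

print(padded_clip(0, 11, 100))  # [0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5]
print(padded_clip(2, 11, 100))  # [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7]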

Was there any slight modification to the data loading or to STE_forward.py? (Your bbox-feature concatenation part seems to live in the SPELL logic, I guess.)

It would be a big help to know whether there is some extra trick I am missing or whether the mistake is mine; otherwise, I will keep working with the given pretrained weights.

Best regards, Sangmo

hugoobui commented 1 year ago

Hello @GuSangmo, I'm also working on adapting this code for real-time use. Did you manage to do it?

GuSangmo commented 1 year ago

@hugoobui Sorry for the late reply. I couldn't do it, so I chose another model (LightASD).

In the case of GNN-postprocessing models (e.g. EASEE, SPELL), I think bbox tracking has to be done in advance in order to build the graph for postprocessing. And I couldn't reproduce the feature-encoder part ;-)

My task was due 23 August, so there may have been improvements in this domain since then. LightASD is much faster, but I also think additional engineering is necessary (i.e. you have to wait for 5 frames to be buffered before they can be fed to the model)! I hope you can tackle this problem. Best regards, Sangmo