event-driven-robotics / four-dof-affine-tracking


Literature #1

Open arrenglover opened 1 year ago

arrenglover commented 1 year ago

Visual tracking is one of the earliest applications showcasing event-camera potential, in particular Delbruck's Robot Goalie~\cite{Delbrucksgoalie} and Conradt's pencil balancer~\cite{5457625}. Since then there has been a growing interest in, and body of work on, event-driven tracking. We are interested in the problem of continuous estimation of an object's state relative to the camera, in which both the camera and the object move in a textured and cluttered environment.

Tracking the camera's own motion has received much attention, driving the development of novel algorithms and methods~\cite{EVO, recent_slam?}. The task overlaps with object tracking, i.e. time-evolving state estimation, and inspiration for event processing can therefore be drawn from that literature. However, as object tracking (by definition) involves a dynamic environment and requires segmentation of the target object, these algorithms need to be adapted to the target domain.

Much work also exists on tracking visual features, such as corners[] or learned features[]. Such features correspond to a set of spatially continuous patterns, rather than a complex or piecewise object appearance. Features can also be aliased across objects, and features are expected to adapt and disappear over time. The literature has shown that strong position priors are effective tracking assumptions that take advantage of event-camera characteristics (e.g. dense, continuous event trails), which should also hold true for object tracking.

A particular feature that arises naturally from event data, and has been used for object tracking, is the cluster. Clusters have been proposed for tracking objects by parts[], but more often for representing an entire object as a single cluster~\cite{Barranco2018, PiatkowskaPeopleTracking}. The latter works best if it can be guaranteed that objects are always spatially separate and never traverse in front of other texture in the scene, which is typically untrue for moving-camera scenarios.

Finally, generic object segmentation can be performed under simultaneous camera and object motion, based on the difference between background and object motion dynamics~\cite{MitrokhinsSegmentationMotion}. Such methods assume that the object is always in motion, with periods of constant velocity. When the robot is moving the camera to match speeds with the object, this assumption does not hold: the relative speed between camera and object tends to zero and, due to robot control, the velocity is inconsistent.

The above methods of camera motion tracking, cluster tracking, and velocity-vector-based segmentation are all unsuitable for the proposed application domain. We are therefore interested in the particular object-tracking sub-domain in which the target is known a priori, and constraints on the relative motion between camera, object, and environment are eliminated. In this sub-domain, our prior work tracked a circle (i.e. a ball) in position and size~\cite{gloversPF}, but was not designed to generalise well to other objects. Since then, other works that can potentially track a particular shape have employed off-the-shelf computer vision algorithms on custom event representations~\cite{ChensATSLTD}, e.g. using an ``EdgeBoxes'' detector. Additionally, trained networks~\cite{ZhangsHybridBBDetection, ChensATSLTDwithNetwork, ZhangsSpikingAffineEstimator} for object tracking can be selective to particular objects in the training set.

However, the works in~\cite{ChensATSLTD, ChensATSLTDwithNetwork, ZhangsSpikingAffineEstimator} still only consider camera motion against a static environment (in which the tracked objects are static). In this scenario the relative contrast between the object and the background remains constant, making the object appearance also constant. In a task in which the object moves relative to the background, the contrast strength, and hence the events produced, changes over time, making tracking more difficult. In addition, if the camera is tracking the object in closed-loop, the appearance of the object changes drastically as the relative motion between the camera and object goes to zero (and hence few events are produced).
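
This dependence on relative motion can be made concrete with the standard linearised event-generation model from the event-camera literature (background material, not a model proposed in the cited works). An event fires at pixel $\mathbf{x}$ when the log-intensity change exceeds the contrast threshold $C$:
\[
|\Delta L(\mathbf{x}, t)| \geq C, \qquad \Delta L(\mathbf{x}, t) \approx -\nabla L(\mathbf{x}, t) \cdot \mathbf{v}(\mathbf{x}) \, \Delta t,
\]
so the event rate scales with $|\nabla L \cdot \mathbf{v}|$: the local contrast projected onto the relative image-plane velocity $\mathbf{v}$. As the camera matches the object's speed, $\mathbf{v} \to 0$ on the object and its event signal vanishes, while the contrast against the background changes as the relative motion changes.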

In addition, for all of these works it is unclear whether the systems can run online and in real-time. Even when computation times are reported, pipeline bottlenecks often go unnoticed until an actual closed-loop robotics task is performed; for example, pre-processing the event stream may add significant latency under fast camera motion.

arrenglover commented 1 year ago
@ARTICLE{Delbrucksgoalie,
  author={Delbruck, Tobi and Lang, Manuel},
  title={Robotic goalie with 3 ms reaction time at 4\% CPU load using event-based dynamic vision sensor},
  journal={Frontiers in Neuroscience},
  volume={7},
  year={2013},
  url={https://www.frontiersin.org/articles/10.3389/fnins.2013.00223},
  doi={10.3389/fnins.2013.00223},
  issn={1662-453X}}

https://www.frontiersin.org/articles/10.3389/fnins.2013.00223/full

Assumes objects are clusters
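
For reference, the "objects are clusters" assumption reduces tracking to something like the following event-by-event update (a generic minimal sketch of cluster tracking, not Delbruck's actual implementation; the gating radius and update rate are assumed values):

import numpy as np

class ClusterTracker:
    # Each object is a single cluster centre, shifted by the events
    # that land within a gating radius of it.
    def __init__(self, radius=20.0, alpha=0.05):
        self.radius = radius      # gating radius in pixels (assumed)
        self.alpha = alpha        # running-mean update rate (assumed)
        self.centres = []         # list of np.array([x, y])

    def update(self, x, y):
        p = np.array([x, y], dtype=float)
        if self.centres:
            d = [np.linalg.norm(p - c) for c in self.centres]
            i = int(np.argmin(d))
            if d[i] < self.radius:
                # event supports an existing cluster: shift its centre
                self.centres[i] = (1 - self.alpha) * self.centres[i] + self.alpha * p
                return i
        # event far from all clusters: spawn a new one
        self.centres.append(p)
        return len(self.centres) - 1

The failure mode falls straight out of this structure: any background texture that enters the gating radius drags the centre away, so it only works while objects stay spatially separated from other event sources.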

arrenglover commented 1 year ago
@INPROCEEDINGS{Barranco2018,
  author={Barranco, Francisco and Fermuller, Cornelia and Ros, Eduardo},
  booktitle={2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, 
  title={Real-Time Clustering and Multi-Target Tracking Using Event-Based Sensors}, 
  year={2018},
  volume={},
  number={},
  pages={5764-5769},
  doi={10.1109/IROS.2018.8593380}}

https://ieeexplore.ieee.org/abstract/document/8593380

Shapes, but assumes clustering

arrenglover commented 1 year ago
@INPROCEEDINGS{gloversPF,
  author={Glover, Arren and Bartolozzi, Chiara},
  booktitle={2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, 
  title={Robust visual tracking with a freely-moving event camera}, 
  year={2017},
  volume={},
  number={},
  pages={3769-3776},
  doi={10.1109/IROS.2017.8206226}}

https://ieeexplore.ieee.org/document/8206226

Tracks a circle shape only, particle filter

arrenglover commented 1 year ago
@INPROCEEDINGS{5457625,
  author={Conradt, Jorg and Berner, Raphael and Cook, Matthew and Delbruck, Tobi},
  booktitle={2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops}, 
  title={An embedded AER dynamic vision sensor for low-latency pole balancing}, 
  year={2009},
  volume={},
  number={},
  pages={780-785},
  doi={10.1109/ICCVW.2009.5457625}}

https://ieeexplore.ieee.org/document/5457625

arrenglover commented 1 year ago
@InProceedings{ZhangsSpikingAffineEstimator,
    author    = {Zhang, Jiqing and Dong, Bo and Zhang, Haiwei and Ding, Jianchuan and Heide, Felix and Yin, Baocai and Yang, Xin},
    title     = {Spiking Transformers for Event-Based Single Object Tracking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {8801-8810}}

https://openaccess.thecvf.com/content/CVPR2022/html/Zhang_Spiking_Transformers_for_Event-Based_Single_Object_Tracking_CVPR_2022_paper.html

Can be specific to the object as it is trained. However, it is only evaluated on datasets with camera motion and static objects. There is clutter, but always the same surrounding clutter. No relative motion between the object and the environment; such motion creates different levels of contrast, changing the event-based signal of the object.

probably not real-time

interesting as it uses a spiking transformer which is quite "SOTA"

arrenglover commented 1 year ago
@INPROCEEDINGS{PiatkowskaPeopleTracking,
  author={Piątkowska, Ewa and Belbachir, Ahmed Nabil and Schraml, Stephan and Gelautz, Margrit},
  booktitle={2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops}, 
  title={Spatiotemporal multiple persons tracking using Dynamic Vision Sensor}, 
  year={2012},
  volume={},
  number={},
  pages={35-40},
  doi={10.1109/CVPRW.2012.6238892}}

https://ieeexplore.ieee.org/document/6238892

Persons can be considered objects. Assumes people are the only thing moving, and that each person forms a separable cluster.

arrenglover commented 1 year ago
@article{ChensATSLTDwithNetwork, 
title={End-to-End Learning of Object Motion Estimation from Retinal Events for Event-Based Object Tracking}, 
volume={34}, 
DOI={10.1609/aaai.v34i07.6625}, 
number={07}, 
journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 
author={Chen, Haosheng and Suter, David and Wu, Qiangqiang and Wang, Hanzi}, 
year={2020}, 
month={Apr.}, 
pages={10534-10541} }

https://ojs.aaai.org/index.php/AAAI/article/view/6625

Interesting as it is similar to ours in that it learns an affine transform. I guess the initial locations of objects are given, and then the network estimates \Delta X between consecutive "ATSLTD" frames.

Why bad? I don't think ATSLTD can be computed in real-time: they pre-process the event stream and create the images. The datasets only have camera motion, so the background and the objects move the same way; there is no relative motion. I think our scenario would trick the network and it would not work. It uses a network to do a simple motion estimate; why not use a convolution, which is cheaper and doesn't require training?
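
To make the convolution point concrete, a translation between two consecutive event images can be estimated with phase correlation, which is cheap and training-free (a hypothetical sketch; ts_prev and ts_curr are made-up names for consecutive ATSLTD-like images, and this handles translation only, not the full affine estimate):

import numpy as np

def estimate_shift(prev, curr):
    # Estimate the (dy, dx) translation from `prev` to `curr` by
    # phase correlation over the whole image.
    f = np.conj(np.fft.fft2(prev)) * np.fft.fft2(curr)
    corr = np.fft.ifft2(f / (np.abs(f) + 1e-9)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = corr.shape
    if dy > h // 2:               # unwrap circular shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx

# e.g. dy, dx = estimate_shift(ts_prev, ts_curr)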

Says it runs in real-time, but without full-pipeline testing in closed-loop robot experiments this can't be proven.

arrenglover commented 1 year ago
@inproceedings{ChensATSLTD,
author = {Chen, Haosheng and Wu, Qiangqiang and Liang, Yanjie and Gao, Xinbo and Wang, Hanzi},
title = {Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-Based Object Tracking},
year = {2019},
isbn = {9781450368896},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3343031.3350975},
doi = {10.1145/3343031.3350975},
booktitle = {Proceedings of the 27th ACM International Conference on Multimedia},
pages = {473–481},
numpages = {9},
location = {Nice, France},
series = {MM '19}
}

https://dl.acm.org/doi/10.1145/3343031.3350975 (PDF)

Uses a "adaptive time surface with linear time decay". New frames only use "new" events so there is no persistence. Decays the entire image based on temporal information -> slower than EROS's region of interest. requires a separate trigger to say when a frame is "ready". The trigger is based on average entropy of grid array placed over the ATSLTD - seems slow to compute too.

Uses a detector on the ATSLTD to propose object positions globally across the image; a tracker then chooses the data association over time. Greyscale frames are generated from events to perform failure recovery.

Makes an experiment comparing frame-based tracking methods on the event-driven images, but not against event-driven methods themselves. No ablation study. 30 ms per update = 33 Hz, but it is not clear whether this is real-time (i.e. they might have more than 30 images per second to achieve tracking).

The objects don't move relative to the background; only the camera moves.

arrenglover commented 1 year ago
@INPROCEEDINGS{MitrokhinsSegmentationMotion,
  author={Mitrokhin, Anton and Fermüller, Cornelia and Parameshwara, Chethan and Aloimonos, Yiannis},
  booktitle={2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, 
  title={Event-Based Moving Object Detection and Tracking}, 
  year={2018},
  volume={},
  number={},
  pages={1-9},
  doi={10.1109/IROS.2018.8593805}}

https://ieeexplore.ieee.org/document/8593805

This is the one that requires object motion to be different from the camera ego-motion.

arrenglover commented 1 year ago
@INPROCEEDINGS{ZhangsHybridBBDetection,
  author={Zhang, Jiqing and Yang, Xin and Fu, Yingkai and Wei, Xiaopeng and Yin, Baocai and Dong, Bo},
  booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)}, 
  title={Object Tracking by Jointly Exploiting Frame and Event Domain}, 
  year={2021},
  volume={},
  number={},
  pages={13023-13032},
  doi={10.1109/ICCV48922.2021.01280}}

https://ieeexplore.ieee.org/document/9710163

Uses frames and events; uses only the new events in each frame, but normalises them. Trained on objects for bounding-box regression, with a static camera. A neural network for detection and classification, less so for tracking; relies on frames a lot; probably not real-time.