LJMUAstroecology / flirpy

Python library to interact with FLIR camera cores

RGB synchronization fails if there are more IR than RGB frames #54

Open LukasBommes opened 2 years ago

LukasBommes commented 2 years ago

I wanted to point out a bug in the split_seqs script. Synchronization between IR and RGB works fine as long as there are more RGB than IR frames. However, if there are more IR than RGB frames, the synchronization logic fails.

I ran into this issue after switching from the 8 Hz Flir Duo Pro R to the 30 Hz version. As the visual stream is at 29.87 Hz, there are more IR than RGB frames generated.
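For illustration, a minimal hypothetical sketch of symmetric matching logic that handles both cases (this is not flirpy's actual split_seqs code; it assumes both streams start together and run at constant rates):

```python
def match_streams(n_ir, n_rgb):
    """Return (ir_index, rgb_index) pairs by mapping each frame of the
    lower-rate stream onto the higher-rate stream by frame-count ratio,
    regardless of which stream has more frames."""
    if n_ir <= n_rgb:
        ratio = n_rgb / n_ir
        return [(i, min(round(i * ratio), n_rgb - 1)) for i in range(n_ir)]
    # Inverted case (n_IR > n_RGB): iterate over RGB frames instead.
    ratio = n_ir / n_rgb
    return [(min(round(j * ratio), n_ir - 1), j) for j in range(n_rgb)]
```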

jveitchmichaelis commented 2 years ago

Thanks for this, I'll take a look! The split code was built and tested with the 30Hz version in mind, so this may be a regression bug.

LukasBommes commented 2 years ago

If you want, I can send you my code in about two weeks (I'm currently on holiday). The logic is the same as in your code, just that I invert everything in case n_IR > n_RGB.

However, even after this fix, synchronization is still rather poor. For long sequences (~20k frames) the IR and RGB streams are up to 10 frames off in the middle of the sequence. Also, for my camera (Zenmuse XT2) the IR stream hangs once in a while when the camera performs recalibration.

I was thinking of using the frame timestamps (from EXIF tags) for synchronization. In the TIFF stack (which I use instead of SEQs) each frame has a millisecond-accurate walltime associated with it. For the RGB stream there is only a millisecond-accurate relative timestamp starting from zero. However, I have doubts about the timestamps of the IR stream, as the recalibration procedure does not show up in them.
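A hypothetical sketch of this timestamp-based alignment (assuming, as a simplification, that both streams start at the same walltime, so the RGB stream's relative clock can be anchored to the first IR frame's walltime; all timestamps in milliseconds):

```python
import bisect

def align_by_timestamp(ir_walltimes_ms, rgb_relative_ms):
    """For each IR frame, return the index of the RGB frame closest in time."""
    offset = ir_walltimes_ms[0]  # assumption: zero start offset between streams
    rgb_wall = [t + offset for t in rgb_relative_ms]
    matches = []
    for t in ir_walltimes_ms:
        j = bisect.bisect_left(rgb_wall, t)
        # pick the nearer of the two neighbouring RGB frames
        candidates = [k for k in (j - 1, j) if 0 <= k < len(rgb_wall)]
        matches.append(min(candidates, key=lambda k: abs(rgb_wall[k] - t)))
    return matches
```

The start-offset assumption is exactly what breaks in practice, but the same nearest-neighbour search works once the offset is estimated by other means.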

It would be interesting to know whether you are able to synchronize your streams properly. So far, the problem seems quite tough. I was even thinking of extracting descriptors from both the IR and RGB streams and matching descriptors of each IR frame to temporally neighbouring RGB frames. Something like this: https://la.disneyresearch.com/publication/actionsnapping/ The main difficulty is that we have two different modalities, which makes typical feature descriptors, such as ORB, SIFT, and bag-of-words representations, unsuitable.

jveitchmichaelis commented 2 years ago

Sure! This should be a simple fix and it's on my radar anyway. Sync is very difficult, primarily because (a) there is a non-zero start offset between the two streams and (b) pretty much all the cameras break synchronisation when they flat-field (I guess we could detect this by looking for static IR frames?). In practice I've had to do this manually most of the time. I tried a few approaches offline, including FFT-based matching, but none of them worked particularly well (if at all). There is some literature on IR-RGB fusion from descriptors, but I wasn't able to get it to work with my data. Really we need a good synced dataset to work with; maybe then we could just train a CNN or something (e.g. one backbone network for each modality and then train on the loss between the outputs).
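The static-IR-frame idea could be sketched like this (hypothetical code; during flat-field correction the IR image freezes, so consecutive frames are nearly identical — the threshold is an assumption and would need tuning per camera):

```python
import numpy as np

def find_static_frames(frames, threshold=0.5):
    """Return indices i where frame i is (nearly) identical to frame i-1,
    as measured by the mean absolute pixel difference."""
    static = []
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(np.float32)
                              - frames[i - 1].astype(np.float32)))
        if diff < threshold:
            static.append(i)
    return static
```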


LukasBommes commented 2 years ago

After my holiday, I will take a closer look into the synchronization. I was thinking of doing it the following way:

1. calibrate the intrinsics of both cameras
2. undistort the IR and RGB frames
3. find a homography which maps the IR frame onto the RGB frame (in my case the working distance is 10-20 meters while the baseline is a few centimeters, so a homography should be a reasonable approximation)
4. coarsely align the streams using your code, i.e. assuming constant frame rates and a zero starting offset
5. perform fine-grained matching of IR and RGB frames in a local neighborhood (e.g. +-20 frames) based on an image-level similarity metric, such as mutual information or cross-correlation (maybe after applying a low-pass filter, histogram equalization, ...)
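Step 5 could be sketched roughly as below (hypothetical code; zero-mean normalized cross-correlation stands in for the similarity metric, and mutual information could be substituted; frames are assumed to be already warped into a common view by steps 1-3 and converted to float grayscale):

```python
import numpy as np

def zncc(a, b):
    """Zero-mean normalized cross-correlation between two equal-shaped images."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def refine_match(ir_frame, rgb_frames, j0, window=20):
    """Search a +-window neighbourhood around the coarse match j0 for the
    RGB frame maximizing the similarity score."""
    lo, hi = max(0, j0 - window), min(len(rgb_frames), j0 + window + 1)
    scores = [zncc(ir_frame, rgb_frames[j]) for j in range(lo, hi)]
    return lo + int(np.argmax(scores))
```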

The latter step would certainly require some experimentation. Alternatives would be feature-based similarity metrics or the extraction and matching of shapes, such as line segments. May I ask which IR-RGB descriptors you tried out? I found this one, which looks promising: https://www.mdpi.com/1424-8220/20/18/5105 Another way would be to extract keypoints from IR and RGB and find matches based on a geometric constraint (e.g. the homography or, more generally, the fundamental matrix). The frame with the lowest median spatial distance between matched keypoints would then be selected as the match.
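The geometric-constraint variant might look like this (hypothetical sketch; it assumes the homography H from step 3 is known, projects the IR keypoints through it, and scores each candidate RGB frame by the median distance to the nearest RGB keypoint):

```python
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return pts_h[:, :2] / pts_h[:, 2:3]

def median_keypoint_distance(H, ir_pts, rgb_pts):
    proj = project(H, np.asarray(ir_pts, dtype=float))
    rgb = np.asarray(rgb_pts, dtype=float)
    dists = [np.min(np.linalg.norm(rgb - p, axis=1)) for p in proj]
    return float(np.median(dists))

def best_frame(H, ir_pts, rgb_frames_pts):
    """Select the candidate RGB frame with the lowest median distance."""
    scores = [median_keypoint_distance(H, ir_pts, pts) for pts in rgb_frames_pts]
    return int(np.argmin(scores))
```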

A CNN is probably also an option, but it would have to be done in an un- or self-supervised manner, since I have no idea how to acquire the ground truth for synchronization (maybe some lightbulb blinking pattern which encodes walltime...). CNN-based synchronization was attempted here: https://arxiv.org/pdf/1610.05985.pdf Even though nowadays one would probably want to use an N-pair loss instead of the triplet loss.
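For reference, the triplet loss mentioned above pulls the embeddings of a matched IR/RGB pair (anchor, positive) together while pushing a temporally distant frame (negative) away by at least a margin; a minimal numeric sketch (not tied to any particular network):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```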

Did you also notice that the frame rate of the MOV file differs between videos? For my cameras it is 30 Hz, 29.87 Hz, or 29.xx Hz for different MOV files (read out with ffprobe).
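Side note: ffprobe reports the rate as an exact fraction string (e.g. via `ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate -of csv=p=0 video.MOV`), so parsing the fraction avoids the rounded decimal values:

```python
from fractions import Fraction

def parse_frame_rate(r_frame_rate):
    """Convert an ffprobe r_frame_rate string such as '30000/1001' to Hz."""
    return float(Fraction(r_frame_rate))
```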