IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html

Not all 'times_norm' positions accessible #253

Closed PeterGilles closed 2 years ago

PeterGilles commented 3 years ago

When filtering a trackdata tibble that has been normalised with 'normalize_length', not all time positions created are accessible, i.e. through filtering, although all values are present in the data frame. This strange behaviour can be replicated with the ae database. Position 0.2 is correctly filtered, while position 0.6 gives zero results.

[Screenshot, 2021-08-19 14:02:04]

This happens to time positions 0.15, 0.3, 0.35, 0.6, 0.7, 0.85, 0.95.

It might be the case that this is not an emuR problem, but related to a bug in dplyr?

MJochim commented 3 years ago

It’s not a bug, it’s a feature ;-).

This is most likely not a bug in dplyr, nor in emuR, but a property of how computers handle decimal numbers – or a “bug” in computing in general, if you will: What you see as times_norm = 0.6 might very well be something like times_norm = 0.600000001 internally and therefore fail the comparison times_norm == 0.6. Therefore, whenever you handle decimal numbers, be wary of comparing them without giving some tolerance. In fact, even when you think you are dealing with whole numbers only, be wary of that.
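A minimal base R sketch of the effect described above. The `seq()` call is only an assumed stand-in for the grid of normalised time points; emuR may construct `times_norm` differently, so the set of failing positions on a real database can differ.

```r
# Assumed stand-in for the normalised time grid (not necessarily how emuR builds it).
times_norm <- seq(0, 1, by = 0.05)

# Default printing rounds to a few significant digits, so every value looks exact.
print(times_norm)

# Printing 17 decimal places shows what is actually stored.
print(sprintf("%.17f", times_norm))

# Check which "obvious" literals have an exactly equal element in the grid.
targets <- c(0.15, 0.2, 0.3, 0.6)
exact_match <- vapply(targets, function(t) any(times_norm == t), logical(1))
print(data.frame(target = targets, exact_match = exact_match))

# The classic textbook case: the sum is not exactly the literal 0.3.
classic_equal <- (0.1 + 0.2) == 0.3                   # FALSE
# Comparing with a tolerance gives the answer most people expect.
classic_close <- isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE
```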

The only place where you can safely compare numbers without any tolerance is (when you exactly know how your computer handles them or) when your programming language knows that you have an int-typed number (integer), which you can see, for example, in your tibble output in the line below the headings for sl_rowIdx. Whenever the programming language considers your number a floating point number, or in short: float, or “double-precision float” – hence the <dbl> in the tibble output for start etc. –, take care of this problem.
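As a small illustration of the integer versus double distinction, here is a base R sketch; the variable names are made up for this example.

```r
# A literal with the L suffix is stored as an integer, a plain literal as a double.
int_type <- typeof(1L)   # "integer"
dbl_type <- typeof(1)    # "double"

# Integer arithmetic is exact, so comparing without a tolerance is safe here.
int_sum_ok <- identical(2L + 3L, 5L)   # TRUE

# A double that "should" be a whole number can still miss it by a tiny amount.
whole_looking <- sqrt(2)^2
dbl_equal <- whole_looking == 2        # FALSE
print(sprintf("%.17f", whole_looking))
```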

You can find an illustration, a thorough explanation, and a number of hints on what you can do about it at https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal/9508558.

PeterGilles commented 3 years ago

Thanks, I was afraid of something like that. :)

PeterGilles commented 3 years ago

Some hack like dplyr::filter(times_norm >= time & times_norm <= time + 0.05) does the trick.

MJochim commented 3 years ago

I just had a closer look at the Stack Overflow answer I linked to, and I find their last variant rather intriguing:

dplyr::filter(dplyr::near(times_norm, time))

I didn’t know about the near function before. It does exactly what we are looking for here, and in a concise way that nicely fits into the tidyverse. Also, they have chosen a default value for tolerance (tol = .Machine$double.eps^0.5, roughly 1.5e-8) that appears to be very clever. I’d have to read up on floating-point arithmetic to vouch for it; but then again, the dplyr folks don't really need me to vouch for their work ;-).

Note that in your solution, the tolerance is only above time, not below it.
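A short sketch comparing the three filtering styles discussed in this thread. The tibble `td` is made-up toy data rather than real emuR output, and the value of `tol` in the last variant is an arbitrary choice for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for a normalised trackdata tibble.
# The first time value is deliberately built so that it is not exactly 0.3.
td <- tibble(times_norm = c(0.1 + 0.2, 0.6), value = c(1, 2))
time <- 0.3

# Exact comparison: misses the row because of the tiny representation error.
exact_rows <- filter(td, times_norm == time)

# dplyr::near() compares within a tolerance (default: .Machine$double.eps^0.5).
near_rows <- filter(td, near(times_norm, time))

# Hand-written symmetric tolerance: accepts values slightly below or above `time`.
tol <- 1e-8  # arbitrary illustrative tolerance
sym_rows <- filter(td, abs(times_norm - time) < tol)

print(c(exact = nrow(exact_rows), near = nrow(near_rows), symmetric = nrow(sym_rows)))
```

Unlike a window of the form `time` to `time + 0.05`, both tolerance-based variants are symmetric around `time` and narrow enough not to pick up a neighbouring time point.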