HikingDev opened this issue 2 years ago:

Any particular reason why scientific notation (`'e'`) is not used?
I don't remember :( Possibly it was truncating values? Really don't know. @tadeu @arthursoprana maybe you remember?
Anyway, AFAIU, the problem is that those values are then read back as "integers", so the `atol` and `rtol` will not work anymore, is that it?
It may be OK to change it as you suggested.
IF we run into any trouble, another suggestion (and I'm unsure if it helps) would be to, instead of relying on the data type from the CSV file (which would be mistakenly read as `int` in this case), use the dtype passed as parameter to the `check()` function.
To be more clear, in this code, instead of using `if np.issubdtype(obtained_column.values.dtype, np.inexact):`, we could try using `if np.issubdtype(expected_column.values.dtype, np.inexact):`, because (I hope) the `expected_column` carries the desired data type.
Let me know if that makes sense (I didn't try)
It seems that the proposed suggestion is to change this part of the code to:

```python
data_object.to_csv(
    str(filename),
    float_format=f"%.{DataFrameRegressionFixture.DISPLAY_PRECISION}e",
)
```
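For context on why the format specifier matters here: `%g` drops the decimal point for whole-number floats, while `%e` keeps an unambiguously float representation. A quick illustration, assuming a `DISPLAY_PRECISION` of 17:

```python
>>> "%.17g" % 3.0   # whole-number float serialized without a decimal point
'3'
>>> "%.17e" % 3.0   # scientific notation keeps it unambiguously a float
'3.00000000000000000e+00'
```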
I cannot recall any particular reason why `%.{prec}g` is being used instead of `%.{prec}e`. Perhaps we could do that?

I haven't tested, but it seems that this would not break current tests, except that if anyone tries to regenerate all expected values, then there will be a diff related to the change from `g` to `e`, right?
@tarcisiofischer the fix, changing to what @arthursoprana posted, fixes my issues. My current workaround is to convert integers to floats. The suggested fix serializes and deserializes correctly. This means that `rtol` and `atol` also work for me.

However, I still think that the API should be extended to provide tolerances for integers (exact type).
```python
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'col_1': [1.0, 2.0, 3.0],
...     'col_2': [21.0, 31.0, 32.0],
...     'col_3': [1010, 1020, 1030]
... })
>>> df
   col_1  col_2  col_3
0    1.0   21.0   1010
1    2.0   31.0   1020
2    3.0   32.0   1030
>>> df.dtypes
col_1    float64
col_2    float64
col_3      int64
dtype: object
>>> df = pd.read_csv('test.csv')
>>> df
   Unnamed: 0  col_1  col_2  col_3
0           0    1.0   21.0   1010
>>> df.dtypes
Unnamed: 0      int64
col_1         float64
col_2         float64
col_3           int64
dtype: object
```
> @tarcisiofischer the fix, changing to what @arthursoprana posted, fixes my issues.

Awesome! :D Could you please open an MR with this plus a test? (The test could do exactly what you are doing, checking the dtypes. It should fail with the previous implementation.)

> However, I still think that the API should be extended to provide tolerances for integers (exact type).

Agree. Are you planning on tackling this as well? Do you agree with me that this shouldn't be done using `np.isclose` as the implementation?

Thanks for your efforts so far :D
@tarcisiofischer I am unsure, but I will go with what the library maintainers prefer ;-)

I was digging into similar discussions and came across PEP 485 – A Function for testing approximate equality, which includes a section on supporting non-float types with `isclose()` as well. Also, the `numpy.isclose()` documentation mentions in the last paragraph that:

> isclose is not defined for non-numeric data types. bool is considered a numeric data-type for this purpose.

which suggests that `isclose()` is also intended for integers. Regarding the edge cases you have mentioned, shouldn't the developer handle overflow errors?
https://numpy.org/doc/stable/user/basics.types.html#overflow-errors
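The overflow in question is NumPy's fixed-width integer behaviour, where out-of-range results wrap around instead of raising; a minimal illustration:

```python
>>> import numpy as np
>>> np.array([10**18]) * 10   # exceeds the int64 range and wraps around to a negative value
array([-8446744073709551616])
>>> 10**18 * 10               # plain Python ints have arbitrary precision
10000000000000000000
```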
I will try to provide a test case before Wednesday (when I go on vacation). Can't promise though ;-)
If the sentiment for integer tolerances is towards extending the API, I will see what I can come up with (in about 4 weeks' time).
> I cannot recall any particular reason why `%.{prec}g` is being used instead of `%.{prec}e`. Perhaps we could do that?

I think that `%g` was preferred because it's more readable/"human-friendly".

> I haven't tested, but it seems that this would not break current tests except that if anyone tries to regenerate all expected values, then there will be a diff related to the change from `g` to `e`, right?
Yes, it looks like this is what would happen, but in order to avoid the noise for regenerated values my opinion is that it would be better to just solve this issue with something similar to #96 (but with tests). With that we would also add support for true integer datasets.
Well, if that's the feeling, I think you may just create some tests for the other issues and drop the idea of keeping the data types. I still think this is not a good approach, but I won't oppose, since there are more people voting for the #96 solution. I'm sorry for all the noise.

Just to be clear: I prefer the changes for `%g`, as they solve a bug, and I think handling integers with `np.isclose` is wrong, because the values will be interpreted as floats, you'll have floating-point comparison, and you may end up in unexpected situations.

Sorry you had to go through all this @HikingDev - I sincerely was just trying to move towards what I thought was the correct direction.
> However, I still think that the API should be extended to provide tolerances for integers (exact type).

I agree on this. Let's just take care so that it's consistent with the way "closeness" is checked for `float`s. The simplest way seems to be to just roll our own `isclose` for integers, using the same asymmetric definition that `numpy` uses:
```python
def isclose(a: int | float, b: int | float, atol: float, rtol: float) -> bool:
    return abs(a - b) <= atol + rtol * abs(b)
```
(Note that this is still not 100% precise because of implicit conversions to floating-point, but it seems OK even for very large integers and reasonable `atol` and `rtol` values -- a 100% precise implementation would require the conversion of `rtol` and `atol` to integer fractions.)
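A minimal sketch of what that exact variant could look like, using the standard library's `fractions` module (the name `isclose_exact` is illustrative, not part of the library):

```python
from fractions import Fraction

def isclose_exact(a: int, b: int, atol: float, rtol: float) -> bool:
    # Convert the tolerances to exact rationals so the asymmetric check
    # abs(a - b) <= atol + rtol * abs(b) is evaluated without any
    # intermediate float rounding, even for arbitrarily large integers.
    return abs(a - b) <= Fraction(atol) + Fraction(rtol) * abs(b)
```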
> Just to be clear: I prefer the changes for `%g`, as they solve a bug, and I think handling integers with `np.isclose` is wrong.

My point is that if integers are handled correctly, then we don't need to change `%g` to `%e`, which would introduce unnecessary noise in changesets when using `--force-regen`, do you agree? An `isclose` for integers would also make it behave better for large integers (a nice new feature, but it could be done separately if you wish).

I'm unsure.
(a) Someone saves `pd.DataFrame({'col_1': [1.0, 2.0, 3.0]})`; those are meant to be floats.
(b) Someone loads it again. Since this PR is not merged, they are now integers.
(c) Since they are integers, they'll be handled with integer comparison.
(c.1) IF you decide to use `np.isclose`, then integers will be converted back to float. I find this path weird, but perhaps it is guaranteed that no bugs will arise from that. But then again, having `atol` and `rtol` be floats is weird for integer comparison,
or
(c.2) IF you decide to handle integers in a different way, then your floats are now saved as integers and integer comparison applies, which (for me) would be unexpected.
I'm really unsure if I'm missing something here. I tried to put the path "on paper" to check it myself, but isn't that what would happen?
Perhaps (c.1) is the best path to take (and is what was being attempted in the first place). If all agree on that, I won't oppose - I just wanted to be sure edge cases would be handled correctly, such as if I have numbers that are not representable as integers, or the other way around, or if I have tolerances that may truncate up vs. truncate down when comparing to integers, etc.
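One concrete instance of the representability concern: integers above 2**53 cannot be stored exactly in a float64, so coercing them to float for the comparison silently loses information:

```python
>>> float(2**53) == float(2**53 + 1)   # both round to 9007199254740992.0
True
>>> 2**53 == 2**53 + 1                 # exact integer comparison keeps the difference
False
```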
I think that if `isclose` is consistent for `float`s and `int`s, then we shouldn't have the (c.2) problem. It doesn't matter that the "integer floats" got converted to `int`; they would be compared in the same way as if they were `float`.
I don't find having `atol` and `rtol` be floats that weird; it seems to be very practical: you can check that `10**30 + 1` and `10**30` are within a `1e-30` relative tolerance or a `1.0` absolute tolerance without problems.
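A quick interactive check of that claim, using the integer-friendly `isclose` sketched earlier in the thread (the calls are illustrative):

```python
>>> def isclose(a, b, atol, rtol):
...     return abs(a - b) <= atol + rtol * abs(b)
...
>>> isclose(10**30 + 1, 10**30, atol=1.0, rtol=0.0)    # within an absolute tolerance of 1.0
True
>>> isclose(10**30 + 1, 10**30, atol=0.0, rtol=1e-29)  # and within a loose relative tolerance
True
>>> isclose(10**30 + 2, 10**30, atol=1.0, rtol=0.0)    # a larger difference is still caught
False
```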
I agree with most of your points; they are all nice contributions! My main concerns here are about backward compatibility/consistency and avoiding the noise of modifying all datasets.
As I said, I won't oppose.
Just for the record, this may be an example of a corner case (I may be wrong, I did it quickly just to check). If the dtype is float:

```python
>>> np.isclose([100000000000000000000.], [100000000000000000001.], atol=1)
array([ True])
```

But, since it is loaded as `int`:

```python
>>> np.isclose([100000000000000000000], [100000000000000000001], atol=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 180, in isclose
  File "[..]numeric.py", line 2372, in isclose
    xfin = isfinite(x)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
```
From a practical perspective, this may be minor. I won't bother you further :)
@tarcisiofischer, check it with the function I suggested in https://github.com/ESSS/pytest-regressions/issues/97#issuecomment-1165589084 (the function would need to be changed, adding an `all()` and iteration, to account for arrays/sequences).
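A minimal sketch of that array-aware variant, assuming plain sequences of Python ints so that huge values are never coerced to float64 (`isclose_sequences` is an illustrative name, not the library implementation):

```python
def isclose_sequences(a, b, atol: float, rtol: float) -> bool:
    # Element-wise asymmetric check combined with all(), as suggested above.
    # Python ints have arbitrary precision, so the huge-integer corner case
    # from the previous comment is handled without a TypeError.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# e.g. isclose_sequences([100000000000000000000], [100000000000000000001], atol=1, rtol=0.0) -> True
```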
One of my points is that the comparison between `100000000000000000000` and `100000000000000000001` is already not supported, so it's a new feature, and it could be done separately from the fix for the current problem (if that's a burden on the implementation, but I also don't oppose doing these things together).
Just realized that this is probably the cause of a problem I have with one of my tests being flaky. Any news on when a fix might be finished? Looks like the conversation died a while back. Or did this get resolved elsewhere somehow?
Thanks!
In the past month I have twice encountered flaky tests (related to rounding differences between Windows and numpy) in two different projects where the expected value is a float, but the observed value turned out to be an integer, e.g.:
```
E   obtained_x   expected_x             diff
E   19           19.00000000000001066   0.00000000000001066
E   19           19.00000000000001066   0.00000000000001066
```
This test failure occurred even though the tolerance was set to `default_tolerance={"atol": 1e-3}`.
The failure happens because of the combination of two root causes: (1) the conversion of floats into integers due to the use of `%.{prec}g` instead of `%.{prec}e` (see above), and (2) checking only the dtype of the `obtained_column` in https://github.com/ESSS/pytest-regressions/blob/master/src/pytest_regressions/dataframe_regression.py#L131:
`if np.issubdtype(obtained_column.values.dtype, np.inexact):`
Fixing the second root cause is straightforward and, I think, much less problematic with regard to backward compatibility/consistency than fixing the first root cause, while being sufficient to fix those flaky tests:
`if np.issubdtype(obtained_column.values.dtype, np.inexact) or np.issubdtype(expected_column.values.dtype, np.inexact):`
Of course, this fix still won't be sufficient if `atol >= 1` AND both obtained and expected are integers, but I think it's sufficient for all other cases.
Alternatively, one can also simply test for `Number` instead of floats, as suggested in https://github.com/ESSS/pytest-regressions/issues/92, which I think would solve all cases:
`if np.issubdtype(obtained_column.values.dtype, np.number):`
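For reference, a quick check of the NumPy dtype hierarchy these conditions rely on:

```python
>>> import numpy as np
>>> np.issubdtype(np.dtype("int64"), np.inexact)    # integer columns currently skip the tolerance path
False
>>> np.issubdtype(np.dtype("float64"), np.inexact)
True
>>> np.issubdtype(np.dtype("int64"), np.number)     # the broader check catches both
True
```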
(From the original issue description) In `_dump_fn`, the serialized CSV uses the `'g'` presentation type; according to the Python docs on the format specification mini-language (https://docs.python.org/3/library/string.html#format-specification-mini-language), this is where the flaw comes from. Any particular reason why scientific notation (`'e'`) is not used?