ESSS / pytest-regressions

Pytest plugin for regression testing: https://pytest-regressions.readthedocs.io
MIT License

The serialization of a DataFrame with floats automatically casts floats into integers #97

Open HikingDev opened 2 years ago

HikingDev commented 2 years ago

The serialization done in _dump_fn can be reproduced standalone:

import pandas as pd

df = pd.DataFrame({
    'col_1': [1.0, 2.0, 3.0],
    'col_2': [21.0, 31.0, 32.0],
    'col_3': [101.0, 102.0, 103.0],
})

df.to_csv("test.csv", float_format="%.17g")

df
#   col_1  col_2  col_3
# 0    1.0   21.0  101.0
# 1    2.0   31.0  102.0
# 2    3.0   32.0  103.0

df.dtypes
# col_1    float64
# col_2    float64
# col_3    float64
# dtype: object

The serialized CSV:

,col_1,col_2,col_3
0,1,21,101
1,2,31,102
2,3,32,103
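
Reading that file back, all three columns come back as integers (a quick standalone check, not part of the original report):

>>> pd.read_csv("test.csv", index_col=0).dtypes
col_1    int64
col_2    int64
col_3    int64
dtype: object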

https://docs.python.org/3/library/string.html#format-specification-mini-language

According to the Python docs, the 'g' presentation type for format strings has this flaw:

... In both cases insignificant trailing zeros are removed from the significand, and the decimal point is also removed if there are no remaining digits following it, unless the '#' option is used. ...

Is there any particular reason why the scientific notation ('e') is not used?

Scientific notation. For a given precision p, formats the number in scientific notation with the letter ‘e’ separating the coefficient from the exponent. The coefficient has one digit before and p digits after the decimal point, for a total of p + 1 significant digits. With no precision given, uses a precision of 6 digits after the decimal point for float, and shows all coefficient digits for Decimal. If no digits follow the decimal point, the decimal point is also removed unless the # option is used.
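
As a quick illustration of the difference between the two presentation types (standalone Python, not from the plugin):

>>> "%.17g" % 1.0    # trailing zeros and the decimal point are dropped
'1'
>>> "%.17e" % 1.0    # the value stays recognisably a float
'1.00000000000000000e+00'
>>> "%.17g" % 0.1    # non-integral values are unaffected
'0.10000000000000001'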

tarcisiofischer commented 2 years ago

Is there any particular reason why the scientific notation ('e') is not used?

I don't remember :( Possibly it was truncating values? Really don't know. @tadeu @arthursoprana maybe you remember?

Anyway, AFAIU, the problem is that those values are then read back as "integers", so the atol and rtol will not work anymore, is that it?

It may be ok to change it as you suggested.

If we run into any trouble, another suggestion (and I'm unsure if it helps) would be, instead of relying on the data type from the CSV file (which is mistakenly read as int in this case), to use the dtype passed as parameter to the check() function. To be clearer: in this code, instead of using if np.issubdtype(obtained_column.values.dtype, np.inexact):, we could try if np.issubdtype(expected_column.values.dtype, np.inexact):, because (I hope) the expected_column carries the desired data type. Let me know if that makes sense (I didn't try it).

arthursoprana commented 2 years ago

It seems the suggestion is to change this part of the code to:

data_object.to_csv(
    str(filename),
    float_format=f"%.{DataFrameRegressionFixture.DISPLAY_PRECISION}e",
)

I cannot recall any particular reason why %.{prec}g is being used instead of %.{prec}e. Perhaps we could do that?

I haven't tested, but it seems that this would not break current tests except that if anyone tries to regenerate all expected values, then there will be a diff related to the change from g to e, right?
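
For reference, a standalone round-trip with an explicit %e format keeps the float dtype on read-back (17 digits used here purely for illustration; the fixture's actual precision constant may differ):

>>> import pandas as pd
>>> df = pd.DataFrame({'col_1': [1.0, 2.0, 3.0]})
>>> df.to_csv("test.csv", float_format="%.17e")
>>> pd.read_csv("test.csv", index_col=0).dtypes
col_1    float64
dtype: object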

HikingDev commented 2 years ago

@tarcisiofischer changing to what @arthursoprana posted fixes my issues. My current workaround is to convert integers to floats. The suggested fix serializes and deserializes correctly, which means that rtol and atol also work for me.

However, I still think the API should be extended to provide tolerances for integers (exact type).

>>> import pandas as pd

>>> df = pd.DataFrame({
...                         'col_1': [1.0, 2.0, 3.0],          
...                         'col_2': [21.0, 31.0, 32.0],     
...                         'col_3': [1010, 1020, 1030]  
... })
>>> df
   col_1  col_2  col_3
0    1.0   21.0   1010
1    2.0   31.0   1020
2    3.0   32.0   1030
>>> df.dtypes
col_1    float64
col_2    float64
col_3      int64

>>> df = pd.read_csv('test.csv')
>>> df
   Unnamed: 0  col_1  col_2  col_3
0           0    1.0   21.0   1010

>>> df.dtypes
Unnamed: 0      int64
col_1         float64
col_2         float64
col_3           int64
dtype: object

tarcisiofischer commented 2 years ago

@tarcisiofischer changing to what @arthursoprana posted fixes my issues.

Awesome! :D Could you please open an MR with this plus a test? (The test could do exactly what you are doing: check the dtypes. It should fail with the previous implementation.)

However, I still think the API should be extended to provide tolerances for integers (exact type).

Agreed. Are you planning on tackling this as well? Do you agree with me that this shouldn't be done using np.isclose as the implementation?

Thanks for your efforts so far :D

HikingDev commented 2 years ago

@tarcisiofischer I am unsure, but I will go with what the library maintainers prefer ;-)

I was digging into similar discussions and came across PEP 485 – A Function for testing approximate equality, which includes a section on supporting non-float types with isclose().

Also the numpy.isclose() documentation mentions in the last paragraph that:

isclose is not defined for non-numeric data types. bool is considered a numeric data-type for this purpose.

which suggests that isclose() is also intended for integers. Regarding the edge cases you have mentioned, shouldn't the developer handle overflow errors?

https://numpy.org/doc/stable/user/basics.types.html#overflow-errors

I will try to provide a test case before Wednesday (when I go on vacation). Can't promise, though ;-)

If the sentiment on integer tolerances is towards extending the API, I will see what I can come up with (in about 4 weeks' time).

tadeu commented 2 years ago

I cannot recall any particular reason why %.{prec}g is being used instead of %.{prec}e. Perhaps we could do that?

I think that %g was preferred because it's more readable/"human-friendly".

I haven't tested, but it seems that this would not break current tests except that if anyone tries to regenerate all expected values, then there will be a diff related to the change from g to e, right?

Yes, it looks like this is what would happen, but in order to avoid the noise for regenerated values my opinion is that it would be better to just solve this issue with something similar to #96 (but with tests). With that we would also add support for true integer datasets.

tarcisiofischer commented 2 years ago

Well, if that's the feeling, I think you may just create some tests for the other issues and drop the idea of keeping the data types. I still think this is not a good approach, but I won't oppose, since there are more people voting for the #96 solution. I'm sorry for all the noise.

Just to be clear: I prefer changing the %g format, as that solves a bug, and I think handling integers with np.isclose is wrong, because the values will be interpreted as floats; you'll have floating-point comparison and may end up in unexpected situations.

Sorry you had to go through all this @HikingDev - I sincerely was just trying to move towards what I thought was the correct direction.

tadeu commented 2 years ago

However, I still think the API should be extended to provide tolerances for integers (exact type).

I agree on this. Let's just take care that it's consistent with the way "closeness" is checked for floats. The simplest way seems to be to roll out our own isclose for integers using the same asymmetric definition that numpy uses:

def isclose(a: int | float, b: int | float, atol: float, rtol: float) -> bool:
    return abs(a - b) <= atol + rtol * abs(b)

(Note that this is still not 100% precise because of implicit conversions to floating-point, but it seems OK even for very large integers and reasonable atol and rtol values -- a 100% precise implementation would require the conversion of rtol and atol to integer fractions)
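
As a quick sanity check of that sketch with values too large for float64 to distinguish (illustrative only):

>>> a, b = 10**30 + 1, 10**30
>>> isclose(a, b, atol=1.0, rtol=0.0)   # the integer difference of 1 is kept exactly
True
>>> isclose(a, b, atol=0.5, rtol=0.0)
False
>>> float(a) == float(b)                # after conversion to float the difference is gone
True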

tadeu commented 2 years ago

Just to be clear: I prefer changing the %g format, as that solves a bug, and I think handling integers with np.isclose is wrong.

My point is that if integers are handled correctly, then we don't need to change %g to %e, which would introduce unnecessary noise to changesets when using --force-regen, do you agree?

#96 would already fix that for most cases; a custom isclose for integers would also make it behave better for large integers (a nice new feature, but it could be done separately if you wish).

tarcisiofischer commented 2 years ago

I'm unsure.

(a) Someone saves pd.DataFrame({'col_1': [1.0, 2.0, 3.0]}); those values are meant to be floats.

(b) Someone loads it again. Since this PR is not merged, they are now integers.

(c) Since they are integers, they'll be handled with integer comparison. (c.1) If you decide to use np.isclose, the integers will be converted back to float. I find this path weird, but perhaps it is guaranteed that no bugs will arise from it; then again, having atol and rtol be floats is weird for integer comparison. Or (c.2) if you decide to handle integers in a different way, then your floats are now saved as integers and integer comparison applies, which (for me) would be unexpected.

I'm really unsure if I'm missing something here. I tried to put the path "on paper" to check it myself, but isn't that what would happen?

Perhaps (c.1) is the best path to take (and is what was being tried in the first place). If all agree on that, I won't oppose - I just wanted to be sure edge cases would be handled correctly, such as if I have numbers that are not representable as integers, or the other way around, or if I have tolerances that may truncate up vs. truncate down when comparing to integers, etc.

tadeu commented 2 years ago

I think that if isclose is consistent for floats and ints, then we shouldn't have the (c.2) problem. It doesn't matter that the "integer floats" got converted to int; they would be compared in the same way as if they were floats.

I don't find having atol and rtol be floats that weird; it seems very practical: you can check that 10**30+1 and 10**30 are within a 1e-30 relative tolerance or a 1.0 absolute tolerance without problems.

I agree with most of your points, they are all nice contributions! My main concerns here are on backward compatibility/consistency, and to avoid the noise on modifying all datasets.

tarcisiofischer commented 2 years ago

As I said, I won't oppose.

Just for the record, this may be an example of a corner case (I may be wrong, I did it quickly just to check):

If dtype is float:

>>> np.isclose([100000000000000000000.], [100000000000000000001.], atol=1)
array([ True])

But, since it is loaded as int:

>>> np.isclose([100000000000000000000], [100000000000000000001], atol=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 180, in isclose
  File "[..]numeric.py", line 2372, in isclose
    xfin = isfinite(x)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

From a practical perspective, this may be minor. I won't bother you further :)

tadeu commented 2 years ago

@tarcisiofischer, check it with the function I suggested on https://github.com/ESSS/pytest-regressions/issues/97#issuecomment-1165589084

(the function would need to be changed, adding an all() and iteration, to account for arrays/sequences)
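
Something along these lines, perhaps (a rough sketch, untested against the fixture; isclose_seq is just a placeholder name). Keeping the arithmetic in Python integers avoids the isfinite error shown above:

def isclose_seq(a, b, atol: float, rtol: float) -> bool:
    # Element-wise version of the asymmetric check above; Python ints
    # keep the exact difference even for arbitrarily large values.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

>>> isclose_seq([100000000000000000000], [100000000000000000001], atol=1, rtol=0)
True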

One of my points is that the comparison between 100000000000000000000 and 100000000000000000001 is already not supported, so it's a new feature and could be done separately from the fix for the current problem (if that's a burden on the implementation; I also don't oppose doing these things together).

relativistic commented 2 years ago

Just realized that this is probably the cause of a problem I have with one of my tests being flaky. Any news on when a fix might be finished? Looks like the conversation died a while back. Or did this get resolved elsewhere somehow?

Thanks!

kdebrab commented 3 days ago

In the past month I have twice encountered flaky tests (related to rounding differences between Windows and numpy) in two different projects where the expected value is a float, but the obtained value turned out to be an integer, e.g.:

E   obtained_x          expected_x               diff
E       19        19.00000000000001066    0.00000000000001066
E       19        19.00000000000001066    0.00000000000001066

This test failure occurred even though the tolerance was set to default_tolerance={"atol": 1e-3}.

The failure happens because of the combination of two root causes: (1) the conversion of floats into integers due to the use of %.{prec}g instead of %.{prec}e (see above), and (2) checking only the dtype of the obtained_column in https://github.com/ESSS/pytest-regressions/blob/master/src/pytest_regressions/dataframe_regression.py#L131:

if np.issubdtype(obtained_column.values.dtype, np.inexact):

Fixing the second root cause is straightforward and I think much less problematic with regard to backward compatibility/consistency than fixing the first root cause, while being sufficient to fix those flaky tests:

if np.issubdtype(obtained_column.values.dtype, np.inexact) or np.issubdtype(expected_column.values.dtype, np.inexact):

Of course, this fix still won't be sufficient if atol >= 1 AND both obtained and expected are integers, but I think it's sufficient for all other cases.

Alternatively, one can also simply test for Number instead of floats, as suggested in https://github.com/ESSS/pytest-regressions/issues/92, which I think would solve all cases:

if np.issubdtype(obtained_column.values.dtype, np.number):
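
For completeness, a quick interactive check of the difference between the two dtype predicates:

>>> import numpy as np
>>> np.issubdtype(np.dtype('int64'), np.inexact)
False
>>> np.issubdtype(np.dtype('int64'), np.number)
True
>>> np.issubdtype(np.dtype('float64'), np.number)
True

With the np.number check, integer columns would take the same tolerance-based comparison path as float columns.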