SethMMorton / natsort

Simple yet flexible natural sorting in Python.
https://pypi.org/project/natsort/
MIT License
907 stars 52 forks source link

Some values don't sort in a consistent order #149

Closed benrg closed 1 year ago

benrg commented 2 years ago

The output of natsorted depends on the input order in many cases:

>>> def sorted_both_ways(a, b, sorted=natsort.natsorted):
...     return sorted([a, b]), sorted([b, a])
...
>>> sorted_both_ways(1, '1')
([1, '1'], ['1', 1])
>>> sorted_both_ways('1', '01')
(['1', '01'], ['01', '1'])
>>> sorted_both_ways(float('-inf'), float('nan'))
([-inf, nan], [nan, -inf])
>>> sorted_both_ways(float('nan'), None)
([nan, None], [None, nan])

In the case of NaNs, there's a flag to sort them at the end instead of the beginning, but they don't sort at the end with it and don't sort at the beginning without it, even if the input is only floats. Code that depends on NaN sorting order is likely to check the 0th or -1th item of the output to see whether there are any NaNs, which leads to hard-to-find bugs with the current behavior.

In the case of '1' and '01', I don't think the order matters, but I think there should be an order. StrCmpLogicalW sorts 01 < 1 < 02 < 2; I don't claim that's the best order, but it's consistent and matches at least one widely used tool.

In the case of 1 and '1', my preference would be that it raise TypeError. People who pass the FLOAT flag are working around some horrible brokenness elsewhere in their system, and deserve all the bad things that happen to them, but I have a hard time believing that anyone needs/wants the current int-only behavior without the flag. The library already raises TypeError when sorting [1, b'1'], so it's not like it's aiming to work with arbitrary mixtures of heterogeneous values.

SethMMorton commented 2 years ago

In all four of your examples, under-the-hood natsort is treating the values as the same:

So, these are all demonstrating that python sorted uses a stable sort algorithm. If you want to ensure consistent ordering you could sort your input without natsort first, then sort with, and you should always get consistent results.

In the case of NaNs, there's a flag to sort them at the end instead of the beginning, but they don't sort at the end with it and don't sort at the beginning without it, even if the input is only floats.

Can you demonstrate that the flag doesn't change behavior?

The library already raises TypeError when sorting [1, b'1'], so it's not like it's aiming to work with arbitrary mixtures of heterogeneous values.

This is not true. Please check out https://github.com/SethMMorton/natsort#sorting-mixed-types. The fact that bytes are not handled as part of this is explicitly called out in https://github.com/SethMMorton/natsort#handling-bytes.

People who pass the FLOAT flag are working around some horrible brokenness elsewhere in their system, and deserve all the bad things that happen to them, but I have a hard time believing that anyone needs/wants the current int-only behavior without the flag.

I am not sure what you are trying to say here, but ideally it would have been said without wishing ill on people who use this library in a way you disagree with. There is a "be nice" policy here, I would appreciate if you would follow that guideline.

SethMMorton commented 2 years ago

I can think of easy ways to solve the NaN/None issue and would be open to fixing that, but the "1"/"01" issue is much more challenging. There is no easy way to solve this without using heuristics, and even then they would likely be fragile and slow down the sorting algorithm. Did you have a specific suggestion of how to solve the problem?

SethMMorton commented 1 year ago

In the next natsort release, the handling of None and NaN will be done automatically under-the-hood to always be consistently sorted. For the numbers case, this can be solved by adding ns.PRESORT, which pre-sorts your data (as suggested above).