Closed roll closed 4 years ago
Totals | |
---|---|
Change from base Build 387: | 0.2% |
Covered Lines: | 1647 |
Relevant Lines: | 1947 |
@akariv
@cschloer
This one is more like a request for a discussion than a PR because this issue - https://github.com/BCODMO/frictionless-usecases/issues/21 - is kinda tricky. As I see sort_rows
is fully string-based processor comparison-wise.
Consider it as POC for handling numbers correctly but I'm not sure about performance and maybe there is a better way to achieve the same goal (e.g. mutating the key
once instead of every value).
WDYT? Any ideas?)
PS. One of the existent tests was written with the understanding that the string comparison is going to happen for numbers. This PR updates it.
The initial plan was too naive. Let's see what we can do else..
@akariv
It seems I made it work but it looks very complicated, not fully proofed and will have some limitations (mistakes in comparing big numbers like 10000000001
etc)
TBH I'm not sure what's direction is the most promising...
I thought about different approaches using real number comparison and all of them seems to be much less effective memory or performance-wise. We probably can use a KVFile
modification with a numeric key but it's a limited case for one column ordering.
Maybe we can add an in-memory sorter processor or mode so the client will be able to choose real numbers sort knowing that his data is not too large.
WDYT?
Here is an update from @akariv I'm going to use to update the PR:
If we use the binary representation of the float, it will ‘just work’
https://www.h-schmidt.net/FloatConverter/IEEE754.html and no need for expensive formatting / transformations just dump the float, as it’s in memory, to a hex string
https://stackoverflow.com/questions/16444726/binary-representation-of-float-in-python-bits-not-hex
@akariv Please take a look
The new solution based on binary arrays seems much nicer, passes the same tests and hopefully more performant and precise (for cases like 100000000000001).
Looks good @roll!
Do the tests cover the Decimal case as well?
@akariv I've added the real data file sorting which also covers decimals sorting