datahq / dataflows

DataFlows is a simple, intuitive lightweight framework for building data processing flows in python.
https://dataflows.org
MIT License
194 stars 39 forks source link

Fix sort to work correctly for numbers #111

Closed roll closed 4 years ago

roll commented 4 years ago
coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 392


Totals Coverage Status
Change from base Build 387: 0.2%
Covered Lines: 1647
Relevant Lines: 1947

💛 - Coveralls
roll commented 4 years ago

@akariv @cschloer This one is more like a request for a discussion than a PR because this issue - https://github.com/BCODMO/frictionless-usecases/issues/21 - is kinda tricky. As I see sort_rows is fully string-based processor comparison-wise.

Consider it as POC for handling numbers correctly but I'm not sure about performance and maybe there is a better way to achieve the same goal (e.g. mutating the key once instead of every value).

WDYT? Any ideas?)

PS. One of the existent tests was written with the understanding that the string comparison is going to happen for numbers. This PR updates it.

roll commented 4 years ago

The initial plan was too naive. Let's see what we can do else..

roll commented 4 years ago

@akariv It seems I made it work but it looks very complicated, not fully proofed and will have some limitations (mistakes in comparing big numbers like 10000000001 etc)

TBH I'm not sure what's direction is the most promising...

I thought about different approaches using real number comparison and all of them seems to be much less effective memory or performance-wise. We probably can use a KVFile modification with a numeric key but it's a limited case for one column ordering.

Maybe we can add an in-memory sorter processor or mode so the client will be able to choose real numbers sort knowing that his data is not too large.

WDYT?

roll commented 4 years ago

Here is an update from @akariv I'm going to use to update the PR:

If we use the binary representation of the float, it will ‘just work’

https://www.h-schmidt.net/FloatConverter/IEEE754.html and no need for expensive formatting / transformations just dump the float, as it’s in memory, to a hex string

https://stackoverflow.com/questions/16444726/binary-representation-of-float-in-python-bits-not-hex

roll commented 4 years ago

@akariv Please take a look

The new solution based on binary arrays seems much nicer, passes the same tests and hopefully more performant and precise (for cases like 100000000000001).

akariv commented 4 years ago

Looks good @roll!

Do the tests cover the Decimal case as well?

roll commented 4 years ago

@akariv I've added the real data file sorting which also covers decimals sorting