100trillionUSD / bitcoin


Pandas implementation #11

Closed: FrancisBehnen closed this 3 years ago

FrancisBehnen commented 3 years ago

You mentioned that you had an error. I'm not getting it with this code; can you check whether you still get it? https://gist.github.com/FrancisBehnen/df215330de29e6969dec8d69658e2621

FrancisBehnen commented 3 years ago

I've got a very efficient analysis algo for once the data is in a dataframe: https://github.com/100trillionUSD/bitcoin/blob/4ddf7de5bedd696db41578e0aa8875889f18b015/pandas5.py. Unfortunately, pandas has a hard time reading the file, and so far I haven't found a way to resolve that :/

FrancisBehnen commented 3 years ago

pandas6.py runs at the same speed, but ironically it uses almost the same algo as yours. Maybe a pandas wizard knows a trick to greatly improve the reading speed, but I'm out of ideas.

FrancisBehnen commented 3 years ago

Apparently pandas has a hard time working with very sparse matrices (density ~0.2% in the test file). Dask can read sparse tables directly into a sparse data frame, but it is even slower. I even tried numpy.genfromtxt(), also to no avail: it can't handle ragged CSVs at all.

Closing this PR
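
For anyone picking this up later, here is a minimal sketch of one way to get around the ragged-CSV problem: parse the rows with the standard-library `csv` module and pad them to a common width before handing them to anything that expects a rectangular table. The function name `read_ragged_csv`, the `fill` parameter, and the sample data are made up for illustration; the real file's layout may differ, and this has not been timed against the scripts in this PR.

```python
import csv
import io


def read_ragged_csv(stream, fill=None):
    """Read CSV rows of differing lengths and pad them to a rectangle.

    Hypothetical helper: `stream` is any iterable of text lines and
    `fill` is the placeholder used for missing trailing cells.
    """
    # Skip completely empty lines; csv.reader yields [] for them.
    rows = [row for row in csv.reader(stream) if row]
    # The widest row determines the number of columns.
    width = max((len(row) for row in rows), default=0)
    # Pad every shorter row on the right with the fill value.
    return [row + [fill] * (width - len(row)) for row in rows]


# Made-up sample data, not taken from the actual test file.
sample = io.StringIO("a,b,c\n1\n2,3\n")
table = read_ragged_csv(sample)
```

The padded list of lists is rectangular, so it can be passed to `pandas.DataFrame(table)`. All values stay strings at this point, so any numeric conversion would still have to happen afterwards.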