Closed — markqiu closed this issue 4 years ago
hey @markqiu I can see the code spends >95% of its time in the Arrow table -> pandas DataFrame conversion. Unfortunately this can't really be fixed in Flight. Perhaps there is some improvement to be had in the 12 seconds it took Flight to move the data, but that would depend on how many rows are in your dataset.
@rymurr Thank you for the reply. My dataset is [6375362 rows x 16 columns]
Hey @markqiu that doesn't seem that large. What does the timing look like if you don't run print(df.to_pandas())
in your above test? ~12 seconds for 6.3M rows x 16 columns is maybe a bit slow, however I think the bottleneck is really in the Arrow -> pandas conversion (a known issue in pandas).
https://wesmckinney.com/blog/high-perf-arrow-to-pandas/ It's really not as fast as described in the article. Any hints on how to solve it?
Do you have a lot of string objects in your dataset? Wes' example was with 1 Billion doubles, strings will be significantly slower.
Yes, I do have some character fields.
I expect that to be the issue. I ran Wes' benchmark from above using random strings and it was significantly slower.
Description
I tried the example and found the speed is slow, and I don't know why.
The test code:
The following is the profile result:
Please help me, thank you!