dremio-hub / dremio-flight-connector

Dremio Flight connector. Access Dremio using Arrow flight

why is the speed very slow? #11

Closed markqiu closed 4 years ago

markqiu commented 4 years ago

Description

I tried the example and found that it is slow, and I don't know why.

The test code:

from pyarrow import flight
import pyarrow as pa

class HttpDremioClientAuthHandler(flight.ClientAuthHandler):

    def __init__(self, username, password):
        super().__init__()
        self.basic_auth = flight.BasicAuth(username, password)
        self.token = None

    def authenticate(self, outgoing, incoming):
        auth = self.basic_auth.serialize()
        outgoing.write(auth)
        self.token = incoming.read()

    def get_token(self):
        return self.token

username = 'xx'
password = 'xxxx'
sql = '''SELECT * FROM "@admin".adj'''

@profile
def run():
    client = flight.connect("grpc://123.103.74.232:47470")
    client.authenticate(HttpDremioClientAuthHandler(username, password))
    fd = flight.FlightDescriptor.for_command(sql)
    fi = client.get_flight_info(fd)
    ticket = fi.endpoints[0].ticket
    df = client.do_get(ticket).read_all()
    print(df.to_pandas())

run()

The following is the profile result:

# python -m line_profiler flight.py.lprof
Timer unit: 1e-06 s
[6375362 rows x 16 columns]
Total time: 270.123 s
File: flight.py
Function: run at line 23

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    23                                           @profile
    24                                           def run():
    25         1        651.0    651.0      0.0      client = flight.connect("grpc://123.103.74.232:47470")
    26         1     193400.0 193400.0      0.1      client.authenticate(HttpDremioClientAuthHandler(username, password))
    27         1         11.0     11.0      0.0      fd = flight.FlightDescriptor.for_command(sql)
    28         1     304516.0 304516.0      0.1      fi = client.get_flight_info(fd)
    29         1         16.0     16.0      0.0      ticket = fi.endpoints[0].ticket
    30         1   12358970.0 12358970.0      4.6      df = client.do_get(ticket).read_all()
    31         1  257265552.0 257265552.0     95.2      print(df.to_pandas())

Please help me, thank you!

rymurr commented 4 years ago

Hey @markqiu, I can see the code spent more than 95% of its time in the Arrow table -> pandas DataFrame conversion. Unfortunately, that can't really be fixed in Flight. There may be some room for improvement in the 12 seconds it took Flight to move the data, but that depends on how many rows are in your dataset.

markqiu commented 4 years ago

@rymurr Thank you for the reply. My dataset is [6375362 rows x 16 columns]

rymurr commented 4 years ago

Hey @markqiu, that doesn't seem that large. What does the timing look like if you don't run print(df.to_pandas()) in your test above? ~12 seconds for 6.3M rows x 16 columns is maybe a bit slow; however, I think the bottleneck is really in the Arrow -> pandas conversion (a known issue in pandas).

markqiu commented 4 years ago

https://wesmckinney.com/blog/high-perf-arrow-to-pandas/ It's really not as fast as described in the article. Are there any hints for solving it?

rymurr commented 4 years ago

Do you have a lot of string objects in your dataset? Wes' example used 1 billion doubles; strings will be significantly slower.

markqiu commented 4 years ago

Yes, I do have some character fields.

rymurr commented 4 years ago

I expect that to be the issue. I ran Wes' benchmark from above using random strings and it was significantly slower.