apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.46k stars 3.52k forks source link

Compatibility with xgboost #17404

Closed asfimport closed 7 years ago

asfimport commented 7 years ago

Traditionally I work with CSV's and really suffer with slow read/write times. Parquet and the Arrow project obviously give us huge speedups.

One thing I've noticed, however, is that there is a serious bottleneck when converting a DataFrame read in through pyarrow to a DMatrix used by xgboost. For example, I'm building a model with about 180k rows and 6k float64 columns. Reading into a pandas DataFrame takes about 20 seconds on my machine. However, converting that DataFrame to a DMatrix takes well over 10 minutes.

Interestingly, it takes about 10 minutes to read that same data from a CSV into a pandas DataFrame. Then, it takes less than a minute to convert to a DMatrix.

I'm sure there's a good technical explanation for why this happens (e.g. row vs column storage). Still, I imagine this use case may occur to many and it would be great to improve these times, if possible.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import xgboost as xgb
# Reading from parquet:
table = pq.read_table('/path/to/parquet/files')  # 20 seconds
variables = table.to_pandas()  # 1 second
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # takes 10-15 minutes
# Reading from CSV:
variables = pd.read_csv('/path/to/file.csv', ...)  # takes about 10 minutes
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # less than 1 minute

Reporter: Steven Anton

Note: This issue was originally created as ARROW-1374. Please see the migration documentation for further details.

asfimport commented 7 years ago

Wes McKinney / @wesm: Can you provide any profile output from cProfile to give some insight? I note that you are using the drop method – this is very inefficient because it produces a copy of the data. You could make this much faster by dropping the column from the Arrow table before calling to_pandas. The API for this is not very convenient at the moment but it could be made so cc @jreback @cpcloud

asfimport commented 7 years ago

Steven Anton: Thanks! Ok, I got the conversion time down to 52 seconds by following your suggestions. It sounds like there aren't really any changes needed. I had just misunderstood what was going on under the hood. Maybe some documentation or a "gotchas" section?

Here's what I did differently:

schema = pq.read_schema('/path/to/file.parq')
columns = [x for x in schema.names if x not in ['tag', ...]]
variables = pq.read_table(str(parquet_path), columns=columns).to_pandas()
# It looks like I could have used the .remove_column method instead of the above
tag = pq.read_table(str(parquet_path), columns=['is_fraud']).to_pandas()
cProfile.run('xgb.DMatrix(variables, label=tag)')

And the output of cProfile is below, just for reference.

         145353 function calls (145344 primitive calls) in 51.970 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
        1    0.030    0.030   51.970   51.970 <string>:1(<module>)
        4    0.000    0.000    0.002    0.000 __init__.py:357(__getattr__)
        4    0.002    0.000    0.002    0.000 __init__.py:364(__getitem__)
        1    0.000    0.000    0.000    0.000 __init__.py:483(cast)
        1    0.000    0.000    0.000    0.000 _internal.py:225(__init__)
        1    0.000    0.000    0.000    0.000 _internal.py:237(data_as)
        1    0.000    0.000    0.000    0.000 _methods.py:37(_any)
        2    0.000    0.000    0.000    0.000 algorithms.py:1342(_get_take_nd_function)
        2    0.000    0.000    0.001    0.000 algorithms.py:1375(take_nd)
        2    0.000    0.000    0.000    0.000 base.py:1578(is_all_dates)
        1    0.000    0.000    0.114    0.114 base.py:1884(format)
        1    0.000    0.000    0.114    0.114 base.py:1899(_format_with_header)
        1    0.010    0.010    0.113    0.113 base.py:1910(<listcomp>)
        2    0.000    0.000    0.000    0.000 base.py:3999(_ensure_index)
        7    0.000    0.000    0.000    0.000 base.py:528(__len__)
        3    0.000    0.000    0.000    0.000 base.py:559(values)
        2    0.000    0.000    0.000    0.000 cast.py:759(maybe_castable)
        2    0.000    0.000    0.000    0.000 cast.py:868(maybe_cast_to_datetime)
        2    0.000    0.000    0.000    0.000 common.py:117(is_sparse)
        1    0.000    0.000    0.000    0.000 common.py:1419(is_string_like_dtype)
        2    0.000    0.000    0.000    0.000 common.py:1456(is_float_dtype)
        2    0.000    0.000    0.000    0.000 common.py:1549(is_extension_type)
        2    0.000    0.000    0.000    0.000 common.py:1673(_get_dtype)
       16    0.000    0.000    0.000    0.000 common.py:1722(_get_dtype_type)
        4    0.000    0.000    0.000    0.000 common.py:1852(pandas_dtype)
        6    0.000    0.000    0.000    0.000 common.py:190(is_categorical)
        9    0.000    0.000    0.000    0.000 common.py:222(is_datetimetz)
        7    0.000    0.000    0.000    0.000 common.py:296(is_datetime64_dtype)
       14    0.000    0.000    0.000    0.000 common.py:333(is_datetime64tz_dtype)
        5    0.000    0.000    0.000    0.000 common.py:371(is_timedelta64_dtype)
        1    0.000    0.000    0.000    0.000 common.py:406(is_period_dtype)
        3    0.000    0.000    0.000    0.000 common.py:439(is_interval_dtype)
        8    0.000    0.000    0.000    0.000 common.py:475(is_categorical_dtype)
        1    0.000    0.000    0.000    0.000 common.py:508(is_string_dtype)
        3    0.000    0.000    0.000    0.000 common.py:609(is_datetimelike)
        4    0.000    0.000    0.000    0.000 common.py:84(is_object_dtype)
        5    0.000    0.000    0.000    0.000 core.py:115(_check_call)
        1    0.000    0.000    0.000    0.000 core.py:152(c_str)
        1    0.150    0.150    0.150    0.150 core.py:157(c_array)
        1    0.000    0.000    6.442    6.442 core.py:168(_maybe_pandas_data)
     6026    0.008    0.000    0.008    0.000 core.py:175(<genexpr>)
        1    0.003    0.003    0.003    0.003 core.py:187(<listcomp>)
        1    0.000    0.000    0.001    0.001 core.py:194(_maybe_pandas_label)
        2    0.000    0.000    0.000    0.000 core.py:202(<genexpr>)
        1    0.000    0.000   51.890   51.890 core.py:222(__init__)
        1   27.752   27.752   45.272   45.272 core.py:309(_init_from_npy2d)
        1    0.049    0.049    0.050    0.050 core.py:323(__del__)
        1    0.001    0.001    0.152    0.152 core.py:368(set_float_info)
        1    0.000    0.000    0.152    0.152 core.py:414(set_label)
        2    0.000    0.000    0.001    0.000 core.py:501(num_col)
        1    0.004    0.004    0.021    0.021 core.py:557(feature_names)
     6026    0.007    0.000    0.015    0.000 core.py:576(<genexpr>)
    24100    0.004    0.000    0.004    0.000 core.py:577(<genexpr>)
        1    0.000    0.000    0.003    0.003 core.py:585(feature_types)
     6026    0.002    0.000    0.002    0.000 core.py:614(<genexpr>)
        1    0.000    0.000    0.000    0.000 dtypes.py:367(is_dtype)
        3    0.000    0.000    0.000    0.000 dtypes.py:489(is_dtype)
       26    0.000    0.000    0.000    0.000 dtypes.py:84(is_dtype)
        2    0.000    0.000    0.000    0.000 generic.py:117(__init__)
        2    0.000    0.000    0.000    0.000 generic.py:145(_validate_dtype)
        2    0.000    0.000    0.000    0.000 generic.py:3067(__getattr__)
        4    0.000    0.000    0.000    0.000 generic.py:3083(__setattr__)
        2    0.000    0.000    0.000    0.000 generic.py:3122(_protect_consolidate)
        2    0.000    0.000    0.000    0.000 generic.py:3132(_consolidate_inplace)
        2    0.000    0.000    0.000    0.000 generic.py:3135(f)
        2    0.000    0.000    0.000    0.000 generic.py:3214(as_matrix)
        2    0.000    0.000    0.000    0.000 generic.py:3256(values)
        2    0.000    0.000    0.002    0.001 generic.py:3298(dtypes)
        2    0.000    0.000    0.000    0.000 generic.py:416(_info_axis)
       25    0.000    0.000    0.000    0.000 generic.py:7(_check)
     6025    0.009    0.000    0.014    0.000 inference.py:396(is_sequence)
        2    0.000    0.000    0.000    0.000 internals.py:102(__init__)
        3    0.000    0.000    0.000    0.000 internals.py:154(internal_values)
        2    0.000    0.000    0.000    0.000 internals.py:160(get_values)
        2    0.000    0.000    0.000    0.000 internals.py:1838(__init__)
        6    0.000    0.000    0.000    0.000 internals.py:185(mgr_locs)
        2    0.000    0.000    0.000    0.000 internals.py:222(mgr_locs)
        2    0.000    0.000    0.000    0.000 internals.py:2683(make_block)
        2    0.000    0.000    0.000    0.000 internals.py:2824(ndim)
        2    0.000    0.000    0.000    0.000 internals.py:2864(_is_single_block)
        2    0.000    0.000    0.000    0.000 internals.py:2897(_get_items)
        2    0.000    0.000    0.001    0.000 internals.py:2917(get_dtypes)
        2    0.000    0.000    0.000    0.000 internals.py:2918(<listcomp>)
        2    0.000    0.000    0.000    0.000 internals.py:2986(__len__)
       20    0.000    0.000    0.000    0.000 internals.py:303(dtype)
        2    0.000    0.000    0.000    0.000 internals.py:3296(is_consolidated)
        2    0.000    0.000    0.000    0.000 internals.py:3438(as_matrix)
        2    0.000    0.000    0.000    0.000 internals.py:3560(consolidate)
        2    0.000    0.000    0.000    0.000 internals.py:4078(__init__)
       21    0.000    0.000    0.000    0.000 internals.py:4124(_block)
       18    0.000    0.000    0.000    0.000 internals.py:4194(dtype)
        3    0.000    0.000    0.000    0.000 internals.py:4221(internal_values)
        1    0.000    0.000    0.000    0.000 missing.py:119(_isnull_ndarraylike)
        1    0.000    0.000    0.000    0.000 missing.py:26(isnull)
        1    0.000    0.000    0.000    0.000 missing.py:47(_isnull_new)
     6025    0.029    0.000    0.104    0.000 printing.py:157(pprint_thing)
     6025    0.027    0.000    0.038    0.000 printing.py:186(as_escaped_unicode)
        3    0.000    0.000    0.000    0.000 series.py:1049(__iter__)
        2    0.000    0.000    0.001    0.000 series.py:139(__init__)
        2    0.000    0.000    0.000    0.000 series.py:284(_set_axis)
        2    0.000    0.000    0.000    0.000 series.py:2894(_sanitize_array)
        2    0.000    0.000    0.000    0.000 series.py:2911(_try_cast)
        2    0.000    0.000    0.000    0.000 series.py:310(_set_subtyp)
        2    0.000    0.000    0.000    0.000 series.py:320(name)
        2    0.000    0.000    0.000    0.000 series.py:324(name)
       18    0.000    0.000    0.000    0.000 series.py:331(dtype)
        3    0.000    0.000    0.000    0.000 series.py:384(_values)
        1    0.000    0.000    0.000    0.000 {built-in method _ctypes.POINTER}
        3    0.000    0.000    0.000    0.000 {built-in method _ctypes.byref}
        4    0.003    0.001    0.028    0.007 {built-in method builtins.all}
     6025    0.003    0.000    0.008    0.000 {built-in method builtins.any}
        1    0.000    0.000   51.970   51.970 {built-in method builtins.exec}
       30    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
    12085    0.018    0.000    0.018    0.000 {built-in method builtins.hasattr}
    36332    0.010    0.000    0.010    0.000 {built-in method builtins.isinstance}
       30    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
     6028    0.002    0.000    0.002    0.000 {built-in method builtins.iter}
6066/6057    0.001    0.000    0.001    0.000 {built-in method builtins.len}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
        7    2.229    0.318    2.229    0.318 {built-in method numpy.core.multiarray.array}
        3    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.empty}
        2    0.000    0.000    0.000    0.000 {built-in method pandas._libs.algos.ensure_int64}
        2    0.000    0.000    0.000    0.000 {built-in method pandas._libs.algos.ensure_object}
        2    0.000    0.000    0.000    0.000 {built-in method pandas._libs.lib.is_datetime_array}
        1    0.000    0.000    0.000    0.000 {built-in method pandas._libs.lib.isscalar}
        1    0.000    0.000    0.000    0.000 {method 'any' of 'numpy.ndarray' objects}
        2    6.314    3.157    6.314    3.157 {method 'astype' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'encode' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    18075    0.009    0.000    0.009    0.000 {method 'replace' of 'str' objects}
        2   15.291    7.645   15.291    7.645 {method 'reshape' of 'numpy.ndarray' objects}
        4    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        3    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' objects}
        2    0.000    0.000    0.000    0.000 {pandas._libs.algos.take_1d_object_object}
        1    0.000    0.000    0.000    0.000 {pandas._libs.lib.isnullobj}
        1    0.000    0.000    0.000    0.000 {pandas._libs.lib.maybe_convert_objects}
asfimport commented 7 years ago

Wes McKinney / @wesm: I opened ARROW-1388 about adding a drop convenience method so you don't have to call read_table twice. Documentation about this kind of stuff would be helpful for new users because Arrow's semantics re: copying (as in, you have to be quite explicit about when you copy data or allocate memory)

asfimport commented 7 years ago

Steven Anton: Thanks Wes – really appreciate your support on this awesome project.

Regarding the latter, I wonder if a quick feature to add could be a log message whenever a copy is made (kind of like how pandas warns when trying to assign values to a slice).

asfimport commented 7 years ago

Wes McKinney / @wesm: Maybe the closest thing we've discussed to this would be logging output when memory is allocated to give better transparency, I opened ARROW-1405. I recall discussing this in the past but I couldn't find an older JIRA about it so I opened a new one