asfimport closed this issue 7 years ago.
Wes McKinney / @wesm:
Can you provide any profile output from cProfile to give some insight? I note that you are using the drop method; this is very inefficient because it produces a copy of the data. You could make this much faster by dropping the column from the Arrow table before calling to_pandas. The API for this is not very convenient at the moment, but it could be made so. cc @jreback @cpcloud
Steven Anton: Thanks! OK, I got the conversion time down to 52 seconds by following your suggestions. It sounds like no changes are really needed; I had just misunderstood what was going on under the hood. Maybe some documentation or a "gotchas" section would help?
Here's what I did differently:
import cProfile
import pyarrow.parquet as pq
import xgboost as xgb

schema = pq.read_schema('/path/to/file.parq')
columns = [x for x in schema.names if x not in ['tag', ...]]
variables = pq.read_table(str(parquet_path), columns=columns).to_pandas()
# It looks like I could have used the .remove_column method instead of the above
tag = pq.read_table(str(parquet_path), columns=['is_fraud']).to_pandas()
cProfile.run('xgb.DMatrix(variables, label=tag)')
And the output of cProfile is below, just for reference.
145353 function calls (145344 primitive calls) in 51.970 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
7 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
1 0.030 0.030 51.970 51.970 <string>:1(<module>)
4 0.000 0.000 0.002 0.000 __init__.py:357(__getattr__)
4 0.002 0.000 0.002 0.000 __init__.py:364(__getitem__)
1 0.000 0.000 0.000 0.000 __init__.py:483(cast)
1 0.000 0.000 0.000 0.000 _internal.py:225(__init__)
1 0.000 0.000 0.000 0.000 _internal.py:237(data_as)
1 0.000 0.000 0.000 0.000 _methods.py:37(_any)
2 0.000 0.000 0.000 0.000 algorithms.py:1342(_get_take_nd_function)
2 0.000 0.000 0.001 0.000 algorithms.py:1375(take_nd)
2 0.000 0.000 0.000 0.000 base.py:1578(is_all_dates)
1 0.000 0.000 0.114 0.114 base.py:1884(format)
1 0.000 0.000 0.114 0.114 base.py:1899(_format_with_header)
1 0.010 0.010 0.113 0.113 base.py:1910(<listcomp>)
2 0.000 0.000 0.000 0.000 base.py:3999(_ensure_index)
7 0.000 0.000 0.000 0.000 base.py:528(__len__)
3 0.000 0.000 0.000 0.000 base.py:559(values)
2 0.000 0.000 0.000 0.000 cast.py:759(maybe_castable)
2 0.000 0.000 0.000 0.000 cast.py:868(maybe_cast_to_datetime)
2 0.000 0.000 0.000 0.000 common.py:117(is_sparse)
1 0.000 0.000 0.000 0.000 common.py:1419(is_string_like_dtype)
2 0.000 0.000 0.000 0.000 common.py:1456(is_float_dtype)
2 0.000 0.000 0.000 0.000 common.py:1549(is_extension_type)
2 0.000 0.000 0.000 0.000 common.py:1673(_get_dtype)
16 0.000 0.000 0.000 0.000 common.py:1722(_get_dtype_type)
4 0.000 0.000 0.000 0.000 common.py:1852(pandas_dtype)
6 0.000 0.000 0.000 0.000 common.py:190(is_categorical)
9 0.000 0.000 0.000 0.000 common.py:222(is_datetimetz)
7 0.000 0.000 0.000 0.000 common.py:296(is_datetime64_dtype)
14 0.000 0.000 0.000 0.000 common.py:333(is_datetime64tz_dtype)
5 0.000 0.000 0.000 0.000 common.py:371(is_timedelta64_dtype)
1 0.000 0.000 0.000 0.000 common.py:406(is_period_dtype)
3 0.000 0.000 0.000 0.000 common.py:439(is_interval_dtype)
8 0.000 0.000 0.000 0.000 common.py:475(is_categorical_dtype)
1 0.000 0.000 0.000 0.000 common.py:508(is_string_dtype)
3 0.000 0.000 0.000 0.000 common.py:609(is_datetimelike)
4 0.000 0.000 0.000 0.000 common.py:84(is_object_dtype)
5 0.000 0.000 0.000 0.000 core.py:115(_check_call)
1 0.000 0.000 0.000 0.000 core.py:152(c_str)
1 0.150 0.150 0.150 0.150 core.py:157(c_array)
1 0.000 0.000 6.442 6.442 core.py:168(_maybe_pandas_data)
6026 0.008 0.000 0.008 0.000 core.py:175(<genexpr>)
1 0.003 0.003 0.003 0.003 core.py:187(<listcomp>)
1 0.000 0.000 0.001 0.001 core.py:194(_maybe_pandas_label)
2 0.000 0.000 0.000 0.000 core.py:202(<genexpr>)
1 0.000 0.000 51.890 51.890 core.py:222(__init__)
1 27.752 27.752 45.272 45.272 core.py:309(_init_from_npy2d)
1 0.049 0.049 0.050 0.050 core.py:323(__del__)
1 0.001 0.001 0.152 0.152 core.py:368(set_float_info)
1 0.000 0.000 0.152 0.152 core.py:414(set_label)
2 0.000 0.000 0.001 0.000 core.py:501(num_col)
1 0.004 0.004 0.021 0.021 core.py:557(feature_names)
6026 0.007 0.000 0.015 0.000 core.py:576(<genexpr>)
24100 0.004 0.000 0.004 0.000 core.py:577(<genexpr>)
1 0.000 0.000 0.003 0.003 core.py:585(feature_types)
6026 0.002 0.000 0.002 0.000 core.py:614(<genexpr>)
1 0.000 0.000 0.000 0.000 dtypes.py:367(is_dtype)
3 0.000 0.000 0.000 0.000 dtypes.py:489(is_dtype)
26 0.000 0.000 0.000 0.000 dtypes.py:84(is_dtype)
2 0.000 0.000 0.000 0.000 generic.py:117(__init__)
2 0.000 0.000 0.000 0.000 generic.py:145(_validate_dtype)
2 0.000 0.000 0.000 0.000 generic.py:3067(__getattr__)
4 0.000 0.000 0.000 0.000 generic.py:3083(__setattr__)
2 0.000 0.000 0.000 0.000 generic.py:3122(_protect_consolidate)
2 0.000 0.000 0.000 0.000 generic.py:3132(_consolidate_inplace)
2 0.000 0.000 0.000 0.000 generic.py:3135(f)
2 0.000 0.000 0.000 0.000 generic.py:3214(as_matrix)
2 0.000 0.000 0.000 0.000 generic.py:3256(values)
2 0.000 0.000 0.002 0.001 generic.py:3298(dtypes)
2 0.000 0.000 0.000 0.000 generic.py:416(_info_axis)
25 0.000 0.000 0.000 0.000 generic.py:7(_check)
6025 0.009 0.000 0.014 0.000 inference.py:396(is_sequence)
2 0.000 0.000 0.000 0.000 internals.py:102(__init__)
3 0.000 0.000 0.000 0.000 internals.py:154(internal_values)
2 0.000 0.000 0.000 0.000 internals.py:160(get_values)
2 0.000 0.000 0.000 0.000 internals.py:1838(__init__)
6 0.000 0.000 0.000 0.000 internals.py:185(mgr_locs)
2 0.000 0.000 0.000 0.000 internals.py:222(mgr_locs)
2 0.000 0.000 0.000 0.000 internals.py:2683(make_block)
2 0.000 0.000 0.000 0.000 internals.py:2824(ndim)
2 0.000 0.000 0.000 0.000 internals.py:2864(_is_single_block)
2 0.000 0.000 0.000 0.000 internals.py:2897(_get_items)
2 0.000 0.000 0.001 0.000 internals.py:2917(get_dtypes)
2 0.000 0.000 0.000 0.000 internals.py:2918(<listcomp>)
2 0.000 0.000 0.000 0.000 internals.py:2986(__len__)
20 0.000 0.000 0.000 0.000 internals.py:303(dtype)
2 0.000 0.000 0.000 0.000 internals.py:3296(is_consolidated)
2 0.000 0.000 0.000 0.000 internals.py:3438(as_matrix)
2 0.000 0.000 0.000 0.000 internals.py:3560(consolidate)
2 0.000 0.000 0.000 0.000 internals.py:4078(__init__)
21 0.000 0.000 0.000 0.000 internals.py:4124(_block)
18 0.000 0.000 0.000 0.000 internals.py:4194(dtype)
3 0.000 0.000 0.000 0.000 internals.py:4221(internal_values)
1 0.000 0.000 0.000 0.000 missing.py:119(_isnull_ndarraylike)
1 0.000 0.000 0.000 0.000 missing.py:26(isnull)
1 0.000 0.000 0.000 0.000 missing.py:47(_isnull_new)
6025 0.029 0.000 0.104 0.000 printing.py:157(pprint_thing)
6025 0.027 0.000 0.038 0.000 printing.py:186(as_escaped_unicode)
3 0.000 0.000 0.000 0.000 series.py:1049(__iter__)
2 0.000 0.000 0.001 0.000 series.py:139(__init__)
2 0.000 0.000 0.000 0.000 series.py:284(_set_axis)
2 0.000 0.000 0.000 0.000 series.py:2894(_sanitize_array)
2 0.000 0.000 0.000 0.000 series.py:2911(_try_cast)
2 0.000 0.000 0.000 0.000 series.py:310(_set_subtyp)
2 0.000 0.000 0.000 0.000 series.py:320(name)
2 0.000 0.000 0.000 0.000 series.py:324(name)
18 0.000 0.000 0.000 0.000 series.py:331(dtype)
3 0.000 0.000 0.000 0.000 series.py:384(_values)
1 0.000 0.000 0.000 0.000 {built-in method _ctypes.POINTER}
3 0.000 0.000 0.000 0.000 {built-in method _ctypes.byref}
4 0.003 0.001 0.028 0.007 {built-in method builtins.all}
6025 0.003 0.000 0.008 0.000 {built-in method builtins.any}
1 0.000 0.000 51.970 51.970 {built-in method builtins.exec}
30 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
12085 0.018 0.000 0.018 0.000 {built-in method builtins.hasattr}
36332 0.010 0.000 0.010 0.000 {built-in method builtins.isinstance}
30 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
6028 0.002 0.000 0.002 0.000 {built-in method builtins.iter}
6066/6057 0.001 0.000 0.001 0.000 {built-in method builtins.len}
4 0.000 0.000 0.000 0.000 {built-in method builtins.setattr}
7 2.229 0.318 2.229 0.318 {built-in method numpy.core.multiarray.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}
2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int64}
2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_object}
2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_datetime_array}
1 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.isscalar}
1 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}
2 6.314 3.157 6.314 3.157 {method 'astype' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
18075 0.009 0.000 0.009 0.000 {method 'replace' of 'str' objects}
2 15.291 7.645 15.291 7.645 {method 'reshape' of 'numpy.ndarray' objects}
4 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
3 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 {pandas._libs.algos.take_1d_object_object}
1 0.000 0.000 0.000 0.000 {pandas._libs.lib.isnullobj}
1 0.000 0.000 0.000 0.000 {pandas._libs.lib.maybe_convert_objects}
Wes McKinney / @wesm: I opened ARROW-1388 about adding a drop convenience method so you don't have to call read_table twice. Documentation about this kind of thing would be helpful for new users, because Arrow's semantics around copying are different: you have to be quite explicit about when you copy data or allocate memory.
Steven Anton: Thanks Wes – really appreciate your support on this awesome project.
Regarding the latter, I wonder if a quick feature to add could be a log message whenever a copy is made (kind of like how pandas warns when trying to assign values to a slice).
Wes McKinney / @wesm: Maybe the closest thing we've discussed to this would be logging output when memory is allocated, to give better transparency; I opened ARROW-1405. I recall discussing this in the past, but I couldn't find an older JIRA about it, so I opened a new one.
Traditionally I work with CSVs and really suffer from slow read/write times. Parquet and the Arrow project obviously give us huge speedups.
One thing I've noticed, however, is that there is a serious bottleneck when converting a DataFrame read in through pyarrow to a DMatrix used by xgboost. For example, I'm building a model with about 180k rows and 6k float64 columns. Reading into a pandas DataFrame takes about 20 seconds on my machine. However, converting that DataFrame to a DMatrix takes well over 10 minutes.
Interestingly, it takes about 10 minutes to read that same data from a CSV into a pandas DataFrame. Then, it takes less than a minute to convert to a DMatrix.
I'm sure there's a good technical explanation for why this happens (e.g. row vs. column storage). Still, I imagine this use case is common for many users, and it would be great to improve these times, if possible.
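The cProfile output above hints at where the time goes: _init_from_npy2d, plus the ndarray reshape and astype calls, account for most of the 52 seconds. A rough, hypothetical illustration of that cost (not xgboost's actual code) is that a column-major array must be cast to float32 and flattened in row order, each step forcing a full copy:

```python
import numpy as np

# Column-major array, similar in layout to pandas' internal blocks
# (the shape is illustrative, not the 180k x 6k from the report).
data = np.asfortranarray(np.random.rand(1000, 50))

# Casting to float32 copies; flattening an F-ordered array in C order
# (the layout a row-oriented DMatrix needs) copies again.
flat = data.astype(np.float32).reshape(data.size)
```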
Reporter: Steven Anton
Note: This issue was originally created as ARROW-1374. Please see the migration documentation for further details.