Roche / pyreadstat

Python package to read sas, spss and stata files into pandas data frames. It is a wrapper for the C library readstat.

Feature: Update binary read to Polars #184

Closed jccguma closed 2 years ago

jccguma commented 2 years ago

It would be interesting to be able to read spss files into Polars. Pyreadstat provides support for converting to Pandas. Could this be done with Polars as well? Polars is a great package and would help a lot with speed.

https://github.com/pola-rs/polars

Thanks in advance.

ofajardo commented 2 years ago

yes, that's an interesting proposal. I don't have an answer yet as to whether it is possible; I would need to get familiar with the polars API. What I have seen so far is that it is written in Rust, which may mean it would need a completely different kind of interface, but I need to learn more about it; maybe there is a high-level interface I can use without having to wrap any Rust. Since the whole library is written explicitly to transform to a pandas dataframe, I would also have to see how much change is needed to accommodate a conversion to polars.

Another thing is that pyreadstat wraps the C library Readstat. You could think of writing a wrapper that works with polars directly, but the only thing that would buy you is avoiding the conversion from a pandas dataframe to a polars dataframe; since all the base code is in C anyway, I don't think it would bring any major speed improvement on the conversion itself.

So, in conclusion I will look at it slowly when I get a bit of time, but for the time being the solution is to convert the pandas dataframe you obtain from pyreadstat to a polars dataframe yourself.
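
For reference, a minimal sketch of that interim workaround, assuming a local .sav file path (pl.from_pandas is polars' pandas conversion helper and needs pyarrow installed):

import pyreadstat
import polars as pl

# read the spss file with pyreadstat; df is a pandas dataframe
df, meta = pyreadstat.read_sav("/path/to/a/file.sav")

# convert the pandas dataframe to a polars dataframe
pldf = pl.from_pandas(df)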

jccguma commented 2 years ago

Thanks @ofajardo for your incredibly fast response. The point of avoiding Pandas is to prevent slow performance when reading huge files. Native reading with Polars is amazingly fast; Polars is even faster than Spark. I will be working with huge spss files soon, which is the reason for asking. I already requested support from the Polars team, but they suggested I contact the Pyreadstat team, since your team already solved this for Pandas. Thanks again for your help.

ofajardo commented 2 years ago

Right; however, as I mentioned before, you would not get any major performance gain.

The reason is the following. Reading a file with pyreadstat has three steps:

1. The C library Readstat parses the file into C structures.
2. The wrapper transforms those C structures into numpy arrays, collected in a python dictionary.
3. That dictionary of numpy arrays is converted into a pandas dataframe.

A polars output would change only the last step, so you see why it will not bring too much? It has nothing to do with whatever process polars uses to read parquet/CSV or whatever else, which probably is written in Rust; here we would be using everything as is except the last step.

Would that help? Probably not too much, right? Because I am now seeing that you can call polars.DataFrame directly, giving a pandas dataframe as argument ... so I would say that is the solution right now.

Now, one could think that the second point, where I am transforming C structures to numpy arrays, could maybe be replaced directly with polars structures (arrow arrays?). Maybe. But that may need a completely different wrapper that talks to Rust (or maybe the wrapper would have to be rewritten in C++ instead of C so that it can talk to Arrow). I will take a look, but it would probably be a completely different library and not pyreadstat anymore, which means a lot of effort and therefore not going to happen anytime soon. It is also not clear what the speed gain would be. It also has the disadvantage that the process would be less flexible: when I wrote pyreadstat I was already thinking that maybe later another library replaces pandas, in which case the only point I have to change is point 3. But a wrapper built on polars structures would be very specific and could not be adapted to accommodate other outputs. For that reason I would not be in favor of doing that.

If you know somebody in polars who can give more input or insight, you can invite her or him to participate in this conversation.

jccguma commented 2 years ago

Thanks a million for your kind help @ofajardo. I really think that avoiding Pandas could make a real difference, but I am not an expert. Following your recommendation I am adding @ritchie46 to this conversation. He suggested I get in contact with the Pyreadstat team to discuss the possibility of reading spss files into Polars. Please, @ritchie46, could you help us with this?

Thanks to you two in advance!

ofajardo commented 2 years ago

thanks @jccguma, let's see what the polars experts suggest.

Just to temper your expectations a bit: I just tried transforming a python dictionary of numpy arrays into either a pandas or a polars dataframe (point 3 in my process) and I don't see any performance gain. Therefore, I don't see the benefit of implementing that right now.

Regarding how point 2 could be changed to improve things, I am open to suggestions, but as it would be a major change, and it would remove flexibility from the library, I don't see that coming anytime soon. But I also don't discard it for later, if the usage of Arrow/polars grows and it becomes the lingua franca of dataframes in python, as (unfortunately?) pandas is at the moment.

ritchie46 commented 2 years ago

It could be that you get faster pandas interop, and on top of that support for any tool in the arrow landscape.

Pandas has to consolidate the numpy arrays into its blockmanager, which is expensive. pyarrow is often able to write to pandas faster than pandas itself can. See https://ursalabs.org/blog/fast-pandas-loading/, written by the author of both pandas and Arrow.

Polars has zero copy interop with arrow, so that conversion would then be free. The arrow landscape is quite large, so it would include interop with a lot more tools.
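
To illustrate, a minimal sketch of that zero-copy path with made-up data (pl.from_arrow wraps an existing Arrow table; in practice the table would come from a reader):

import pyarrow as pa
import polars as pl

# build an arrow table; a stand-in for data produced by a reader
table = pa.table({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})

# zero-copy: the polars dataframe reuses the arrow buffers
df = pl.from_arrow(table)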

jccguma commented 2 years ago

Thanks again @ofajardo. I agree with you that transforming a python dictionary of numpy arrays to either pandas or polars has no real advantage on its own. However, loading a 10 GB file, or performing any calculations on such a file, can be either amazingly slow or simply impossible in pandas (depending on the resources), whereas polars flies and produces a solution in minutes. This was my reason, as a user, for asking for a polars implementation. I really appreciate your effort and kindness in helping me with this.

jccguma commented 2 years ago

Thanks @ritchie46. I do not have deep knowledge of how pandas or polars work under the hood, but I cannot deal with 10 GB csv files in pandas, whereas this is simply easy with polars. I am wondering if I will have the same issue when importing from spss. Does my suggestion make any sense? Just to avoid proposing something useless to the polars community.

ofajardo commented 2 years ago

hey @ritchie46, thanks! I was not aware of this report, which indeed seems promising. I quickly reproduced what it suggests and indeed pyarrow is faster. However, if I use more data, say 5000 columns instead of 100, the effect reverses, with pandas and polars being faster than pyarrow (25 s for pandas/polars vs 30 s for pyarrow). So I don't see much benefit there.

A different story is the interoperability, where I envision that in the future pyreadstat could offer arrow as output.

And yet another point is what @jccguma suggests, polars being more memory efficient than pandas, or being able to handle larger data, any comments there?

ofajardo commented 2 years ago

Another, simpler solution for now is that I could introduce a new parameter output_format, where for now you could specify "pandas" (the default) or "dict". With "dict" the functions would give you a dictionary of numpy arrays as output instead of a pandas dataframe, which you can then convert to whatever you want. The advantage is that the third step (conversion of the dict to a pandas dataframe) is skipped, so that time is saved, and hopefully it also helps with your memory issues.

I like this solution because for me it would be simpler, it is very flexible and it does not introduce any new dependency.

@jccguma what do you think?
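
To make the proposal concrete, a sketch of how that parameter could look from the user's side (hypothetical at this point in the thread; the released API appears further down):

import pyreadstat
import polars as pl

# output_format="dict" would skip the pandas conversion entirely
dicdata, meta = pyreadstat.read_sav("/path/to/a/file.sav", output_format="dict")

# a dict of numpy arrays can be consumed by polars directly
df = pl.DataFrame(dicdata)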

ofajardo commented 2 years ago

Another, different thought, without knowing anything about your workflow @jccguma: why don't you convert your files to some other format once, and from then onward just work with the converted format? Then you can do the conversion in a batch process, so you would not care too much about speed.

ritchie46 commented 2 years ago

Another, simpler solution for now is that I could introduce a new parameter output_format, where for now you could specify "pandas" (the default) or "dict". With "dict" the functions would give you a dictionary of numpy arrays as output instead of a pandas dataframe, which you can then convert to whatever you want. The advantage is that the third step (conversion of the dict to a pandas dataframe) is skipped, so that time is saved, and hopefully it also helps with your memory issues.

I like this solution because for me it would be simpler, it is very flexible and it does not introduce any new dependency.

@jccguma what do you think?

I think this would be great. That would remove the blockmanager consolidation that pandas has.

jccguma commented 2 years ago

Another, different thought, without knowing anything about your workflow @jccguma: why don't you convert your files to some other format once, and from then onward just work with the converted format? Then you can do the conversion in a batch process, so you would not care too much about speed.

@ofajardo, I'm not really sure I can do that in the current system. I would be interested in doing it directly, but I will also try your approach.

ofajardo commented 2 years ago

@ofajardo, I'm not really sure I can do that in the current system. I would be interested in doing it directly, but I will also try your approach.

OK, I was suggesting that because that is the way we do it over here, and as far as I know in other companies as well. Honestly, I don't see much advantage in keeping the data in a closed proprietary format (unless part of the group is using SPSS; but still, if the files do not change, you shouldn't use spss as permanent storage, better to use an open source format).

ofajardo commented 2 years ago

I think this would be great. That would remove the blockmanager consolidation that pandas has.

OK, if that resonates, that is also my preferred solution, so I will implement that.

jccguma commented 2 years ago

@ofajardo, you are right, but the problem is that the client wants to keep using it and we do not have many resources, so the idea is to refresh the data with a periodic load instead of replicating the database in another format via a batch process.

jccguma commented 2 years ago

I think this would be great. That would remove the blockmanager consolidation that pandas has.

OK, if that resonates, that is also my preferred solution, so I will implement that.

@ofajardo and @ritchie46, thanks for your kind support on this issue. I really appreciate it.

ofajardo commented 2 years ago

I have been doing a few tests. Returning a dict of numpy arrays from pyreadstat is relatively easy. Once you have the dict you can call polars.DataFrame on it directly and the result looks good, so I guess that would be the way to go. You can also handle much larger datasets compared to pandas (2x in my hands). The only bad thing is that it is a bit slow: as slow as calling pandas.DataFrame on that dict, except that, since you don't have to call pandas, you save that time and can invest it in converting the dict into polars.

Converting the dict into pyarrow and then converting that pyarrow table into a polars dataframe is much, much faster for a dataframe of only floats (27 sec vs 0.07 sec, wow!), but it is problematic in other respects. For example, feeding it a dict of numpy arrays of type object, where all elements are, say, strings or datetime.date etc. and one element is a np.nan (that is the way I am constructing my arrays right now, and changing that would be a breaking change, so I am reluctant to do it), raises an error saying it expected an object of type string (or integer if it is a date, etc.) and suddenly got a float (the np.nan). Interestingly, if you try to convert the numpy array of strings+np.nan into a pyarrow array, what you get are your strings plus a string "nan" instead of some kind of null value.

To be honest, I need to get more familiar with pyarrow. My impression is that it is a lower-level tool, so you have to be much more careful about what you pass into it, while polars is probably doing lots of checks on the data to transform it properly. At the moment that is what you need, and therefore the solution would be to use polars.DataFrame(pyreadstat_dict).

ritchie46 commented 2 years ago

This is already great! We can improve our from_dict performance a lot. Until now it was not really used for heavy data ingestion in polars, so we didn't give it much attention yet. We should be able to get approximately the same performance as pyarrow.

What kind of dict did you test it with? How many columns, and which column length?

ofajardo commented 2 years ago

For the speed tests I used 1M rows and 5K columns of type double; here is the code:

from time import time

import pandas as pd
import numpy as np
import polars as pl
import pyarrow as pa

num_rows = 1_000_000
num_columns = 5000

# reuse the same random array for every column to keep the setup cheap
arr = np.random.randn(num_rows)
mydic = {'f{}'.format(i): arr for i in range(num_columns)}

t0 = time()
df = pl.DataFrame(mydic)
print("polars", time() - t0)
# 28 seconds
del df

t0 = time()
df = pa.table(mydic)
print("arrow", time() - t0)
# 0.04 sec
del df

t0 = time()
ar = pa.table(mydic)
df = pl.DataFrame(ar)
print("arrow+polars", time() - t0)
# 0.08 sec
del df
del ar

t0 = time()
df = pd.DataFrame(mydic)
print("pandas", time() - t0)
# 29 sec

t0 = time()
df = pa.table(mydic).to_pandas()
print("arrow+pandas", time() - t0)
# 31 sec
del df

jccguma commented 2 years ago

@ofajardo that looks fantastic! Thanks for your help! @ritchie46 I think this is great for Polars.

ofajardo commented 2 years ago

Naive question: I am trying a few things with polars, and I am getting lots of not implemented errors, for example:

(Pdb) df.to_pandas()
thread '<unnamed>' panicked at 'not implemented', /github/workspace/polars/polars-core/src/chunked_array/object/mod.rs:113:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
*** pyo3_runtime.PanicException: not implemented
polars.testing.assert_frame_equal(df1, df2)
Traceback (most recent call last):
  File "/home/otto/git_repos/mypub/pyreadstat/tests/test_basic.py", line 1037, in test_sav_outputformat_polars
    polars.testing.assert_frame_equal(left=df, right=parrow)
  File "/home/otto/miniconda3/lib/python3.9/site-packages/polars/testing.py", line 101, in assert_frame_equal
    _assert_series_inner(
  File "/home/otto/miniconda3/lib/python3.9/site-packages/polars/testing.py", line 184, in _assert_series_inner
    if (left != right).sum() != 0:
  File "/home/otto/miniconda3/lib/python3.9/site-packages/polars/internals/series.py", line 332, in __ne__
    return self._comp(other, "neq")
  File "/home/otto/miniconda3/lib/python3.9/site-packages/polars/internals/series.py", line 321, in _comp
    return wrap_s(getattr(self._s, op)(other._s))
pyo3_runtime.PanicException: not implemented

Is it just that these methods are not implemented yet, or am I doing something wrong? (using polars 0.13.29)

ritchie46 commented 2 years ago

It seems that the DataFrame has an Object dtype. This is a last-resort data type for anything we don't know. Unlike in pandas, object types are not string types in polars.

An Object dtype is often unintended and unwanted behavior and indicates that we could not infer the proper dtype.

We use pyarrow to convert to pandas and this works for all data types except for Object as it is opaque.

Can you share your schema? And the payload that created the Object type?

ofajardo commented 2 years ago

This is how it can be reproduced. This data is part of my test data suite, so I was checking whether, after converting the spss data to a dict and then passing it to polars, my tests would pass. My guess is the offending columns here are probably those with a datetime-like object. In case you are wondering why I return data like that instead of a pandas timestamp: for us it is definitely not the same to have a variable as date or as datetime, since the resolution is not the same, and that is important for our use cases; also, time only is not the same as date or datetime.

import datetime

import numpy as np
import polars as pl
import polars.testing  # needed for pl.testing.assert_frame_equal

dicdata = {
    'mychar': np.array(['a', 'b', 'c', 'd', 'e'], dtype=object),
    'mynum': np.array([1.1, 1.2, -1000.3, -1.4, 1000.3]),
    'mydate': np.array([datetime.date(2018, 5, 6), datetime.date(1880, 5, 6),
                        datetime.date(1960, 1, 1), datetime.date(1583, 1, 1),
                        np.nan], dtype=object),
    'dtime': np.array([datetime.datetime(2018, 5, 6, 10, 10, 10),
                       datetime.datetime(1880, 5, 6, 10, 10, 10),
                       datetime.datetime(1960, 1, 1, 0, 0),
                       datetime.datetime(1583, 1, 1, 0, 0),
                       np.nan], dtype=object),
    'mylabl': np.array([1., 2., 1., 2., 1.]),
    'myord': np.array([1., 2., 3., 1., 1.]),
    'mytime': np.array([datetime.time(10, 10, 10), datetime.time(23, 10, 10),
                        datetime.time(0, 0), datetime.time(16, 10, 10),
                        np.nan], dtype=object),
}

df1 = pl.DataFrame(dicdata)
df2 = pl.DataFrame(dicdata)
pl.testing.assert_frame_equal(left=df1, right=df2)
# raises error

ofajardo commented 2 years ago

OK, on dev there is now a new parameter output_format; if you set it to 'dict' you get a dict of numpy arrays. Let's keep this issue open for future developments.

ritchie46 commented 2 years ago

Awesome! Thanks a lot. I also improved from_dict performance.

Once it is released I'll wrap this in polars. 👍
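
For reference, a minimal example of the from_dict entry point mentioned above, with made-up data:

import numpy as np
import polars as pl

# a dict of numpy arrays, like the one pyreadstat's new output_format returns
data = {"x": np.array([1.0, 2.0, 3.0]), "y": np.array([4.0, 5.0, 6.0])}
df = pl.from_dict(data)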

ofajardo commented 2 years ago

sweet! So you would create a method in polars to read spss that wraps pyreadstat.read_sav? Or do you mean I should use polars' from_dict inside pyreadstat?

ritchie46 commented 2 years ago

So you would create a method in polars to read spss that wraps pyreadstat.read_sav?

Yes. :)

ofajardo commented 2 years ago

Cool! I'll release soon

jccguma commented 2 years ago

Thanks @ofajardo and @ritchie46 for your hard work! I really appreciate it.

ofajardo commented 2 years ago

OK, 1.1.6 is released

import pyreadstat
import polars

dicdata, meta = pyreadstat.read_sav('/path/to/a/file.sav', output_format='dict')
df = polars.DataFrame(dicdata)

ofajardo commented 2 years ago

as I didn't hear anything else, I assume it is working fine. Therefore closing this.

ofajardo commented 1 year ago

@ritchie46 @jccguma question: was this finally incorporated into polars? I was looking for the function/method and couldn't find it