d2cml-ai / csdid

CSDID
https://d2cml-ai.github.io/csdid/index.html
MIT License

Error using PySpark with Patsy. #10

Closed. TJhon closed this issue 1 year ago

TJhon commented 1 year ago

patsy cannot build the design matrices from a formula when the data is a pandas-on-Spark (pyspark.pandas) DataFrame.

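from csdid.att_gt import ATTgt
import pandas as pd
import patsy
import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Load the example data with pandas, then wrap it in a pandas-on-Spark DataFrame.
data = pd.read_csv("https://raw.githubusercontent.com/d2cml-ai/csdid/function-aggte/data/mpdta.csv")
psdata = ps.DataFrame(data)

# Fails: patsy iterates over the columns, which pyspark.pandas does not support.
patsy.dmatrices('lemp~1', data=psdata, return_type='matrix')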
---------------------------------------------------------------------------
PandasNotImplementedError                 Traceback (most recent call last)
<ipython-input-13-1b3e5ec1da9c> in <cell line: 2>()
      1 import patsy
----> 2 patsy.dmatrices('lemp~1', data = psdata, return_type='matrix')

7 frames
/usr/local/lib/python3.10/dist-packages/pyspark/pandas/missing/__init__.py in unsupported_function(*args, **kwargs)
     21 def unsupported_function(class_name, method_name, deprecated=False, reason=""):
     22     def unsupported_function(*args, **kwargs):
---> 23         raise PandasNotImplementedError(
     24             class_name=class_name, method_name=method_name, reason=reason
     25         )

PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

TJhon commented 1 year ago

As a replacement for patsy, the formula is now parsed into a list of strings naming the columns, and those columns are then converted into a single matrix.
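
The approach described above might look roughly like the sketch below. It is not the code that was merged into csdid: the helper names `parse_formula` and `design_matrices` are hypothetical, and the parser only handles plain column names joined by `+` (no interactions, transformations, or categorical terms). The sketch is shown with a plain pandas DataFrame; it relies only on column selection and `to_numpy()`, which the error message above recommends for pandas-on-Spark objects. The `lpop` column is used purely for illustration.

```python
import numpy as np
import pandas as pd


def parse_formula(formula):
    """Split 'y ~ x1 + x2' into (outcome, covariate names, intercept flag).

    Hypothetical helper: supports only '+'-separated column names,
    with '1' meaning "intercept" and '0' meaning "no intercept".
    """
    lhs, rhs = formula.split("~", 1)
    terms = [t.strip() for t in rhs.split("+")]
    intercept = "0" not in terms  # intercept is included unless '0' is given
    covariates = [t for t in terms if t not in ("0", "1", "")]
    return lhs.strip(), covariates, intercept


def design_matrices(formula, data):
    """Return (y, X) as NumPy arrays without iterating over any Series."""
    outcome, covariates, intercept = parse_formula(formula)
    y = data[[outcome]].to_numpy()  # shape (n, 1)
    n = y.shape[0]
    blocks = []
    if intercept:
        blocks.append(np.ones((n, 1)))  # column of ones for the intercept
    if covariates:
        blocks.append(data[covariates].to_numpy())
    X = np.hstack(blocks) if blocks else np.empty((n, 0))
    return y, X


# Small in-memory example; the values are made up.
df = pd.DataFrame({"lemp": [1.0, 2.0, 3.0], "lpop": [0.5, 1.5, 2.5]})
y, X = design_matrices("lemp ~ 1 + lpop", df)
```

Note that on a pandas-on-Spark DataFrame, `to_numpy()` collects the selected columns to the driver, so this sketch assumes the resulting matrices fit in memory.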

Fixed in the pyspark merge.