JohnSnowLabs / nlu

1 line for thousands of State of The Art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.
Apache License 2.0

DataFrame problem with pyspark and pandas interaction #202

Closed Vlod-github closed 9 months ago

Vlod-github commented 9 months ago

When executing the following code, an error occurs:

from johnsnowlabs import nlp

pipeline = nlp.load('sentiment')
pipeline.predict("I love this Documentation! It's so good!")
...
Approximate size to download 354.6 KB
Download done! Loading the resource.
[OK!]
Warning::Spark Session already created, some configs may not take.
Traceback (most recent call last):
  File "/home/user/Documents/test/nlu/test_maen.py", line 8, in <module>
    pipeline.predict("I love this Documentation! It's so good!")
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 468, in predict
    return __predict__(self, data, output_level, positions, keep_stranger_features, metadata, multithread,
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/utils/predict_helper.py", line 166, in __predict__
    pipe.fit()
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 202, in fit
    self.vanilla_transformer_pipe = self.spark_estimator_pipe.fit(self.get_sample_spark_dataframe())
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 101, in get_sample_spark_dataframe
    return sparknlp.start().createDataFrame(data=text_df)
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/session.py", line 603, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 299, in createDataFrame
    data = self._convert_from_pandas(data, schema, timezone)
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 327, in _convert_from_pandas
    for column, series in pdf.iteritems():
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pandas/core/generic.py", line 6202, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'iteritems'. Did you mean: 'isetitem'?
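
The failure comes from pandas 2.0 removing DataFrame.iteritems() (it was renamed to DataFrame.items()), while the pyspark conversion code shown in the traceback still calls the old name. Below is a minimal sketch of the rename plus an unofficial stopgap alias, assuming pandas >= 2 stays installed; the alias is not from the NLU docs, just a common workaround for older pyspark versions.

import pandas as pd

# pandas >= 2.0 only has .items(); .iteritems() was removed, which is
# exactly the AttributeError pyspark's _convert_from_pandas hits above.
pdf = pd.DataFrame({"text": ["I love this Documentation! It's so good!"]})
for column, series in pdf.items():
    print(column, series.iloc[0])

# Hypothetical stopgap: alias the removed name back before loading the
# pipeline, since .items() yields the same (column, Series) pairs.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items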

There is a solution for this error on Stack Overflow. Maybe you should specify the right version range in the dependencies of the johnsnowlabs module? For example, pandas >= 1.3.5, < 2. A sketch of such a pin follows.
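
A sketch of how that pin could look in a setuptools-style dependency list (illustrative only, not the actual johnsnowlabs setup.py):

# Hypothetical excerpt of an install_requires list with the suggested bound.
install_requires = [
    "pandas>=1.3.5,<2",
    # ... other dependencies unchanged
]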

Platform - Fedora Linux 36

openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc36) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc36) (build 11.0.19+7, mixed mode, sharing)
C-K-Loan commented 9 months ago

Hi @Vlod-github, NLU <= 5.0.1 is not compatible with pandas >= 2. If you downgrade pandas to 1.5.3, this bug will be fixed.

The latest NLU 5.0.2 release fixes this bug as well, so you can use that instead.
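
A quick check (an illustration, not part of NLU) to see which of the two fixes applies to a given environment, using the compatibility rule above:

from importlib.metadata import version

import pandas as pd
from packaging.version import Version

# Per this thread: NLU <= 5.0.1 needs pandas < 2; NLU >= 5.0.2 works with pandas 2.x.
nlu_ok = Version(version("nlu")) >= Version("5.0.2")
pandas_ok = Version(pd.__version__) < Version("2")

if not (nlu_ok or pandas_ok):
    print("Incompatible: downgrade pandas to 1.5.3 or upgrade nlu to 5.0.2")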

C-K-Loan commented 9 months ago

Fixed in NLU 5.0.2: https://github.com/JohnSnowLabs/nlu/pull/206