JohnSnowLabs / nlu

1 line for thousands of State of The Art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.
Apache License 2.0

DataFrame problem with pyspark and pandas interaction #202

Closed Vlod-github closed 9 months ago

Vlod-github commented 9 months ago

When executing the following code, an error occurs:

from johnsnowlabs import nlp

pipeline = nlp.load('sentiment')
pipeline.predict("I love this Documentation! It's so good!")
...
Approximate size to download 354.6 KB
Download done! Loading the resource.
[OK!]
Warning::Spark Session already created, some configs may not take.
Traceback (most recent call last):
  File "/home/user/Documents/test/nlu/test_maen.py", line 8, in <module>
    pipeline.predict("I love this Documentation! It's so good!")
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 468, in predict
    return __predict__(self, data, output_level, positions, keep_stranger_features, metadata, multithread,
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/utils/predict_helper.py", line 166, in __predict__
    pipe.fit()
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 202, in fit
    self.vanilla_transformer_pipe = self.spark_estimator_pipe.fit(self.get_sample_spark_dataframe())
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/nlu/pipe/pipeline.py", line 101, in get_sample_spark_dataframe
    return sparknlp.start().createDataFrame(data=text_df)
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/session.py", line 603, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 299, in createDataFrame
    data = self._convert_from_pandas(data, schema, timezone)
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 327, in _convert_from_pandas
    for column, series in pdf.iteritems():
  File "/home/user/Documents/test/nlu/.venv/lib64/python3.10/site-packages/pandas/core/generic.py", line 6202, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'iteritems'. Did you mean: 'isetitem'?
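
The failure comes from pandas 2.0 removing DataFrame.iteritems() (it was renamed to DataFrame.items()), while the pyspark conversion code shown in the traceback still calls the old name. Below is a minimal sketch of the rename plus an unofficial stopgap alias, assuming pandas >= 2 stays installed; the alias is not from the NLU docs, just a common workaround for older pyspark versions.

import pandas as pd

# pandas >= 2.0 only has .items(); .iteritems() was removed, which is
# exactly the AttributeError pyspark's _convert_from_pandas hits above.
pdf = pd.DataFrame({"text": ["I love this Documentation! It's so good!"]})
for column, series in pdf.items():
    print(column, series.iloc[0])

# Hypothetical stopgap: alias the removed name back before loading the
# pipeline, since .items() yields the same (column, Series) pairs.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items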

There is a solution for this error on Stack Overflow. Maybe you should specify the right version range in the dependencies of the johnsnowlabs module? For example, pandas >= 1.3.5, < 2. A sketch of such a pin follows.
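
A sketch of how that pin could look in a setuptools-style dependency list (illustrative only, not the actual johnsnowlabs setup.py):

# Hypothetical excerpt of an install_requires list with the suggested bound.
install_requires = [
    "pandas>=1.3.5,<2",
    # ... other dependencies unchanged
]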

Platform - Fedora Linux 36

openjdk version "11.0.19" 2023-04-18
OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2.fc36) (build 11.0.19+7)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2.fc36) (build 11.0.19+7, mixed mode, sharing)
C-K-Loan commented 9 months ago

Hi @Vlod-github, NLU <= 5.0.1 is not compatible with pandas >= 2. If you downgrade pandas to 1.5.3, this bug will be fixed.

The latest NLU 5.0.2 release fixes this bug as well, so you can use that instead.
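
A quick check (an illustration, not part of NLU) to see which of the two fixes applies to a given environment, using the compatibility rule above:

from importlib.metadata import version

import pandas as pd
from packaging.version import Version

# Per this thread: NLU <= 5.0.1 needs pandas < 2; NLU >= 5.0.2 works with pandas 2.x.
nlu_ok = Version(version("nlu")) >= Version("5.0.2")
pandas_ok = Version(pd.__version__) < Version("2")

if not (nlu_ok or pandas_ok):
    print("Incompatible: downgrade pandas to 1.5.3 or upgrade nlu to 5.0.2")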

C-K-Loan commented 9 months ago

Fixed in NLU 5.0.2: https://github.com/JohnSnowLabs/nlu/pull/206