.setReadMonthFirst always reads month first

KyriakosAseto commented 11 months ago

Is there an existing issue for this?

[X] I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am using the example provided by Spark NLP with customized settings, and I am trying to configure the matchers so that they do not read the month first:

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setReadMonthFirst(False) \
    .setOutputFormat("dd/MM/yyyy")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setReadMonthFirst(False) \
    .setOutputCol("multi_date") \
    .setOutputFormat("dd/MM/yyyy")

Current Behavior

Setting the parameter to False has no effect: the month is always read first from the input.

Please see the example "I was born at 01/03/98", which is intended to be the 1st of March 1998.

Expected Behavior

With setReadMonthFirst(False), my example 01/03/98 should be read day first, i.e. as 01/03/1998 in dd/MM/yyyy (1st of March 1998), not month first.
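
For reference, the two possible readings of the example can be shown with Python's standard datetime module. This is only a sketch of the difference I expect to see in the output, not a description of how DateMatcher parses dates internally; reformat_date is a hypothetical helper written for this illustration.

```python
from datetime import datetime


def reformat_date(text: str, read_month_first: bool) -> str:
    """Parse an ambiguous xx/xx/yy date and render it as dd/MM/yyyy.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    # %y maps two-digit years 69-99 to 1969-1999, so "98" becomes 1998.
    pattern = "%m/%d/%y" if read_month_first else "%d/%m/%y"
    return datetime.strptime(text, pattern).strftime("%d/%m/%Y")


# Day-first reading: 1st of March 1998 (what I expect with False).
day_first = reformat_date("01/03/98", read_month_first=False)

# Month-first reading: 3rd of January 1998.
month_first = reformat_date("01/03/98", read_month_first=True)

print(day_first)
print(month_first)
```

Both readings are valid dates whenever the day is 12 or less, which is why the flag matters for this input.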

Steps To Reproduce

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DateMatcher, MultiDateMatcher
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline

spark = sparknlp.start()
spark

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setReadMonthFirst(False) \
    .setOutputFormat("dd/MM/yyyy")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setReadMonthFirst(False) \
    .setOutputCol("multi_date") \
    .setOutputFormat("dd/MM/yyyy") 

pipeline = Pipeline().setStages([
    documentAssembler,
    date,
    multiDate
    ])

text_list = ["See you on next monday.", 
             "I was born at 01/03/98", 
             "She was born on 02/03/1966.", 
             "The project started yesterday and will finish next year.", 
             "She will graduate by July 2023.", 
             "She will visit doctor tomorrow and next month again."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text","date.result as date", "multi_date.result as multi_date").show(truncate=False)

Spark NLP version and Apache Spark

spark-nlp==5.2.0

Type of Spark Application

Python Application

Java Version

openjdk version "11.0.21" 2023-10-17

Java Home Directory

/usr/lib/jvm/java-11-openjdk-amd64

Setup and installation

pip install numpy py4j pyspark spark-nlp

Operating System and Version

Ubuntu-22.04

Link to your project (if available)

No response

Additional Information

No response

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 5 days

Aleksis99 commented 4 months ago

This issue is still present.

JohnSnowLabs / spark-nlp