MultiClassifierDLApproach not transforming every row of my dataset

AntoineF3006 commented 7 months ago

Is there an existing issue for this?

[X] I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am currently working on a multi-output classification task: classifying customer comments into several categories. I am using MultiClassifierDLApproach for this task, trained on already labeled data. I followed this tutorial: https://www.johnsnowlabs.com/mastering-text-classification-with-spark-nlp.

Current Behavior

After fitting my pipeline (described below) on my training set, I transform both the training and test sets with it. The results are pretty good, but for some rows the category column is empty and no probabilities are returned for any category.

Expected Behavior

I expected every row to get a probability for every category. Some rows may end up with no selected category, since I set the threshold to 0.5, but the per-category values should at least be present.
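To make the expectation concrete, here is a minimal sketch in plain Python. It is not Spark NLP code and does not claim to reflect how MultiClassifierDLApproach works internally; the function name `select_labels`, the category names, and the scores are all hypothetical. It only illustrates that thresholding can leave a row with no selected label while the underlying scores still exist.

```python
from typing import Dict, List


def select_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the labels whose score is at or above the threshold.

    Assumption: an inclusive comparison (>=); the real library may differ.
    """
    return [label for label, score in scores.items() if score >= threshold]


# Hypothetical scores for one comment; none of them reaches 0.5.
scores = {"delivery": 0.31, "pricing": 0.12, "support": 0.44}

selected = select_labels(scores, threshold=0.5)
print(selected)  # []
print(scores)    # the per-category values that were still expected
```

In this sketch an empty selection is a normal outcome of the threshold, which is why the missing per-category values, rather than the empty category itself, are the surprising part.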

Steps To Reproduce

https://drive.google.com/file/d/1tmJYwZKBVZoHtLcuyWtWhsu6nbonKG-S/view?usp=sharing

In this zip you will find an .ipynb that recreates the steps I used to build my pipeline, some sample data with their results, and the already fitted pipeline. The input column is texte_sw, the label column is niveau_2_MC, and the output column is category. The issue seems to occur uniformly across my data: neither the date and time, the text length, nor the number of words seems to be the cause.
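One way to check whether empty outputs correlate with the number of words is sketched below. It is plain Python over rows already collected from the transformed DataFrame as (texte_sw, category.result) pairs. The helper `summarize_empty` and the sample rows are hypothetical and are not taken from the shared notebook.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def summarize_empty(rows: Sequence[Tuple[str, List[str]]]) -> Dict[str, float]:
    """Compare rows with an empty label list against rows with at least one label."""
    empty = [text for text, labels in rows if not labels]
    filled = [text for text, labels in rows if labels]

    def avg_words(texts: List[str]) -> float:
        # Average whitespace-separated word count; 0.0 when the group is empty.
        return mean(len(t.split()) for t in texts) if texts else 0.0

    return {
        "empty_fraction": len(empty) / len(rows) if rows else 0.0,
        "avg_words_empty": avg_words(empty),
        "avg_words_filled": avg_words(filled),
    }


# Hypothetical sample rows: (input text, predicted labels).
rows = [
    ("livraison en retard", ["delivery"]),
    ("prix trop eleve", []),
    ("service client tres aimable", ["support"]),
    ("colis abime", []),
]

summary = summarize_empty(rows)
print(summary)
```

If the two averages are close on the real data, word count is unlikely to explain the empty rows, which would match the observation above.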

Spark NLP version and Apache Spark

sparknlp.version(): 5.2.3
spark.version: 3.2.0.3.2.7170.1008-2

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

AntoineF3006 commented 7 months ago

Hello @maziyarpanahi, is my issue complete enough, or do I need to add more context or data so that it can be discussed? Kind regards,

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 5 days

JohnSnowLabs / spark-nlp