biolab / orange3

🍊 Orange: Interactive data analysis
https://orangedatamining.com

New widget : cleansing data #6640

Closed simonaubertbd closed 10 months ago

simonaubertbd commented 11 months ago

What's your use case? More than once, the data I work on has contained null values or unwanted spaces at the end of fields, and sometimes an entire field is empty or there are completely empty rows in the middle of the data set. There aren't many such cases, but they come up so often that a widget to automate the cleanup would be great.

What's your proposed solution? A widget with the main cleaning operations. Something like this: [image]

Are there any alternative solutions? Using several widgets in every affected workflow to process the data.
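For illustration, the cleaning operations described in this request (trimming stray spaces, treating blanks as missing, dropping fully empty rows and fully empty fields) can be sketched in plain Python. This is only a minimal sketch and not an existing Orange widget or API: the names `is_blank` and `clean_rows` and the list-of-dicts table layout are assumptions made for the example.

```python
from typing import Optional

# Hypothetical table layout for this sketch: one dict per row, column name -> text or None.
Row = dict[str, Optional[str]]


def is_blank(value: Optional[str]) -> bool:
    """Treat None and whitespace-only strings as missing."""
    return value is None or value.strip() == ""


def clean_rows(rows: list[Row]) -> list[Row]:
    """Strip whitespace, then drop fully empty rows and fully empty columns."""
    # 1. Trim leading/trailing whitespace and normalise blank values to None.
    stripped = [
        {key: (None if is_blank(val) else val.strip()) for key, val in row.items()}
        for row in rows
    ]
    # 2. Drop rows in which every field is missing.
    non_empty = [row for row in stripped if any(v is not None for v in row.values())]
    # 3. Keep only columns that have a value in at least one remaining row.
    keep = [
        key
        for key in (non_empty[0] if non_empty else {})
        if any(row.get(key) is not None for row in non_empty)
    ]
    return [{key: row.get(key) for key in keep} for row in non_empty]


sample = [
    {"name": "Ana  ", "city": "Ljubljana", "note": None},
    {"name": None, "city": "   ", "note": ""},   # fully empty row
    {"name": " Bo", "city": None, "note": None},
]
cleaned = clean_rows(sample)
print(cleaned)
```

In this sample the second row and the `note` column are removed, while the missing `city` in the last row is kept as `None`, since imputing values is a separate step.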

wvdvegte commented 11 months ago

Have you checked out the widgets Impute, Unique and Preprocess?

simonaubertbd commented 11 months ago

Hello @wvdvegte, this time, yes. I also checked the preprocessing for text. But it's not exactly the same thing, since here you can also clean strings, choose which fields to apply the cleaning to, etc. The idea is more to improve data quality.

Best regards,

Simon

wvdvegte commented 11 months ago

There's also a lot you can do using the Formula widget and any Python code that fits into a one-line variable assignment, e.g. removing (leading, trailing or all) spaces, case modifications and many other things, as long as it doesn't require external libraries or multiple lines of code. For inexperienced programmers like me, AI chatbots can be used very effectively to generate such code. Anyway, if the typical cleansing actions that you refer to appear to be universal, it might indeed be a good idea to unify them in a new widget, or add them to one of the preprocessing widgets.
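As a rough illustration of the kind of one-line expressions meant here, the following uses only built-in Python string methods. The variable `name` is a stand-in for a text column value; how a column is referenced inside the Formula widget is not shown in this thread, so treat that part as an assumption.

```python
# `name` is a hypothetical text value standing in for one cell of a string column.
name = "  Ada  LOVELACE "

trimmed = name.strip()                   # remove leading and trailing spaces
right_trimmed = name.rstrip()            # remove trailing spaces only
no_spaces = name.replace(" ", "")        # remove all spaces
single_spaced = " ".join(name.split())   # collapse runs of whitespace to one space
title_case = name.strip().title()        # case modification

print(repr(trimmed), repr(single_spaced), repr(title_case))
```

Each right-hand side is a single expression with no imports, which is the constraint described above.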

ajdapretnar commented 11 months ago

@simonaubertbd My first impression is your task can be achieved with some combination of existing widgets. Admittedly, for some specifics, you would indeed need Python Script, particularly for text handling. Case by case:

If nothing else, such a widget is more text-specific than general Orange. I need to be convinced of its general applicability first. At the moment, it seems specific to your own workflow.