HarshCasper / NeoAlgo

Bringing all Data Structures and Algorithms under one Roof ⚡
MIT License

Data Preprocessing in ML #261

Closed: Vrindagupta6828 closed this issue 4 years ago

Vrindagupta6828 commented 4 years ago

Algorithm: Data preprocessing in ML

DS:

Languages Supported: Python

Vrindagupta6828 commented 4 years ago

I want to work on it @HarshCasper

HarshCasper commented 4 years ago

Hello @Vrindagupta6828

Thanks for raising this issue. We would like to know specifically what sort of preprocessing you want to work on. Will you be writing scripts for it, or showcasing a Jupyter Notebook that shows how preprocessing is done?

Vrindagupta6828 commented 4 years ago

Hello @harshcasper, I will be showcasing a Jupyter notebook to show how preprocessing is done before applying any ML model.

HarshCasper commented 4 years ago

I would like to have suggestions from @VijayaGB98 and @ricardoprins here about these Issues.

ricardoprins commented 4 years ago

Well, this is such a complex topic, and so rich in possibilities, that I find it highly unlikely it can be covered in a single small file.

@Vrindagupta6828 how familiar are you with basic ML concepts? Would you be interested in helping us with another Tesseract Coding project related to this topic?

vgb-codes commented 4 years ago

@Vrindagupta6828

Data preprocessing depends on the data and the type of data used. How will you incorporate all the possibilities in a single Jupyter notebook? Data preprocessing is also domain dependent. For example, for histopathology images, stain normalization is applied depending on whether or not the dataset is evenly stained.

The preprocessing will also depend on which ML model you are using. Feature scaling is important for, say, K-NN and neural networks, but is not required for, say, decision trees. Additionally, this may change depending on the type of task (regression/classification) and whether regularization is used.
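To make the scaling point concrete, here is a minimal sketch using only the Python standard library. The toy data (age, income) and the helper names `standardize`, `nearest_neighbor`, and `left_side_of_split` are hypothetical and chosen purely for illustration. It assumes z-score standardization as the scaling method and a single threshold split as a stand-in for a decision tree.

```python
import math
import statistics


def standardize(rows):
    """Z-score every column: (x - column mean) / column standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def nearest_neighbor(rows, query_index):
    """Index of the row closest (Euclidean distance) to rows[query_index]."""
    query = rows[query_index]
    candidates = [i for i in range(len(rows)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(rows[i], query))


def left_side_of_split(rows, feature, k):
    """Indices of the k rows with the smallest value of `feature`.

    A threshold placed between the k-th and (k+1)-th sorted values of
    that feature sends exactly these rows to the left branch.
    """
    order = sorted(range(len(rows)), key=lambda i: rows[i][feature])
    return set(order[:k])


# Hypothetical toy data: (age in years, income in dollars).
raw = [
    (25, 50_000),
    (60, 50_500),  # similar income to row 0, very different age
    (26, 52_000),  # similar age to row 0, slightly different income
    (45, 90_000),
]
scaled = standardize(raw)

# Distance-based: income dominates the raw distance, so the answer changes.
print(nearest_neighbor(raw, 0))     # 1
print(nearest_neighbor(scaled, 0))  # 2

# Threshold-based: scaling preserves the ordering within a feature,
# so the same rows end up on the same side of the split.
print(left_side_of_split(raw, 1, 2) == left_side_of_split(scaled, 1, 2))  # True
```

The nearest neighbour of the first row flips once the features are standardized, because the raw income column dominates the Euclidean distance. The threshold split is unchanged, since standardization keeps the ordering of values within each feature.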

There are too many variables to be accounted for if you want to make a single Jupyter Notebook. My suggestion would be to allow contributions to existing notebooks, where individuals can add a section on data preprocessing if one is not already there.

Vrindagupta6828 commented 4 years ago

@ricardoprins I can.

Vrindagupta6828 commented 4 years ago

@VijayaGB98 OK, I get it.