RFC `Nodes` and `DataContainers` extension for supporting scikit-learn

jjerphan commented 10 months ago

Context: scikit-learn's usage and specificities

While the current Nodes and DataContainers are sufficient for most libraries like SciPy and NumPy, which can be used entirely through free functions, other libraries (like scikit-learn) have workflows relying on stateful instances of classes they define.

In the case of scikit-learn:

Most workflows rely on instances of Estimators (i.e. generally Regressors, Classifiers and Transformers) and a few methods on those instances (basically fit, predict, predict_proba, score, score_samples).
Those instances are generally composed within a sklearn.pipeline.Pipeline, which is itself a meta-estimator.
Estimators accept and return NumPy arrays and common Python objects (int, float, str, dict, list, tuple). As of 1.2, scikit-learn has extended support for pandas.DataFrame (pandas is not a dependency of scikit-learn).
(Less important) After Estimators are fit, public fitted attributes (parts of those instances' states) can be accessed to obtain relevant information.
- For most Estimators, public fitted attributes' access is useful (it provides additional information) but is not strictly required
- For some Estimators, accessing public fitted attributes is the whole point of fitting the Estimator and is thus required
(Less important) Estimators have specific public methods (e.g. cost_complexity_pruning_path for sklearn.tree.DecisionTreeClassifier). Those are defined either in final classes or common mixin or base classes.
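To make the stateful workflow described above concrete, here is a minimal sketch using only well-known scikit-learn classes. The toy data values are made up for illustration; the point is that `fit` mutates the instance, and the other methods and the fitted attributes (trailing underscore) only make sense afterwards.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data: two well-separated clusters (illustrative values only).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [3.0, 3.1], [3.2, 2.9], [2.9, 3.3]])
y = np.array([0, 0, 0, 1, 1, 1])

# A Transformer is an instance whose state is learned by `fit`.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# A Classifier exposes the common methods on the fitted instance.
clf = LogisticRegression()
clf.fit(X_scaled, y)
labels = clf.predict(X_scaled)
probabilities = clf.predict_proba(X_scaled)
accuracy = clf.score(X_scaled, y)

# Public fitted attributes end with an underscore and only exist after `fit`.
print(scaler.mean_, clf.coef_, clf.classes_)
```

This is the difference with free functions: a Node wrapping `predict` needs the fitted instance itself as an input, not just arrays.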

Some existing nodes under AI_ML and GENERATORS already use scikit-learn under the hood, such as:

AI_ML/CLASSIFICATION/SUPPORT_VECTOR_MACHINE/SUPPORT_VECTOR_MACHINE.py
4:from sklearn import svm, preprocessing

AI_ML/CLASSIFICATION/TRAIN_TEST_SPLIT/TRAIN_TEST_SPLIT.py
3:from sklearn.model_selection import train_test_split

AI_ML/NLP/COUNT_VECTORIZER/COUNT_VECTORIZER.py
2:from sklearn.feature_extraction.text import CountVectorizer

GENERATORS/SAMPLE_DATASETS/TEXT_DATASET/TEXT_DATASET.py
2:from sklearn.datasets import fetch_20newsgroups
3:from sklearn.utils import Bunch

Depending on the use-cases Flojoy wants to target, we might want to develop Nodes:

for specific topics (like the current ones for AI and ML applications)
or for specific open-source projects (like the current ones for NumPy and SciPy)
or for both

This RFC mainly aims at defining this second option.

Proposed scope: focus only on the minimal required steps

The minimal required steps are the following:

Loading or creating a dataset materialized as X, y, two NumPy arrays
Various pre-processing of the datasets (scaling, encoding, etc.)
Splitting the dataset in several folds.
- Canonically, X and y get split as:
  - X_train and y_train: to fit an estimator
  - X_val and y_val: to evaluate an estimator's performance during the model selection
  - X_test and y_test: to evaluate the final chosen estimator's performance
- Model selection abstractions (such as sklearn.model_selection.GridSearchCV) generally take care of training and validation, so X and y get split as:
  - X_train and y_train: to fit estimators and evaluate their performance during the model selection (they are further split in the process)
  - X_test and y_test: to evaluate the final chosen model's performance
Fitting an estimator (a MetaEstimator if model selection is used)
(Evaluating the model)
Scoring the final estimator
Predicting using the final estimator
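The steps above can be sketched end to end as follows. This is only an illustration under assumptions: the iris dataset, the decision tree, and the `max_depth` grid are arbitrary choices, and the pre-processing step is skipped because trees do not need feature scaling.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Load a dataset materialized as X, y (two NumPy arrays).
X, y = load_iris(return_X_y=True)

# 2. Keep a test fold aside; GridSearchCV handles train/validation splits itself.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 3. Fit a meta-estimator performing model selection by cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4]},
    cv=5,
)
search.fit(X_train, y_train)

# 4. Score the final estimator and predict with it on the held-out fold.
test_accuracy = search.score(X_test, y_test)
predictions = search.predict(X_test)
print(search.best_params_, test_accuracy)
```

Each numbered comment maps to a candidate Node: dataset loading, splitting, fit, score, and predict.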

In scikit-learn, this scope non-exhaustively targets the following interfaces:

Free functions:
Classes:
- Transformers:
  - [ ] sklearn.preprocessing.StandardScaler
  - [ ] sklearn.preprocessing.OneHotEncoder
- MetaEstimators:
  - [ ] sklearn.model_selection.GridSearchCV
- Regressors:
  - [ ] sklearn.linear_model.LinearRegression
  - [ ] sklearn.tree.DecisionTreeRegressor
- Classifiers:
  - [ ] sklearn.linear_model.LogisticRegression
  - [ ] sklearn.tree.DecisionTreeClassifier

For a first minimal support of scikit-learn, I propose considering the following as out of scope for now:

Support of sklearn.pipeline.Pipeline
Support of pandas.DataFrame within scikit-learn
Access to instances' public attributes
Estimator-specific methods

Proposed design

[ ] Define DataContainers specifically for most of scikit-learn's Estimators and Transformers.
- We might want to follow/reuse the semantics of scikit-learn's common mixins and base classes for type-checking Nodes' inputs.
- We might want to define "Fitted" versions of those DataContainers.
[ ] Define Nodes to load or create datasets
- from scikit-learn's datasets (sklearn.datasets.load_*)
- from scikit-learn's samples generators (sklearn.datasets.make_*)
- from a CSV file (via pandas.read_csv)
[ ] Define Nodes for the main methods:
- Methods to consider:
  - [ ] fit
  - [ ] predict
  - [ ] predict_proba
  - [ ] score
  - [ ] score_samples
- Inputs:
  - Aforementioned DataContainers
  - OrderedPair generally
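One possible shape for this design is sketched below. Everything here is hypothetical: `EstimatorContainer`, `FittedEstimatorContainer`, `fit_node` and `predict_node` are illustrative names, not Flojoy's actual API, and `MeanRegressor` is a toy stand-in that merely follows scikit-learn's fit/predict convention so the sketch runs without extra dependencies.

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass(frozen=True)
class EstimatorContainer:
    """Hypothetical container wrapping an unfitted estimator instance."""
    estimator: Any


@dataclass(frozen=True)
class FittedEstimatorContainer:
    """Hypothetical container wrapping an estimator after `fit` was called."""
    estimator: Any


def fit_node(container: EstimatorContainer, X: Sequence, y: Sequence) -> FittedEstimatorContainer:
    """Hypothetical 'fit' Node: accepts only unfitted containers."""
    if not isinstance(container, EstimatorContainer):
        raise TypeError("fit_node expects an EstimatorContainer")
    # By convention `fit` returns the estimator itself.
    fitted = container.estimator.fit(X, y)
    return FittedEstimatorContainer(estimator=fitted)


def predict_node(container: FittedEstimatorContainer, X: Sequence) -> Any:
    """Hypothetical 'predict' Node: accepts only fitted containers."""
    if not isinstance(container, FittedEstimatorContainer):
        raise TypeError("predict_node expects a FittedEstimatorContainer")
    return container.estimator.predict(X)


class MeanRegressor:
    """Toy estimator (not part of scikit-learn) that predicts the mean of y."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


fitted = fit_node(EstimatorContainer(MeanRegressor()), X=[[0], [1], [2]], y=[1.0, 2.0, 3.0])
predictions = predict_node(fitted, [[5], [6]])
print(predictions)
```

Separating the fitted and unfitted types lets a graph editor reject a `predict` Node wired to an estimator that was never fit, before anything runs.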

Proposed metric of success

Being able to produce examples similar to the ones of scikit-learn in Flojoy, such as:

References

Glossary of Common Terms and API Elements
Common scikit-learn mixins and base classes:
- Documentation
- Code
Pipelines and composite estimators

jackparmer commented 10 months ago

cc @dstrande @dstrande @Ben-Epstein @Roulbac ☝️

dstrande commented 10 months ago

Nice very detailed @jjerphan

By DataFrame support I'm guessing you mean all the functions like .max, .pivot, .apply, etc. ? (see the sidebar here)

I also want to ask people who built the backend (like @smahmed776 ) if they think this will require major changes to the backend beyond DataContainer. Adding the ability to pass classes in Flojoy is a bit different than what we're currently doing.

jjerphan commented 10 months ago

By DataFrame support I meant supporting passing pandas.DataFrames to scikit-learn interfaces, be they free functions or instances of classes.

trbritt commented 10 months ago

Hi Julien,

I wanted to add an example that should be a good target for this integration, using an industry application we've already been contacted about: semiconductor wafer quality assessment.

What is the data: greyscale images of semiconductor wafers (resolution ~ 50x50)

What is the goal: given an input image, identify the quality assessment of the wafer

The failure types we are interested in are the following:

Center defects,
Donut defects (annular defect about the center of the wafer),
Edge-Loc (meaning a defect located directly on the edge of the wafer),
Edge-Ring (meaning a defect that spans the entire perimeter of the wafer),
Loc (a localised point defect inside the wafer),
Near-full (near total production failure),
Scratch (a clean thin line along the surface of the wafer),
Random (meaning none of the above),
None (no defect)

Given the complication of categorizing each image into any of these categories, it is a perfect test case for an ML application.

For reference to train the model, please use the dataset found here, which is a cleaned version of the data found here. I've included a brief visualization of 100 wafers from this dataset in the video below, generated from a little gist here.

Once you get a model trained to correctly identify the images in the example dataset, the functionality can then be ported to Flojoy, at which point I will have finished integrating batch processing into Flojoy.

https://github.com/flojoy-ai/studio/assets/53545754/2926791f-0fe7-488f-aa7d-06f64a51d856

jjerphan commented 10 months ago

Hi Tristan,

I have several questions:

We could try something with scikit-learn, but due to the nature of the problem I think a simple CNN (creatable using a Deep Learning framework) might perform way better for this case. What do you think?
You mentioned "semiconductor wafer quality assessment" as an industrial application. Do you have any other applications relying on image processing that you are targeting?
Would having Nodes to load pre-trained models within Flojoy be valuable?

trbritt commented 10 months ago

Hi Julien,

With regards to the example case, I do think something like an MLP / CNN would be best for that. As far as I know, the functionality in from sklearn.neural_network import MLPClassifier would provide a class that has its fit and predict methods as well that would fit into the proposed plan already, no? I think it would be good to also add this level of functionality to your plan.

Other than that, the proposed approach sounds good to me (with the addition above). The scope you've defined seems to be very nice for this first integration. I would say you can go ahead with this plan (if @Roulbac @dstrande @Ben-Epstein approve as well).

I do think it would be valuable for users to be able to input their own pre-trained models. Many industry partners have already spent massive computational resources on various models, and if they can just easily insert them into Flojoy, I think it will make our product and its functionality more attractive to potential customers.

Ben-Epstein commented 10 months ago

👋 I'll break my thoughts into a few sections

The wafer quality example

due to the nature of the problem I think a simple CNN [...] might perform way better for this case

Definitely agree, this is not a feasible use-case for sklearn in my opinion. And I don't think it aligns well with the typical use-cases of sklearn users.

Sklearn models are often

very fast to train
run on datasets that are reasonable in size (will fit on a single machine)
small (in terms of the trained model)
classic ML algorithms (non-deep learning, with the simple exception of the MLP, which is not all that powerful).

I would focus on examples that map to these criteria.

Scope and design

If I'm understanding your proposal correctly (building a node for each of the components listed under Proposed design), I don't think this is a scalable way to support sklearn. In fact, I've done something similar in the past and it's incredibly tedious, as there are loads of different models that users may want.

I would instead suggest considering a framework that has a node for

transformers
classifiers
regressors

Each of those has required parameters such as

class (classifier type, transformer type etc)
baseline params for the baseclass

You could even extend that to have dynamic parameters that are based on the class chosen. For example

if they select classifiers -> decision tree, then you let them pick min_leafs
if they select classifiers -> random forest, then you let them pick num-trees

This will let you scale much more easily, both from a development perspective and from a UX perspective, as having a node per classifier in the UI might be hard to navigate.
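A rough sketch of that idea for the classifier node is given below. The registry, its keys, and `make_classifier` are hypothetical names chosen for illustration. Note that the actual scikit-learn parameter names are `min_samples_leaf` for decision trees and `n_estimators` for random forests.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry: name shown in the UI -> (class, parameters exposed to the user).
CLASSIFIERS = {
    "decision_tree": (DecisionTreeClassifier, {"min_samples_leaf", "max_depth"}),
    "random_forest": (RandomForestClassifier, {"n_estimators", "max_depth"}),
}


def make_classifier(kind: str, **params):
    """Build an unfitted classifier, allowing only the parameters exposed for `kind`."""
    if kind not in CLASSIFIERS:
        raise ValueError(f"Unknown classifier: {kind!r}")
    cls, allowed = CLASSIFIERS[kind]
    unexpected = set(params) - allowed
    if unexpected:
        raise ValueError(f"Parameters not exposed for {kind!r}: {sorted(unexpected)}")
    return cls(**params)


tree = make_classifier("decision_tree", min_samples_leaf=5)
forest = make_classifier("random_forest", n_estimators=50)
print(tree, forest)
```

Adding a new model then means adding one registry entry rather than writing a new Node.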

Out of scope

If you want to make Pipelines out of scope, you should consider talking to your prospective audience and understanding their use-cases. For example, a very common practice is to have pipelines that employ FunctionTransformers that take arbitrary Python code and execute it over a dataframe. This is pretty valuable to ML users, but I don't know your audience.
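For readers unfamiliar with that pattern, here is a minimal sketch. It uses NumPy arrays rather than a dataframe to stay dependency-free, and the data is synthetic (y is exactly log10 of x), so the numbers only illustrate the mechanics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic data: y = log10(x).
X = np.array([[1.0], [10.0], [100.0], [1000.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

pipeline = Pipeline([
    # FunctionTransformer wraps an arbitrary callable as a pipeline step.
    ("log", FunctionTransformer(np.log10)),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# The whole chain is fit and used like a single estimator.
pipeline.fit(X, y)
prediction = pipeline.predict(np.array([[10000.0]]))
print(prediction)
```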

Similarly, dataframes are pretty standard in ML over arrays/ordered pairs. They offer that necessary structure, so I'd again consider talking to your customers to get a better idea of their wants.

Pre-trained models

This is incredibly valuable and should definitely be considered. There are 2 components to this

Pre-trained models that users can load in: this should be pretty simple, just have the user select (1) the framework and (2) where the model is stored, and you can load it and predict with it based on the framework

Pre-trained models not from the user: I'd suggest doing this with HuggingFace Pipelines, as they are plug and play. For example, text_gen_model = pipeline("text2text-generation") will give you an LLM out of the box with a straightforward interface to make predictions. These are the available out-of-the-box pipelines from HuggingFace:

['audio-classification', 'automatic-speech-recognition', 'conversational', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-text', 'mask-generation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'video-classification', 'visual-question-answering', 'vqa', 'zero-shot-audio-classification', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-object-detection', 'translation_XX_to_YY']
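For the first component (models supplied by the user), a loader could look like the sketch below. `load_model` and its `framework` argument are hypothetical. Only the scikit-learn branch is shown, using joblib, which is a common way to persist scikit-learn estimators. Such files are pickle-based, so they should only be loaded from trusted sources.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LinearRegression


def load_model(framework: str, path):
    """Hypothetical loader dispatching on the framework chosen by the user."""
    if framework == "sklearn":
        # joblib files are pickle-based: only load files you trust.
        return joblib.load(path)
    raise NotImplementedError(f"No loader sketched for framework {framework!r}")


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"

    # Stand-in for a model the user trained elsewhere (synthetic data, y = 2x).
    trained = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 2.0, 4.0])
    joblib.dump(trained, model_path)

    # Inside Flojoy, a Node would only need the framework name and the path.
    loaded = load_model("sklearn", model_path)
    prediction = loaded.predict([[3.0]])

print(prediction)
```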

jjerphan commented 10 months ago

With regards to the example case, I do think something like an MLP / CNN would be best for that. As far as I know, the functionality in from sklearn.neural_network import MLPClassifier would provide a class that has its fit and predict methods as well that would fit into the proposed plan already, no? I think it would be good to also add this level of functionality to your plan.

Including sklearn.neural_network.MLP{Classifier,Regressor} can definitely be done without any additional cost, and I am not against it, even though scikit-learn's MLPs aren't the most flexible or performant.
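As a quick illustration that MLPClassifier follows the same fit/predict/score interface, here is a sketch on scikit-learn's bundled digits dataset (8x8 greyscale images). This is only a stand-in for small greyscale image classification, not the wafer dataset, and the hyperparameters are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Each sample is a flattened 8x8 greyscale image with pixel values in [0, 16].
X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixels to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Same interface as any other scikit-learn classifier.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
accuracy = mlp.score(X_test, y_test)
first_predictions = mlp.predict(X_test[:5])
print(accuracy, first_predictions)
```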

What I meant is that providing CNNs might be better suited to such classification or regression problems, since those architectures make use of the hierarchical structure of n-d signals much more than MLPs do. I think the use-cases you have been presented with motivate the introduction of nodes from (or at least workflows using) deep learning frameworks. Even though this might be out of the scope of scikit-learn's support within Flojoy, some frameworks (like Keras) have an API and UX really similar to scikit-learn's, and the work on integrating scikit-learn might help with integrating them.

If supporting those frameworks makes sense, we might want to open discussions for that. What do you think?

Ben-Epstein commented 10 months ago

If supporting those frameworks makes sense, we might want to open discussions for that. What do you think?

I agree it's a different topic, and one worth having. But I'd just toss in that you should strongly consider using HuggingFace over Keras. It's a much simpler framework, and I imagine that a large percentage of use cases from customers will have pre-trained models already on the hub.

jjerphan commented 10 months ago

I confirm that supporting HuggingFace's pipelines would help users solve a variety of problems that scikit-learn is not suited for.

Depending on Flojoy's vision or targeted use-cases (which I do not know entirely), scikit-learn might not be as relevant as other solutions.

Would you like to provide your users with:

the ability to solve a variety of problems effectively without writing code, mainly through a web UI (like Gradio)?
the ability to program with finer-grained blocks that construct graphs (like Simulink)? Note that this was my understanding of Flojoy's vision when writing this RFC.

Roulbac commented 10 months ago

Thank you @jjerphan for initiating this conversation, here is my feedback on the matter.

Firstly, I want to bring to everyone's attention the utility of model inference in the context of Flojoy versus model training. With a myriad of complexities around model training, Flojoy can really shine much more easily by catering to pre-trained models which users want to deploy with ease. Please bear in mind that this doesn't mean we should drop model training at all, but rather focus more energy on model inference while still catering for simpler model training scenarios.

Why Prioritize Inference over Training:

Data Preparation and Model Training: The steps of data ingestion, curation, and training are non-trivial. While it's tempting to make Flojoy the all-in-one solution, catering to these processes may detract from making Flojoy truly stellar at what it's designed for - ease of deploying (AI) applications. Data preparation often happens interactively in ephemeral environments that allow the users to iterate and visualize quickly (which is why Jupyter is great at that), whereas Flojoy is really designed to build pipelines.
Foundation Models: Pre-trained models, such as an image classifier that detects humans, could be more universally valuable to our users. Once loaded into Flojoy, users can fine-tune them (this would be a simple training use-case) or use them as-is for inference on new data.
Back to the scope of Jupyter vs Flojoy: Jupyter is excellent for iterative data processing and exploration, and it might not be in Flojoy's best interest to replicate this interactive capability. Instead, Flojoy can prioritize seamless integration of pre-trained models, perhaps even those developed in Jupyter, for fast and efficient model deployment, on top of simple model training use-cases that users can do on Flojoy.

Feedback on the Issue:

Wafer Quality Assessment: Given the intricacies of image classification, especially for semiconductor wafers, CNNs would indeed be a better-suited choice than traditional ML models. However, considering that the proposal revolves around scikit-learn's capabilities, the MLPClassifier could serve as a basic starting point. That said, I'd agree with @ben-epstein that this might not be the best use-case for scikit-learn.
Incorporating Pre-trained Models: @trbritt's point on allowing users to input their own pre-trained models is important. Industry partners who have invested computational resources in training models would find this functionality invaluable. This approach aligns well with the idea of emphasizing model inference.

TL;DR There is a lot to gain in supporting pre-trained pipelines and simple model training/fine-tuning use-cases, and it would be much harder to make Flojoy a fully-fledged model training platform. This thought would be important to keep in mind while making design choices for the platform. HF pipelines are an excellent example of what Flojoy could do very well.

jjerphan commented 10 months ago

Thank you for this comprehensive comment, @Roulbac. I agree with everything that you have laid out.

After identifying Flojoy's direction and relevant use-cases, I think that the support of scikit-learn (which QuantStack was contacted for) might not be as relevant (for now) as deploying models.

I propose that we open another RFC dedicated to Model Deployment within Flojoy to pursue discussions. What do you think?

jackparmer commented 10 months ago

I propose that we open another RFC dedicated to Model Deployment within Flojoy to pursue discussions

+1 I agree ☝️

flojoy-ai / studio