flojoy-ai / studio

Joyful visual programming for Python
https://docs.flojoy.ai
MIT License

RFC `Nodes` and `DataContainers` extension for supporting scikit-learn #823

Open jjerphan opened 10 months ago

jjerphan commented 10 months ago

Context: scikit-learn's usage and specificities

While the current Nodes and DataContainers are sufficient for most libraries, such as SciPy and NumPy, which can be used entirely through free functions, other libraries, like scikit-learn, have workflows that rely on stateful instances of classes they define.
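To make the distinction concrete, here is a minimal sketch in plain Python. `zscore` is a free function in the style of SciPy or NumPy: its output depends only on its input. `ToyScaler` is a hypothetical class (it is not part of scikit-learn or Flojoy) that mimics scikit-learn's `fit`/`transform` convention: it learns state from training data and reuses it later, so the fitted instance itself would have to travel between nodes.

```python
from statistics import mean, pstdev


def zscore(values):
    """Free function: the output depends only on the input passed in."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


class ToyScaler:
    """Hypothetical stateful object mimicking the fit/transform convention."""

    def fit(self, values):
        # Learned state lives on the instance; the trailing underscore follows
        # scikit-learn's naming convention for fitted attributes.
        self.mean_ = mean(values)
        self.scale_ = pstdev(values)
        return self  # returning self allows chaining fit(...).transform(...)

    def transform(self, values):
        # Reuses the statistics learned in fit, not those of `values`.
        return [(v - self.mean_) / self.scale_ for v in values]


train = [1.0, 2.0, 3.0]
scaler = ToyScaler().fit(train)

# New data is scaled with the training statistics (mean 2.0), which is why
# the fitted object, and not only arrays, must be passed downstream.
scaled_new = scaler.transform([2.0, 4.0])
```

A DataContainer that only carries arrays can represent the input and output of `zscore`, but not the fitted `scaler` between a training step and a prediction step.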

In the case of scikit-learn:

Some existing nodes under AI_ML and GENERATORS already use scikit-learn under the hood, such as:

AI_ML/CLASSIFICATION/SUPPORT_VECTOR_MACHINE/SUPPORT_VECTOR_MACHINE.py
4:from sklearn import svm, preprocessing

AI_ML/CLASSIFICATION/TRAIN_TEST_SPLIT/TRAIN_TEST_SPLIT.py
3:from sklearn.model_selection import train_test_split

AI_ML/NLP/COUNT_VECTORIZER/COUNT_VECTORIZER.py
2:from sklearn.feature_extraction.text import CountVectorizer

GENERATORS/SAMPLE_DATASETS/TEXT_DATASET/TEXT_DATASET.py
2:from sklearn.datasets import fetch_20newsgroups
3:from sklearn.utils import Bunch

Depending on the use-cases Flojoy wants to target, we might want to develop Nodes:

This RFC mainly aims at defining this second option.

Proposed scope: focus only on the minimal required steps

The minimal required steps are the following:

In scikit-learn, this scope non-exhaustively targets the following interfaces:

For a first minimal support of scikit-learn, I propose considering the following as out of scope for now:

Proposed design

Proposed metric of success

Being able to produce examples similar to the ones of scikit-learn in Flojoy, such as:

References

jackparmer commented 10 months ago

cc @dstrande @Ben-Epstein @Roulbac ☝️

dstrande commented 10 months ago

Nice very detailed @jjerphan

By DataFrame support I'm guessing you mean all the functions like .max, .pivot, .apply, etc. ? (see the sidebar here)

I also want to ask people who built the backend (like @smahmed776 ) if they think this will require major changes to the backend beyond DataContainer. Adding the ability to pass classes in Flojoy is a bit different than what we're currently doing.

jjerphan commented 10 months ago

By DataFrame support I meant supporting passing pandas.DataFrames to scikit-learn interfaces, be they free functions or instances of classes.

trbritt commented 10 months ago

Hi Julien,

I wanted to add an example that should be a good target for this integration, using an industry application we've already been contacted about: semiconductor wafer quality assessment.

What is the data: greyscale images of semiconductor wafers (resolution ~ 50x50)

What is the goal: given an input image, identify the quality assessment of the wafer

The failure types we are interested in are the following:

Given the complication of categorizing each image into any of these categories, it is a perfect test case for an ML application.

For reference to train the model, please use the dataset found here, which is a cleaned version of the data found here. I've included a brief visualization of 100 wafers from this dataset in the video below, generated from a little gist here.

Once you get a model trained to correctly identify the images in the example dataset, the functionality can then be ported to Flojoy, at which point I will have finished integrating batch processing into Flojoy.

https://github.com/flojoy-ai/studio/assets/53545754/2926791f-0fe7-488f-aa7d-06f64a51d856

jjerphan commented 10 months ago

Hi Tristan,

I have several questions:

trbritt commented 10 months ago

Hi Julien,

With regards to the example case, I do think something like an MLP / CNN would be best for that. As far as I know, the functionality in from sklearn.neural_network import MLPClassifier would provide a class that has its fit and predict methods as well that would fit into the proposed plan already, no? I think it would be good to also add this level of functionality to your plan.

Other than that, the proposed approach sounds good to me (with the addition above). The scope you've defined seems to be very nice for this first integration. I would say you can go ahead with this plan (if @Roulbac @dstrande @Ben-Epstein approve as well).

I do think it would be valuable for users to be able to input their own pre-trained models. Many industry partners have already spent massive computational resources on various models, and if they can just easily insert them into Flojoy, I think it will make our product and its functionality more attractive to potential customers.

Ben-Epstein commented 10 months ago

👋 I'll break my thoughts into a few sections

The wafer quality example

due to the nature of the problem I think a simple CNN [...] might perform way better for this case

Definitely agree, this is not a feasible use-case for sklearn in my opinion. And I don't think it aligns well with the typical use-cases of sklearn users.

Sklearn models are often

I would focus on examples that map to these criteria.

Scope and design

If I'm understanding your proposal correctly (building a node for each of the components listed under Proposed design), I don't think this is a scalable way to support sklearn. In fact, I've done something similar in the past and it's incredibly tedious, as there are loads of different models that users may want.

I would instead suggest considering a framework that has a node for

Each of those has required parameters such as

You could even extend that to have dynamic parameters that are based on the class chosen. For example

This will let you scale much more easily, both from a development perspective and from a UX perspective, as having a node per classifier in the UI might be hard to navigate.
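As a rough illustration of the "one generic node, class chosen by parameter" idea, the sketch below resolves a class from a dotted path and reads its constructor parameters so that a UI could render them dynamically. The helper names (`resolve_class`, `constructor_parameters`, `instantiate`) are hypothetical and not part of Flojoy. To stay runnable without extra dependencies, the demo uses the standard library's `textwrap.TextWrapper`; the assumption is that a scikit-learn path such as `sklearn.svm.SVC` would be handled the same way.

```python
import importlib
import inspect


def resolve_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def constructor_parameters(cls):
    """Map constructor parameter names to their defaults.

    Parameters without a default map to inspect.Parameter.empty; a UI could
    mark those as required.
    """
    signature = inspect.signature(cls)
    return {name: p.default for name, p in signature.parameters.items()}


def instantiate(dotted_path, params=None):
    """Build an instance of the chosen class from user-supplied parameters."""
    cls = resolve_class(dotted_path)
    return cls(**(params or {}))


# Demo with a standard-library class standing in for an estimator class.
chosen = "textwrap.TextWrapper"
available = constructor_parameters(resolve_class(chosen))
wrapper = instantiate(chosen, {"width": 20})
```

With this shape, adding support for another class means adding a string to a dropdown rather than writing a new node. For scikit-learn specifically, estimators also expose `get_params()`, which may be a more reliable source of parameters than `inspect`.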

Out of scope

If you want to make Pipelines out of scope, you should consider talking to your prospective audience and understanding their use-cases. For example, a very common practice is to have pipelines that employ FunctionTransformers that take arbitrary Python code and execute it over a dataframe. This is pretty valuable to ML users, but I don't know your audience.

Similarly, dataframes are pretty standard in ML over arrays/ordered pairs. They offer that necessary structure, so I'd again consider talking to your customers to get a better idea of their wants.

Pre-trained models

This is incredibly valuable and should definitely be considered. There are 2 components to this

  1. Pre-trained models that users can load in: this should be pretty simple, just have the user select (1) the framework and (2) where the model is stored, and you can load it and predict with it based on the framework
  2. Pre-trained models not from the user: I'd suggest doing this with HuggingFace Pipelines, as they are plug and play. For example, text_gen_model = pipeline("text2text-generation") will give you an LLM out of the box with a straightforward interface to make predictions. These are the available out-of-the-box pipelines from huggingface:
    ['audio-classification', 'automatic-speech-recognition', 'conversational', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-text', 'mask-generation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'video-classification', 'visual-question-answering', 'vqa', 'zero-shot-audio-classification', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-object-detection', 'translation_XX_to_YY']
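
The first component (the user selects a framework and a storage location) could be sketched as a small loader registry. Everything here is an assumption for illustration: `LOADERS`, `load_model`, and the `"pickle"` framework key are hypothetical names, and a plain dictionary of coefficients stands in for a real trained model.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical registry mapping a framework name to its loader function.
# Other frameworks would register their own loaders here.
LOADERS = {"pickle": load_pickle}


def load_model(framework, path):
    """Dispatch to the loader registered for the selected framework."""
    try:
        loader = LOADERS[framework]
    except KeyError:
        raise ValueError(f"Unsupported framework: {framework!r}") from None
    return loader(path)


# Demo: persist a stand-in "model" and load it back through the registry.
saved = {"coef": [0.5, -1.0], "intercept": 0.1}
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    model_path.write_bytes(pickle.dumps(saved))
    loaded = load_model("pickle", model_path)
```

Keeping the dispatch in a dictionary means the node's framework dropdown can be generated from `LOADERS.keys()`, and an unsupported choice fails with a clear error instead of a loader-specific exception.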

jjerphan commented 10 months ago

With regards to the example case, I do think something like an MLP / CNN would be best for that. As far as I know, the functionality in from sklearn.neural_network import MLPClassifier would provide a class that has its fit and predict methods as well that would fit into the proposed plan already, no? I think it would be good to also add this level of functionality to your plan.

Including sklearn.neural_network.MLP{Classifier,Regressor} can definitely be done at no extra cost, and I am not against it, even though scikit-learn's MLPs aren't the most flexible or performant.

What I meant is that CNNs might be better suited to such classification or regression problems, since those architectures exploit the hierarchical structure of n-d signals much more than MLPs do. I think the use-cases you have been given motivate the introduction of nodes from (or at least workflows using) deep learning frameworks. Even though this might be out of the scope of scikit-learn's support within Flojoy, some frameworks (like Keras) have APIs and UX really similar to scikit-learn's, and the work on integrating scikit-learn might help with integrating theirs.

If supporting those frameworks makes sense, we might want to open discussions for that. What do you think?

Ben-Epstein commented 10 months ago

If supporting those frameworks makes sense, we might want to open discussions for that. What do you think?

I agree it's a different topic, and one worth having. But I'd just toss in that you should strongly consider using huggingface over keras. It's a much simpler framework, and I imagine that a large percentage of use cases from customers will have pre-trained models already on the hub

jjerphan commented 10 months ago

I confirm that supporting HuggingFace's pipelines would help users solve a variety of problems that scikit-learn is not suited to.

Depending on Flojoy's vision or targeted use-cases (which I do not know entirely), scikit-learn might not be as relevant as other solutions.

Would you like to provide your users with:

Roulbac commented 10 months ago

Thank you @jjerphan for initiating this conversation, here is my feedback on the matter.

Firstly, I want to bring to everyone's attention the utility of model inference in the context of Flojoy versus model training. With a myriad of complexities around model training, Flojoy can really shine much more easily by catering to pre-trained models which users want to deploy with ease. Please bear in mind that this doesn't mean we should drop model training at all, but rather focus more energy on model inference while still catering for simpler model training scenarios.

Why Prioritize Inference over Training:

  1. Data Preparation and Model Training: The steps of data ingestion, curation, and training are non-trivial. While it's tempting to make Flojoy the all-in-one solution, catering to these processes may detract from making Flojoy truly stellar at what it's designed for - ease of deploying (AI) applications. Data preparation often happens interactively in ephemeral environments that allow the users to iterate and visualize quickly (which is why Jupyter is great at that), whereas Flojoy is really designed to build pipelines.

  2. Foundation Models: Pre-trained models, such as an image classifier that detects humans, could be more universally valuable to our users. Once loaded into Flojoy, users can fine-tune them (this would be a simple training use-case) or use them as-is for inference on new data.

  3. Back to the scope of Jupyter vs Flojoy: Jupyter is excellent for iterative data processing and exploration, and it might not be in Flojoy's best interest to replicate this interactive capability. Instead, Flojoy can prioritize seamless integration of pre-trained models, perhaps even those developed in Jupyter, for fast and efficient model deployment, on top of simple model training use-cases that users can do on Flojoy.

Feedback on the Issue:

  1. Wafer Quality Assessment: Given the intricacies of image classification, especially for semiconductor wafers, CNNs would indeed be a better-suited choice than traditional ML models. However, considering that the proposal revolves around scikit-learn's capabilities, the MLPClassifier could serve as a basic starting point. That said, I'd agree with @ben-epstein that this might not be the best use-case for scikit-learn.

  2. Incorporating Pre-trained Models: @trbritt's point on allowing users to input their own pre-trained models is important. Industry partners who have invested computational resources in training models would find this functionality invaluable. This approach aligns well with the idea of emphasizing model inference.

TL;DR There is a lot to gain in supporting pre-trained pipelines and simple model training/fine-tuning use-cases, and it would be much harder to make Flojoy a fully-fledged model training platform. This thought would be important to keep in mind while making design choices for the platform. HF pipelines are an excellent example of what Flojoy could do very well.

jjerphan commented 10 months ago

Thank you for this comprehensive comment, @Roulbac. I agree with everything you have laid out.

After identifying Flojoy's direction and relevant use-cases, I think that the support of scikit-learn (which QuantStack was contacted for) might not be as relevant (for now) as deploying models.

I propose that we open another RFC dedicated to Model Deployment within Flojoy to pursue discussions. What do you think?

jackparmer commented 10 months ago

I propose that we open another RFC dedicated to Model Deployment within Flojoy to pursue discussions

+1 I agree ☝️