jjerphan opened this issue 1 year ago
cc @dstrande @Ben-Epstein @Roulbac ☝️
Nice, very detailed @jjerphan!
By DataFrame support, I'm guessing you mean all the functions like `.max`, `.pivot`, `.apply`, etc.? (see the sidebar here)
I also want to ask people who built the backend (like @smahmed776 ) if they think this will require major changes to the backend beyond DataContainer. Adding the ability to pass classes in Flojoy is a bit different than what we're currently doing.
By DataFrame support I meant support for passing `pandas.DataFrame`s to scikit-learn interfaces, be they free functions or instances of classes.
Hi Julien,
I wanted to add an example that should be a good target for this integration, using an industry application we've already been contacted about: semiconductor wafer quality assessment.
What is the data: greyscale images of semiconductor wafers (resolution ~ 50x50)
What is the goal: given an input image, identify the quality assessment of the wafer
The failure types we are interested in are the following:
Given the difficulty of categorizing each image into any of these categories, it is a perfect test case for an ML application.
For reference to train the model, please use the dataset found here, which is a cleaned version of the data found here. I've included a brief visualization of 100 wafers from this dataset in the video below, generated from a little gist here.
Once you get a model trained to correctly identify the images in the example dataset, the functionality can then be ported to Flojoy, at which point I will have finished integrating batch processing into Flojoy.
https://github.com/flojoy-ai/studio/assets/53545754/2926791f-0fe7-488f-aa7d-06f64a51d856
Hi Tristan,
I have several questions:

- Would `Nodes` to load pre-trained models within Flojoy be valuable?

Hi Julien,
With regards to the example case, I do think something like an MLP / CNN would be best for that. As far as I know, the functionality in `from sklearn.neural_network import MLPClassifier` would provide a class that has its `fit` and `predict` methods as well, which would fit into the proposed plan already, no? I think it would be good to also add this level of functionality to your plan.
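As a minimal sketch of how `MLPClassifier` would slot into that `fit`/`predict` plan — the `images` and `labels` arrays below are random stand-ins, not the actual wafer dataset:

```python
# Minimal sketch: MLPClassifier on flattened greyscale images.
# `images` and `labels` are random placeholders for the wafer data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
images = rng.random((200, 50, 50))      # stand-in for 50x50 greyscale wafers
labels = rng.integers(0, 4, size=200)   # stand-in for the failure-type labels

# MLPClassifier expects 2D input, so each 50x50 image becomes a 2500-d vector.
X = images.reshape(len(images), -1)
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```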
Other than that, the proposed approach sounds good to me (with the addition above). The scope you've defined seems to be very nice for this first integration. I would say you can go ahead with this plan (if @Roulbac @dstrande @Ben-Epstein approve as well).
I do think it would be valuable for users to be able to input their own pre-trained models. Many industry partners have already spent massive computational resources on various models, and if they can just easily insert them into Flojoy, I think it will make our product and its functionality more attractive to potential customers.
👋 I'll break my thoughts into a few sections.
> due to the nature of the problem I think a simple CNN [...] might perform way better for this case
Definitely agree, this is not a feasible use-case for sklearn in my opinion. And I don't think it aligns well with the typical use-cases of sklearn users.
Sklearn models are often
I would focus on examples that map to these criteria.
If I'm understanding your proposal correctly (building a node for each of the components listed under Proposed design), I don't think this is a scalable way to support sklearn. In fact, I've done something similar in the past and it's incredibly tedious, as there are loads of different models that users may want.
I would instead suggest considering a framework that has a node for
Each of those has required parameters such as
You could even extend that to have dynamic parameters that are based on the class chosen (a rough sketch of the idea is included below). This will let you scale much more easily, both from a development perspective and from a UX perspective, as having a node per classifier in the UI might be hard to navigate.
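A minimal sketch of that generic-node idea, assuming nothing about Flojoy's actual `Node` API; the `classifier_node` function and the `CLASSIFIERS` registry are invented for illustration:

```python
# One generic "classifier" node instead of one node per sklearn class.
# The chosen class name and its class-specific parameters arrive as data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

CLASSIFIERS = {
    "LogisticRegression": LogisticRegression,
    "DecisionTreeClassifier": DecisionTreeClassifier,
}

def classifier_node(class_name: str, params: dict, X: np.ndarray, y: np.ndarray):
    """Instantiate the chosen classifier with its parameters and fit it."""
    estimator = CLASSIFIERS[class_name](**params)
    return estimator.fit(X, y)

# `params` is what "dynamic parameters based on the class chosen" would map to
# in the UI, e.g. `max_depth` only applies to the tree.
X, y = np.array([[0.0], [1.0], [2.0], [3.0]]), np.array([0, 0, 1, 1])
model = classifier_node("DecisionTreeClassifier", {"max_depth": 2}, X, y)
print(model.predict(X))
```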
If you want to make Pipelines out of scope, you should consider talking to your prospective audience and understanding their use-cases. For example, a very common practice is to have pipelines that employ `FunctionTransformer`s that take arbitrary Python code and execute it over a dataframe. This is pretty valuable to ML users, but I don't know your audience.
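For concreteness, a minimal sketch of that `FunctionTransformer` pattern; the column names and the `add_ratio` helper are made up for the example:

```python
# A Pipeline whose first step runs arbitrary Python code over a DataFrame.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_ratio(df: pd.DataFrame) -> pd.DataFrame:
    # Arbitrary user-defined feature engineering over the DataFrame.
    out = df.copy()
    out["ratio"] = out["a"] / (out["b"] + 1e-9)
    return out

pipe = Pipeline([
    ("features", FunctionTransformer(add_ratio)),
    ("clf", LogisticRegression()),
])

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0]})
y = [0, 0, 1, 1]
pipe.fit(df, y)
print(pipe.predict(df))
```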
Similarly, dataframes are pretty standard in ML over arrays/ordered pairs. They offer that necessary structure, so I'd again consider talking to your customers to get a better idea of their wants.
This is incredibly valuable and should definitely be considered. There are 2 components to this.

huggingface `Pipeline`s are plug and play: for example, `text_gen_model = pipeline("text2text-generation")` will give you an LLM out of the box with a straightforward interface to make predictions (see the short usage sketch after the list below). These are the available out-of-the-box pipelines from huggingface:
['audio-classification', 'automatic-speech-recognition', 'conversational', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-text', 'mask-generation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'video-classification', 'visual-question-answering', 'vqa', 'zero-shot-audio-classification', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-object-detection', 'translation_XX_to_YY']
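A usage sketch of that interface, assuming the `transformers` package is installed; the default model is downloaded on first use, so this needs network access, and the prompt is just an illustrative input:

```python
# Out-of-the-box huggingface pipeline: no model selection or training needed.
from transformers import pipeline

text_gen_model = pipeline("text2text-generation")
result = text_gen_model("Translate English to German: Where is the train station?")
print(result)  # e.g. a list like [{"generated_text": "..."}]
```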
> With regards to the example case, I do think something like an MLP / CNN would be best for that. As far as I know, the functionality in `from sklearn.neural_network import MLPClassifier` would provide a class that has its `fit` and `predict` methods as well, which would fit into the proposed plan already, no? I think it would be good to also add this level of functionality to your plan.
Including `sklearn.neural_network.MLP{Classifier,Regressor}` can definitely be added without any additional cost, and I am not against that, even though scikit-learn's MLPs aren't the most flexible or performant.
What I meant is that providing CNNs might be more adapted for classification or regression problems, since those architectures make use of the hierarchical structure of n-d signals much more than MLPs do. I think the use-cases you have been provided with motivate the introduction of nodes from (or at least workflows using) deep learning frameworks. Even though this might be out of the scope of scikit-learn support within Flojoy, some frameworks (like Keras) have APIs and UX really similar to scikit-learn's, and the work on integrating scikit-learn might help the work of integrating them.
If supporting those frameworks makes sense, we might want to open discussions for that. What do you think?
> If supporting those frameworks makes sense, we might want to open discussions for that. What do you think?
I agree it's a different topic, and a discussion worth having. But I'd just toss in that you should strongly consider using huggingface over keras. It's a much simpler framework, and I imagine that a large percentage of customer use cases will have pre-trained models already on the hub.
I confirm that supporting HuggingFace's pipelines would help users solve a variety of problems that scikit-learn is not suited for.
Depending on Flojoy's vision or targeted use-cases (which I do not know entirely), scikit-learn might not be as relevant as other solutions.
Would you like to provide your users with:
Thank you @jjerphan for initiating this conversation, here is my feedback on the matter.
Firstly, I want to bring to everyone's attention the utility of model inference in the context of Flojoy versus model training. With a myriad of complexities around model training, Flojoy can really shine much more easily by catering to pre-trained models which users want to deploy with ease. Please bear in mind that this doesn't mean we should drop model training altogether, but rather that we should focus more energy on model inference while still catering to simpler model training scenarios.
Why Prioritize Inference over Training:
Data Preparation and Model Training: The steps of data ingestion, curation, and training are non-trivial. While it's tempting to make Flojoy the all-in-one solution, catering to these processes may detract from making Flojoy truly stellar at what it's designed for - ease of deploying (AI) applications. Data preparation often happens interactively in ephemeral environments that allow the users to iterate and visualize quickly (which is why Jupyter is great at that), whereas Flojoy is really designed to build pipelines.
Foundation Models: Pre-trained models, such as an image classifier that detects humans, could be more universally valuable to our users. Once loaded into Flojoy, users can fine-tune them (this would be a simple training use-case) or use them as-is for inference on new data.
Back to the scope of Jupyter vs Flojoy: Jupyter is excellent for iterative data processing and exploration, and it might not be in Flojoy's best interest to replicate this interactive capability. Instead, Flojoy can prioritize seamless integration of pre-trained models, perhaps even those developed in Jupyter, for fast and efficient model deployment, on top of simple model training use-cases that users can do on Flojoy.
Feedback on the Issue:
Wafer Quality Assessment: Given the intricacies of image classification, especially for semiconductor wafers, CNNs would indeed be a more suited choice over traditional ML models. However, considering that the proposal revolves around scikit-learn's capabilities, the `MLPClassifier` could serve as a basic starting point. That said, I'd agree with @ben-epstein that this might not be the best use-case for scikit-learn.
Incorporating Pre-trained Models: @trbritt's point on allowing users to input their own pre-trained models is important. Industry partners who have invested computational resources in training models would find this functionality invaluable. This approach aligns well with the idea of emphasizing model inference.
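As a minimal sketch of what that inference-only flow could look like for a scikit-learn model serialized by a user (`joblib` being a common way to persist scikit-learn estimators); the file name and input array are placeholders:

```python
# Load a model trained elsewhere (e.g. in Jupyter) and run inference only.
import joblib
import numpy as np

model = joblib.load("pretrained_model.joblib")  # placeholder path to a user's artifact
X_new = np.array([[0.1, 0.2, 0.3]])             # placeholder for new measurements
print(model.predict(X_new))
```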
TL;DR: There is a lot to gain in supporting pre-trained pipelines and simple model training/fine-tuning use-cases, and it would be much harder to make Flojoy a fully-fledged model training platform. This is important to keep in mind while making design choices for the platform. HF pipelines are an excellent example of what Flojoy could do very well.
Thank you for this comprehensive comment, @Roulbac. I agree with everything you have laid out.
After identifying Flojoy's direction and relevant use-cases, I think that the support of scikit-learn (which QuantStack was contacted for) might not be as relevant (for now) as deploying models.
I propose that we open another RFC dedicated to Model Deployment within Flojoy to pursue discussions. What do you think?
> I propose that we open another RFC dedicated to Model Deployment within Flojoy to pursue discussions
+1 I agree ☝️
Context: scikit-learn's usage and specificities
While the current `Nodes` and `DataContainers` are sufficient for most libraries like SciPy and NumPy, which can entirely be used with free functions, other libraries — like scikit-learn — have other workflows relying on stateful instances of the classes they define.

In the case of scikit-learn:
- Its public API mainly revolves around `Estimators` (i.e. generally `Regressors`, `Classifiers` and `Transformers`) and a few methods on those instances (basically `fit`, `predict`, `predict_proba`, `score`, `score_samples`).
- `Estimators` can be composed in `sklearn.Pipeline`s, themselves being `sklearn.MetaEstimator`s.
- `Estimators` accept and return NumPy arrays and common Python objects (`int`, `float`, `str`, `dict`, `list`, `tuple`). As of 1.2, scikit-learn has extended support for `pandas.DataFrame` (pandas is not a dependency of scikit-learn).
- Once `Estimators` are fit, public fitted attributes (parts of those instances' states) can be accessed to obtain relevant information (illustrated in the sketch at the end of this section):
  - for some `Estimators`, public fitted attributes' access is useful (it provides additional information) but is not strictly required;
  - for other `Estimators`, public fitted attributes' access is the goal of having fit them and thus is required.
- Some `Estimators` have specific public methods (e.g. `cost_complexity_pruning_path` for `sklearn.tree.DecisionTreeClassifier`). Those are defined either in final classes or in common mixin or base classes.

There are already some existing nodes under `AI_ML` and `GENERATORS` that use scikit-learn under the hood, such as:

Depending on the use-cases Flojoy wants to target, we might want to develop `Nodes`:

This RFC mainly aims at defining this second option.
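To illustrate the stateful workflow described above, here is a minimal sketch of fitting an estimator and reading its public fitted attributes, using `LinearRegression` as an arbitrary example:

```python
# Fit an estimator, then read its public fitted attributes (suffixed with "_").
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # fitted attributes, part of the instance's state
print(reg.predict([[4.0]]))        # predictions rely on that same fitted state
```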
Proposed scope: focus only on the minimal required steps
The minimal required steps are the following (a short sketch follows the list of interfaces below):

- Load or create a dataset: `X`, `y`, two NumPy arrays.
- Split the dataset:
  - without model selection utilities, `X` and `y` get split as:
    - `X_train` and `y_train`: to fit an estimator
    - `X_val` and `y_val`: to evaluate an estimator's performance during the model selection
    - `X_test` and `y_test`: to evaluate the final chosen estimator's performance
  - model selection utilities (e.g. `sklearn.model_selection.GridSearchCV`) generally take care of training and validation, so `X` and `y` get split as:
    - `X_train` and `y_train`: to fit estimators and evaluate their performance during the model selection (they are further split in the process)
    - `X_test` and `y_test`: to evaluate the final chosen model's performance
- Fit the estimator (a `MetaEstimator` if model selection is used).

In scikit-learn, this scope non-exhaustively targets the following interfaces:
- `sklearn.model_selection.train_test_split`
- `sklearn.datasets.make_blobs`
- `sklearn.datasets.make_classification`
- `sklearn.datasets.make_regression`
- `sklearn.datasets.load_iris`
- `sklearn.preprocessing.StandardScaler`
- `sklearn.preprocessing.OneHotEncoder`
- `sklearn.model_selection.GridSearchCV`
- `sklearn.linear_model.LinearRegression`
- `sklearn.tree.DecisionTreeRegressor`
- `sklearn.linear_model.LogisticRegression`
- `sklearn.tree.DecisionTreeClassifier`
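A minimal sketch of the steps above using only the listed interfaces; the hyperparameter grid is arbitrary:

```python
# Create a dataset, split it, select a model with GridSearchCV,
# and evaluate the chosen estimator on the held-out test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GridSearchCV handles the training/validation splitting internally.
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]})
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # final evaluation on the test split
```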
For a first minimal support of scikit-learn, I propose considering the following as out of scope for now:

- `sklearn.Pipeline`
- `pandas.DataFrame` within scikit-learn

Proposed design
- New `DataContainers` specifically for most of scikit-learn's `Estimator` and `Transformer` classes, so that they can be passed as `Nodes`' inputs and returned as `DataContainers`.
- `Nodes` to load or create datasets:
  - bundled datasets (`sklearn.datasets.load_*`)
  - synthetic datasets (`sklearn.datasets.make_*`)
  - datasets read from files (`pandas.read_csv`)
- `Nodes` for the main methods (sketched below): `fit`, `predict`, `predict_proba`, `score`, `score_samples`. These return `DataContainers`, `OrderedPair` generally.
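To make the decomposition concrete, here is a framework-agnostic sketch of the per-method `Nodes`; it deliberately does not use Flojoy's actual `Node`/`DataContainer` API, and the `*_node` function names are invented for illustration:

```python
# Thin wrappers around an estimator's main methods: the fitted estimator
# is passed along between steps, mirroring how a DataContainer would carry it.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_node(estimator, X, y):
    """'fit' node: returns the fitted estimator for downstream nodes."""
    return estimator.fit(X, y)

def predict_node(estimator, X):
    """'predict' node: returns (X, y_pred), akin to an OrderedPair."""
    return X, estimator.predict(X)

def score_node(estimator, X, y):
    """'score' node: returns a scalar metric."""
    return estimator.score(X, y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = fit_node(LogisticRegression(), X, y)
print(predict_node(model, X))
print(score_node(model, X, y))
```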
Proposed metric of success
Being able to produce, in Flojoy, examples similar to scikit-learn's, such as:
References