stevespringett closed this issue 1 year ago.
@stevespringett could you add a glossary? What are "ML", "BIML", and such? Maybe you could edit the initial comment and add links to the terms.
I think we should also consider training data sets as components of an ML model.
Training sets, data/model licenses, relevant metrics, external references to the training sets, related artifacts, model cards, etc., and a way to specify relationships to the software components involved in the training environment would definitely be important. Other attributes may be domain-specific. For example, in NLP, the language the model is expected to work on is very important; for deep learning models, the architecture is very important.
A good exercise would be to look at some existing model stores and model card projects, examine the metadata they capture, and see what fits and what is missing.
Some relevant links:
- https://huggingface.co/docs/hub/model-repos
- https://github.com/mlflow/mlflow/blob/master/mlflow/store/model_registry/dbmodels/models.py
- https://modelcards.withgoogle.com/model-reports
- https://github.com/google/ml-metadata
I suspect we would also need a component type of dataset to fully describe a model.
Of all of these, as an MLE I've leaned towards MLflow in the past because it provides for both model feature/parameter and hyper-parameter tagging. Hyper-parameter configurations are external to the model and cannot be estimated from the data. If I'm tracking both my model performance and my real-time cloud compute costs, that's the configuration I need to do that.
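To illustrate the distinction above, a run-tracking record might separate learned model parameters from externally chosen hyper-parameters and operational metrics. A minimal stdlib sketch (all names and values here are hypothetical), with the corresponding MLflow calls noted in comments:

```python
import json

# Hypothetical run record separating what a BOM might need to capture:
# model parameters derived from the data vs. hyper-parameters chosen by
# the practitioner vs. operational metrics such as cloud compute cost.
run_record = {
    "model": "sentiment-classifier",   # assumed model name
    "params": {"vocab_size": 30000},   # derived from the training data
    "hyperparams": {                   # set externally; cannot be
        "learning_rate": 0.001,        # estimated from the data
        "batch_size": 64,
    },
    "metrics": {
        "f1": 0.91,                    # model performance
        "cloud_cost_usd": 12.40,       # real-time compute cost
    },
}

# In MLflow, params and hyperparams would be logged with
# mlflow.log_param(...) and metrics with mlflow.log_metric(...).
serialized = json.dumps(run_record, indent=2)
```

Keeping hyper-parameters distinct from data-derived parameters is what makes the record reproducible: the hyper-parameters plus the training set are sufficient to re-run the experiment.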
I think ML support in CDX will be critical in the near future.
Although this bill was just introduced and may or may not pass, there seems to be a clear need for increased transparency into these algorithms. https://www.wyden.senate.gov/news/press-releases/wyden-booker-and-clarke-introduce-algorithmic-accountability-act-of-2022-to-require-new-transparency-and-accountability-for-automated-decision-systems
Datasets and their provenance are a confirmed use case that needs to be addressed. Datasets also have licenses: some are "free", others are commercial, etc. So datasets themselves should reuse the existing license support.
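As a sketch of how that reuse could look, a dataset component might carry the same `licenses` and `externalReferences` structures CycloneDX already defines for software components. The `"data"` component type, names, and URL below are assumptions for illustration, not the final schema:

```python
import json

# Illustrative dataset-as-component fragment reusing CycloneDX's existing
# license support. The "data" type and field layout are assumptions.
dataset_component = {
    "type": "data",                          # hypothetical component type
    "name": "imdb-reviews",                  # assumed dataset name
    "version": "1.0",
    "licenses": [
        {"license": {"id": "CC-BY-SA-4.0"}}  # SPDX id, exactly as for
    ],                                       # software components
    "externalReferences": [
        {
            "type": "distribution",
            "url": "https://example.com/datasets/imdb-reviews",  # placeholder
        }
    ],
}

bom_fragment = json.dumps({"components": [dataset_component]}, indent=2)
```

Provenance could then be expressed through the same external-reference and relationship mechanisms used elsewhere in a BOM, rather than inventing a parallel structure for data.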
This might provide a good starting point. https://www.gov.uk/government/collections/algorithmic-transparency-standard
A related threat modeling framework for ML: https://plot4.ai/
This is more future-facing, but the IBM AI Factsheets are currently one of the more practical implementations of model fact sheets.
Update: An updated modelCard view is available in the 1.5 workstreams repo. The data card view will likely tie into data, a new top-level property in a BOM supporting low-code/no-code apps, among other things.
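A rough sketch of what a model card attached to an ML component could carry, based on the attributes raised earlier in this thread (architecture, task and language, metrics, training datasets). The field names below are illustrative, not the final CycloneDX schema:

```python
import json

# Illustrative modelCard payload. Field names are assumptions drawn from
# common model-card conventions, not the finalized specification.
model_card = {
    "modelParameters": {
        "task": "text-classification",          # assumed NLP task
        "architectureFamily": "transformer",    # deep-learning architecture
        "datasets": [{"ref": "imdb-reviews"}],  # link to training data
    },
    "quantitativeAnalysis": {
        "performanceMetrics": [
            {"type": "F1", "value": "0.91"}     # relevant metric
        ]
    },
    "considerations": {
        # Captures domain constraints such as the language the model
        # is expected to work on.
        "useCases": ["English-language sentiment analysis"]
    },
}

ml_component = {
    "type": "machine-learning-model",  # hypothetical component type
    "name": "sentiment-classifier",
    "modelCard": model_card,
}
serialized = json.dumps(ml_component, indent=2)
```

Linking `datasets` by reference rather than inlining them would let a dataset component (with its own license and provenance) be shared between the model card and the rest of the BOM inventory.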
ML models could theoretically be represented in the inventory. ML is often abstracted behind a service, making it easier to consume, but if you wanted to describe the models themselves, I think there may be a way to achieve this.
The thought is to support the BIML Taxonomy of ML attacks, which has the following categories:
Ideally, CycloneDX support for ML should not only describe ML models, but should also be able to communicate potential or confirmed risk within this taxonomy.
Glossary