stevespringett closed this issue 1 year ago.
@stevespringett could you add a glossary? What are "ML", "BIML", and such? Maybe you could edit the initial comment and add links to the terms.
I think we should also consider training data sets as components of an ML model.
Training sets, data/model licenses, relevant metrics, external references to the training sets, related artifacts, model cards, etc., and a way to specify relationships to the software components involved in the training environment would definitely be important. Other attributes may be domain-specific. For example, in NLP, the language the model is expected to work on is very important; for deep learning models, the architecture is very important.
A good exercise would be to look at some existing model stores and model card projects, examine the metadata they capture, and see what fits and what is missing.
Some relevant links:
- https://huggingface.co/docs/hub/model-repos
- https://github.com/mlflow/mlflow/blob/master/mlflow/store/model_registry/dbmodels/models.py
- https://modelcards.withgoogle.com/model-reports
- https://github.com/google/ml-metadata
I suspect we would also need a component type of dataset to fully describe a model.
Of all of these, as an MLE I've leaned towards MLflow in the past because it provides for both model feature/parameter and hyper-parameter tagging. Hyper-parameter configurations are external to the model and cannot be estimated from the data. If I'm tracking both my model performance and my real-time cloud compute costs, that's the configuration I need to do that.
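To illustrate the distinction above, a run-tracking record might separate learned model parameters from externally chosen hyper-parameters and operational metrics. A minimal stdlib sketch (all names and values here are hypothetical), with the corresponding MLflow calls noted in comments:

```python
import json

# Hypothetical run record separating what a BOM might need to capture:
# model parameters derived from the data vs. hyper-parameters chosen by
# the practitioner vs. operational metrics such as cloud compute cost.
run_record = {
    "model": "sentiment-classifier",   # assumed model name
    "params": {"vocab_size": 30000},   # derived from the training data
    "hyperparams": {                   # set externally; cannot be
        "learning_rate": 0.001,        # estimated from the data
        "batch_size": 64,
    },
    "metrics": {
        "f1": 0.91,                    # model performance
        "cloud_cost_usd": 12.40,       # real-time compute cost
    },
}

# In MLflow, params and hyperparams would be logged with
# mlflow.log_param(...) and metrics with mlflow.log_metric(...).
serialized = json.dumps(run_record, indent=2)
```

Keeping hyper-parameters distinct from data-derived parameters is what makes the record reproducible: the hyper-parameters plus the training set are sufficient to re-run the experiment.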
I think ML support in CDX will be critical in the near future.
Although this bill was just introduced and may or may not pass, there seems to be a clear need for increased transparency into these algorithms. https://www.wyden.senate.gov/news/press-releases/wyden-booker-and-clarke-introduce-algorithmic-accountability-act-of-2022-to-require-new-transparency-and-accountability-for-automated-decision-systems
Datasets and their provenance are a confirmed use case that needs to be addressed. Datasets also have licenses: some are "free", others are commercial, etc. So datasets themselves should reuse the existing license support.
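As a sketch of how that reuse could look, a dataset component might carry the same `licenses` and `externalReferences` structures CycloneDX already defines for software components. The `"data"` component type, names, and URL below are assumptions for illustration, not the final schema:

```python
import json

# Illustrative dataset-as-component fragment reusing CycloneDX's existing
# license support. The "data" type and field layout are assumptions.
dataset_component = {
    "type": "data",                          # hypothetical component type
    "name": "imdb-reviews",                  # assumed dataset name
    "version": "1.0",
    "licenses": [
        {"license": {"id": "CC-BY-SA-4.0"}}  # SPDX id, exactly as for
    ],                                       # software components
    "externalReferences": [
        {
            "type": "distribution",
            "url": "https://example.com/datasets/imdb-reviews",  # placeholder
        }
    ],
}

bom_fragment = json.dumps({"components": [dataset_component]}, indent=2)
```

Provenance could then be expressed through the same external-reference and relationship mechanisms used elsewhere in a BOM, rather than inventing a parallel structure for data.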
This might provide a good starting point. https://www.gov.uk/government/collections/algorithmic-transparency-standard
A related threat modeling framework for ML: https://plot4.ai/
This is more future-facing, but the IBM AI Factsheets are currently one of the more practical implementations of model fact sheets.
Update: An updated modelCard view is available in the 1.5 workstreams repo. The data card view will likely tie into data, a new top-level property in a BOM supporting low-code/no-code apps, among other things.
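A rough sketch of what a model card attached to an ML component could carry, based on the attributes raised earlier in this thread (architecture, task and language, metrics, training datasets). The field names below are illustrative, not the final CycloneDX schema:

```python
import json

# Illustrative modelCard payload. Field names are assumptions drawn from
# common model-card conventions, not the finalized specification.
model_card = {
    "modelParameters": {
        "task": "text-classification",          # assumed NLP task
        "architectureFamily": "transformer",    # deep-learning architecture
        "datasets": [{"ref": "imdb-reviews"}],  # link to training data
    },
    "quantitativeAnalysis": {
        "performanceMetrics": [
            {"type": "F1", "value": "0.91"}     # relevant metric
        ]
    },
    "considerations": {
        # Captures domain constraints such as the language the model
        # is expected to work on.
        "useCases": ["English-language sentiment analysis"]
    },
}

ml_component = {
    "type": "machine-learning-model",  # hypothetical component type
    "name": "sentiment-classifier",
    "modelCard": model_card,
}
serialized = json.dumps(ml_component, indent=2)
```

Linking `datasets` by reference rather than inlining them would let a dataset component (with its own license and provenance) be shared between the model card and the rest of the BOM inventory.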
ML models could theoretically be represented in the inventory. ML is often abstracted behind a service, making it easier to consume, but if you wanted to describe the models themselves, I think there may be a way to achieve this.
The thought is to support the BIML Taxonomy of ML attacks, which has the following categories:
Ideally, CycloneDX support for ML should not only describe ML models, but should also be able to communicate potential or confirmed risk within this taxonomy.
Glossary