New file format for machine learning models

paulmueller commented 4 years ago

I propose a common file format for machine learning (ML) models (e.g. pre-trained neural nets for blood count). The idea is to train models using a simple Python script or with AIDeveloper and then load the model in dclab/Shape-Out/Shape-In for classification/filtering/sorting.

The file format should:

be able to contain multiple models
be human readable (except for the actual ML models)

I propose:

a zip file containing model files and metadata files in the JSON format
the .modc extension

A .modc file should contain:

the models (e.g. protocol buffer .pb files from tensorflow)
- when a model is applied to an event it outputs a number between 0 and 1 to indicate a probability of a classification match (1 is 100% probable)
- [EDIT by @phidahl] it should also be possible to define one model that can classify an event. In that case, the ml_score_??? features should sum up to a value <=1 where the remainder could be considered uncertainty or so. This needs to be discussed and best practice in current ML applications should be taken into account.
- [EDIT] The serialized model file should also contain the steps for pre- or post-processing. For tensorflow, this can be achieved as described in this tutorial: https://sayak.dev/tf.keras/preprocessing/2020/04/13/embedding-image-preprocessing-functions.html#Step-3:-SavedModel-plunge
one meta data file per model with
- a list of python libraries required to run the model (e.g. tensorflow) [possible security issues here]
- a version number (e.g. a date when it was created for reproducible research)
- sha256 hash and name of the model file
- name of the ml_score feature(s) the model provides
- the features used as input for the model
- human readable names of the features (e.g. Red Blood Cells for ml_score_rbc)
~recipes to compute additional features/inputs [possible security issues here]~ -> should be covered by serialized preprocessing
a readme file that describes what the model does
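
A minimal sketch of how such an archive could be assembled with the Python standard library. The file names, the JSON keys, and the placeholder values are all hypothetical; nothing here is a fixed part of the proposal:

```python
import hashlib
import io
import json
import zipfile


def build_modc(model_name, model_bytes, readme_text):
    """Return the bytes of a zip archive laid out like the list above.

    All file names and JSON keys are hypothetical placeholders.
    """
    metadata = {
        "requirements": ["tensorflow"],   # libraries needed to run the model
        "version": "2020-05-27",          # placeholder, e.g. the creation date
        "model file": model_name,
        "sha256": hashlib.sha256(model_bytes).hexdigest(),
        "output features": ["ml_score_rbc"],
        "output labels": {"ml_score_rbc": "Red Blood Cells"},
        "input features": ["image"],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(model_name, model_bytes)
        # indent=2 keeps the metadata human readable
        archive.writestr(model_name + ".json", json.dumps(metadata, indent=2))
        archive.writestr("README.txt", readme_text)
    return buffer.getvalue()


# Placeholder bytes stand in for a real .pb file.
modc_bytes = build_modc("blood_model.pb", b"placeholder model bytes",
                        "Classifies red blood cells.")
with zipfile.ZipFile(io.BytesIO(modc_bytes)) as archive:
    names = sorted(archive.namelist())
    meta = json.loads(archive.read("blood_model.pb.json"))
print(names)
print(meta["output labels"])
```

Because the container is a plain zip file, the metadata can be inspected with any archive tool and text editor.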

@phidahl Since our current analysis eco-system is Python-only, is this idea compatible with Shape-In (C++)? We should probably stick to the tensorflow .pb file format. What are the restrictions?

phidahl commented 4 years ago

Hi Paul,

I agree with most of the points.

The results, however, could also be a vector of N numbers between 0 and 1 (should the sum be 1.0 or < 1?) giving the probabilities that the given event belongs to each of these N classes. For blood count there will be more than 4 classes, and it's more straightforward to implement this classification in one model. We should think about how to handle uncertainty here: What if an event does not belong to either of the given classes?

For ShapeIn, the idea is to use the OpenCV implementation to read and apply the models, since it is very fast and uses only the CPU. According to the documentation, it can read .pb or .pbtxt files (https://docs.opencv.org/4.3.0/d6/d0f/group__dnn.html#gac9b3890caab2f84790a17b306f36bd57). It's been a while since I tested this, and this part of OpenCV is under active development, so I'd have to repeat the test with a reference .pb file.

What I remember about TensorFlow is that for training one is restricted to CPUs or NVIDIA GPUs (bad for Mac computers). But there might be OpenCL-based ways around this problem by now.

paulmueller commented 4 years ago

Thanks for the input. Yes, having one model with multiple outputs should also be supported. I will add it to the list. I think a normalization across classes is interesting - but it does not really allow a comparison across events, right?
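
One possible reading of this concern, sketched with the standard library only: a softmax output depends only on the differences between the raw model outputs (logits), so two events with very different absolute outputs can receive identical normalized scores, while independent per-class sigmoid scores stay distinguishable. The logit values below are made up for illustration:

```python
import math


def sigmoid_scores(logits):
    """Independent score per class, no normalization across classes."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax_scores(logits):
    """Scores normalized across classes so that they sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical events: same logit difference, different magnitude.
weak_event = [1.0, 0.0]
strong_event = [11.0, 10.0]

weak_softmax = softmax_scores(weak_event)
strong_softmax = softmax_scores(strong_event)
weak_sigmoid = sigmoid_scores(weak_event)
strong_sigmoid = sigmoid_scores(strong_event)

print(weak_softmax, strong_softmax)   # the same up to rounding
print(weak_sigmoid, strong_sigmoid)   # clearly different
```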

paulmueller commented 4 years ago

I think we should not worry about training speed right now. This can be done on dedicated hardware or by people with a lot of time and AIDeveloper. We still have to figure out which format to promote: onnx, pytorch, tensorflow -- whichever one has the best support across Python and C++ libraries.

maikherbig commented 4 years ago

Concerning ShapeIn and Out:

Keras model files (saved automatically in AID) contain information like

Backend (tensorflow, theano,...)
Keras_version
model_config (model architecture)
training_config (Info about parameters during training: loss function; learning rate...)
Optimizer weights (useful when you want to continue training at later time point)

Loading a keras model file using Python is very easy and also supported by AID. From there you could even continue training, do predictions on another .rtdc dataset, convert to a .pb file... But, I could not find anything that would allow loading a keras model using C++. The tensorflow .pb format might be better suited in this case.

TensorFlow .pb files contain less information, only enough for inference. OpenCV (C++) supports loading .pb models and performing inference with them. Martin N. already tried that, but it did not seem to work yet. That was a while ago, and the task is probably easier by now... As TensorFlow (C++) has been around for a few years, it should be stable by now. PyTorch, in contrast, is rather new. In the latest release they highlight that the optimizers in C++ and Python now behave identically, whereas they deviated before. I was surprised that they were not the same in the first place, as PyTorch runs on a C++ backend. Who knows what else differs?

There are already so many different model formats out there, and I would not suggest reinventing the wheel by introducing yet another one. I like the idea of a zip-based format that contains a model (keras or .pb?) and some meta information. In AID, an .xlsx file is saved automatically, containing all information necessary to run the model and even to reproduce the model training. Everyone can open these Excel files to check parameters or whatever. Opening a JSON file might not be that straightforward. Model files like keras models or .pb files can also be opened and viewed, for example using Netron (https://github.com/lutzroeder/netron). I think that, to increase transparency and user-friendliness, one should stick to common formats for which readers are available.

I attached a zip folder containing a keras model (for blood cell classification), the same model as .pb and the meta file (.xlsx) that was created during training.

Concerning ShapeIn

While the dnn module of OpenCV supports both CPU and (NVIDIA) GPU, I would not worry about GPU support yet. If the model is supposed to be used in ShapeIn in real time, it has to be rather small. For small models, the heavy parallelism of a GPU cannot be exploited. Maybe you even want to process one image at a time (sorting); then it becomes even harder to gain an advantage on the GPU. GPU inference can even be slower due to the data transfer time to the GPU and CUDA overhead.

Concerning ShapeOut

Here, one can predict large batches of images at once and there is no time limit. Hence, also larger models are feasible and GPU power can be advantageous. For AIDeveloper, I used tensorflow-gpu together with PyInstaller to create a standalone that can automatically detect many NVIDIA GPUs. Users don't even need to install CUDA. If there is no GPU, CPU implementations are used automatically (this is all handled by tensorflow).

What if an event does not belong to either of the given classes?

This is a very interesting question without an easy solution. For classification, the final activation function for prediction is typically a softmax. Softmax gives you a probability between 0 and 1 for each class, and the sum over all classes has to be 1. Hence, even if you forward an image that is very different from what you trained on, each class will get some probability. I like the idea of including a "rubbish" class during training, which is trained on arbitrary images. If the final model is then presented with a strange image, it might give some probability to that "rubbish" class.
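
A small sketch of how a "rubbish" class could be used at prediction time. The class names and logit values are hypothetical, and a real model would produce the logits itself:

```python
import math

# Hypothetical class list; the last entry is the "rubbish" class.
CLASS_NAMES = ["rbc", "wbc", "platelet", "rubbish"]


def softmax(logits):
    """Turn raw model outputs into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the most probable class name and all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs


# Made-up logits for a strange image: the rubbish output dominates.
label, probs = classify([0.2, 0.1, 0.0, 3.0])
print(label, [round(p, 3) for p in probs])
```

Events whose most probable class is "rubbish" could then be excluded from the blood count instead of being forced into one of the real classes.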

Blood_Model.zip

paulmueller commented 4 years ago

@maikherbig Thanks for the insight. I guess it's gonna be .pb then. BTW we are not reinventing the wheel. We only need additional metadata about the model and the features used, and possibly additional Python scripts (as listed above). Even if the .pb file format supports all of these things, I would also like to hash the .pb file to verify it's the correct version. JSON files are human-readable (with proper line breaks and indentation of course).
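
A hedged sketch of the hash check mentioned above, using only the standard library. The metadata keys ("model file", "sha256") and the file names are hypothetical placeholders:

```python
import hashlib
import io
import json
import zipfile


def verify_model(archive, meta_name):
    """Return True if the model named in the metadata matches its sha256."""
    meta = json.loads(archive.read(meta_name))
    digest = hashlib.sha256(archive.read(meta["model file"])).hexdigest()
    return digest == meta["sha256"]


def make_archive(stored_bytes, hashed_bytes):
    """Build an in-memory test archive (stored and hashed bytes may differ)."""
    meta = {"model file": "model.pb",
            "sha256": hashlib.sha256(hashed_bytes).hexdigest()}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("model.pb", stored_bytes)
        archive.writestr("model.json", json.dumps(meta, indent=2))
    return io.BytesIO(buffer.getvalue())


with zipfile.ZipFile(make_archive(b"model v1", b"model v1")) as archive:
    intact_ok = verify_model(archive, "model.json")
with zipfile.ZipFile(make_archive(b"model v2", b"model v1")) as archive:
    swapped_ok = verify_model(archive, "model.json")
print(intact_ok, swapped_ok)
```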

I will post a mock-up of a .modc file as soon as I have time to get to it.

maikherbig commented 4 years ago

A few more thoughts.

json vs xlsx

To get a model, you train it sometimes for thousands of iterations. Each time, a few metrics are saved (accuracy, f1,..). Hence, the result is a table. That information belongs to the meta-information of the model. I find it convenient to save (small) tables as excel file as everybody can open and work with them immediately. Is there a particular advantage of JSON for such relatively small tables?

TensorFlow vs OpenCV's dnn module

I suppose you are still trying to minimize the size of ShapeOut. Therefore, I would like to point out that TensorFlow has actually a footprint of approx. 460MB. Maybe a more memory efficient solution would be to go for OpenCV's dnn module, which was already suggested to be used in ShapeIn. Using the same tool in ShapeIn and ShapeOut might also make it easier to synchronize.

paulmueller commented 4 years ago

JSON is actually not really good for tables. Here, I would prefer tsv. I agree that the evolution of the model is important information (and so would be the training data - are you storing hashes of the raw training data?). I would suggest a separate folder "supplements" in the zip file for such information. Since there will not be strict rules for supplements, I think xls is fine there.
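
As a rough illustration, a training history could be stored as a tab-separated file inside such a folder. The file name, the column names, and the metric values below are made up:

```python
import csv
import io
import zipfile

# Made-up training history: one row per saved iteration.
history = [
    {"iteration": 1, "accuracy": 0.7, "f1": 0.6},
    {"iteration": 2, "accuracy": 0.8, "f1": 0.75},
]

# Serialize the table as TSV.
table = io.StringIO()
writer = csv.DictWriter(table, fieldnames=["iteration", "accuracy", "f1"],
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(history)

# Store it under a hypothetical "supplements" folder in the zip file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("supplements/training_history.tsv", table.getvalue())

# Read it back to show that the table survives the round trip.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    text = archive.read("supplements/training_history.tsv").decode("utf-8")
rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
print(text)
```

A TSV file like this opens directly in spreadsheet software and stays diff-friendly in version control.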

OpenCV would indeed be a viable option, because nowadays they have wheels for all platforms.

maikherbig commented 4 years ago

Yes, the meta-files of AID also contain the hash (have a look at the .zip I sent earlier). This is very useful when loading such a meta file to restore a session in AID. If the path to the files has changed, the hash is used to find the files again. Loading such a session restores the table of data used, as shown in the screenshot (LoadSession). This means that the restored information includes which files are used and how many images from each file are used per training iteration.
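
The idea of locating a moved file by its hash can be sketched as follows. This is not AID's implementation; how AID computes and matches its hashes is not stated here, so sha256 over the full file content is an assumption:

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files do not fill the memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_by_hash(search_dir, wanted_hash):
    """Return the first file below search_dir with the given hash, or None."""
    for root, _dirs, files in os.walk(search_dir):
        for name in files:
            path = os.path.join(root, name)
            if sha256_of_file(path) == wanted_hash:
                return path
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a data file that was moved to another folder.
    moved_dir = os.path.join(tmp, "moved")
    os.makedirs(moved_dir)
    with open(os.path.join(moved_dir, "measurement.rtdc"), "wb") as handle:
        handle.write(b"placeholder data")
    recorded_hash = hashlib.sha256(b"placeholder data").hexdigest()
    found_name = os.path.basename(find_by_hash(tmp, recorded_hash))
    missing = find_by_hash(tmp, "0" * 64)
print(found_name, missing)
```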

maikherbig commented 4 years ago

@alfrem just told me about this new tool: https://github.com/uber/neuropod Neuropod allows running inference for many model formats and provides tools to easily convert keras models to the neuropod format. This format can then be used for inference identically from Python and C++! Maybe it would make sense to compare the inference times (pure tensorflow, OpenCV-dnn, neuropod, and maybe also keras2cpp).

paulmueller commented 3 years ago

Exporting models from tensorflow and loading them with OpenCV is not trivial.

The model has to be exported as a frozen graph (e.g. https://github.com/opencv/opencv/issues/16582#issuecomment-603819498, https://github.com/opencv/opencv/issues/16879#issuecomment-603815872).
This does not seem to work with all models (https://stackoverflow.com/questions/60826380/opencv-cant-create-layer-map-shape-of-type-shape-in-function-getlayerinst), because sometimes a manually edited .pbtxt file is required (https://jeanvitor.com/tensorflow-object-detecion-opencv/).

TODO:

Further reading: https://sayak.dev/tf.keras/preprocessing/2020/04/13/embedding-image-preprocessing-functions.html#Step-3:-SavedModel-plunge
Further reading: https://medium.com/@sathualab/how-to-use-tensorflow-graph-with-opencv-dnn-module-3bbeeb4920c5
Find out how hard it is to generate .pbtxt files (e.g. https://github.com/opencv/opencv/blob/master/samples/dnn/tf_text_graph_ssd.py)

maikherbig commented 3 years ago

Good news

It works with OpenCV 4.4.0.42! I used to have "opencv-contrib-python-headless 4.1.1.26" and got all sorts of errors when trying to load and use a frozen .pb file (generated using the model conversion tool in AIDeveloper). After an update using "pip install opencv-python-headless==4.4.0.42", everything suddenly worked perfectly! I wrote a script (aid_OpenCV_dnn.py) that contains a testing function (test_opencv_dnn), which tests whether the original model (.model) and the frozen model (.pb) return the same predictions. The attached .zip contains all required files: OpenCV_dnn_test.zip

When having a look at the testing function you will see how to:

find the input image size, required by the model
find the image normalization function that was used when the model was trained (same normalization needs to be applied when using the model)
apply cropping and normalization method to rtdc images
load a model (.pb)
forward an image through the model to get predictions
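
The steps above could look roughly like the sketch below. It is not taken from aid_OpenCV_dnn.py: the function names are hypothetical, the input size is passed in as a parameter instead of being read from the model, and dividing pixel values by 255 is only an assumed normalization. The OpenCV calls used are cv2.dnn.readNetFromTensorflow, cv2.dnn.blobFromImage, setInput and forward:

```python
def center_crop_box(height, width, target):
    """Return (y0, y1, x0, x1) of a centered target x target window."""
    if target > height or target > width:
        raise ValueError("image is smaller than the model input size")
    y0 = (height - target) // 2
    x0 = (width - target) // 2
    return y0, y0 + target, x0, x0 + target


def predict_with_opencv(pb_path, image, target):
    """Forward one grayscale uint8 image (2D NumPy array) through a frozen .pb.

    `target` is the square input size expected by the model.
    """
    # Imported lazily so the cropping helper works without OpenCV installed.
    import cv2
    import numpy as np

    y0, y1, x0, x1 = center_crop_box(image.shape[0], image.shape[1], target)
    cropped = np.ascontiguousarray(image[y0:y1, x0:x1])
    # Assumption: the model was trained on pixel values divided by 255.
    blob = cv2.dnn.blobFromImage(cropped, scalefactor=1.0 / 255)
    net = cv2.dnn.readNetFromTensorflow(pb_path)
    net.setInput(blob)
    return net.forward()


# Crop window for a hypothetical 80 x 250 pixel image and a 32 x 32 model input.
box = center_crop_box(80, 250, 32)
print(box)
```

Whatever normalization was used during training has to replace the assumed division by 255; otherwise the predictions of the frozen model will not match those of the original model.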

I also tested another model (which has a dropout layer) -> no issues :)

The attached zip also contains aid_bin, aid_img, and aid_start, which are scripts from AIDeveloper (https://github.com/maikherbig/AIDeveloper/tree/master/AIDeveloper).

How to convert (freeze) a model in AIDeveloper?

Open AID
Go to the History tab
Click "Load Model" (button on the left)
Choose target format ("Frozen TensorFlow .pb" or "Optimized TensorFlow .pb") in drop-down menu on the right
Click "Convert" (button on the right)

paulmueller commented 3 years ago

I am closing this one, because the basic .modc model file support is now implemented in dclab 0.30.0. There is a write-up of all the new functionalities in the new ML section of the documentation: https://dclab.readthedocs.io/en/stable/sec_av_ml.html

For all the other issues that were raised during this conversation, please see:

#82
#83
#84
