Progress on rasdaman (Deep Learning) UDFs

KathiSchleidt commented 1 year ago

What's the status on creating rasdaman UDFs? The requirements were discussed in Bremen, should be clear. If not, please ask! Details in the UC2 presentation from Bremen.

ocampos16 commented 1 year ago

@KathiSchleidt as of right now we are still working on the following:

Linking the Python pytorch implementation from Rob into the UDF mechanism. The idea is the replace the existing c++ implementation so that python can be used instead, this will definitely simplify future UDF implementations as well as reduce development time.
Saving a trained model as a collection in rasdaman for further reference from other UDFs.
Designing a catalog mechanism for listing and linking what models can be used with what UDFS.

We will keep you updated with our results as they come.

KathiSchleidt commented 1 year ago

@ocampos16

Very cool! Think being able to create Python based UDF will make this much easier for "normal" users! :)
ah... what's a collection in rasdaman?
This work should be coordinated with what @sMorrone is doing on D4.3 Processing Resource Metadata

More generally (and maybe contained in points 2&3), how can a user see what UDF are available? Or can users only access their own UDF?

ocampos16 commented 1 year ago

@KathiSchleidt

Indeed I believe the same that is why we are focusing all efforts towards this solution.
It means storing the model inside rasdaman. A collection in rasdaman is equivalent to a table in a relational database.
@sMorrone maybe we can have a quick concall to discuss how we relate your catalog with what rasdaman could provide

More generally (and maybe contained in points 2&3), how can a user see what UDF are available? -> There is a query in rasdaman query language, rasql, that is specifically designed to list all available UDFs, regardless of the user. I believe that in a web environment using WCS, WCPS, or WMS would be preferred, this part I need to check with @pebau because this involves a standard, if not then we need to think of another solution. Or can users only access their own UDF? -> So far any user can access all the UDFs rasql and WCPS, is this acceptable to you?

KathiSchleidt commented 1 year ago

On providing a listing of available UDF, to my view, WCPS getCapabilitities would be my first candidate, in addition to exposing via the processing resource metadata. Please include me on the call sorting this!

On all users being able to access existing UDF, works for me. We should check with the UC partners just to be sure, but pretty sure we won't have the issues we have with sensitive data on sensitive models.

robknapen commented 1 year ago

ML models trained on sensitive data might need restricted access as well. For instance depending on the user agreement of the data (what derived products are allowed, often not clearly specified for ML models), or wether the training of the model has sufficiently hidden the sensitive (input) data points (otherwise an ML expert might be able to extract them from the model, as a kind off reverse engineering).

robknapen commented 1 year ago

@ocampos16 Out of curiosity (also relates to 'how to catalogue' and 'what might be restricted'): Do you intend to treat a trained model as a whole, or to split it up into the computational graph and the trained parameters?

pebau commented 1 year ago

@robknapen (chiming in here) dissecting a model is a rabbit hole from our perspective, and I can see no advantage - we would treat a model always as a black box.

pebau commented 1 year ago

@robknapen

ML models trained on sensitive data might need restricted access as well.

Accepted, at some time access control will be necessary - just not at this stage where we have only 1 anyway :)

KathiSchleidt commented 1 year ago

@robknapen turning @pebau statement around, do you see a situation where we provide the same model with 2 sets of trained parameters?

robknapen commented 1 year ago

Sure, for example the same CNN model that we used so far can be trained for other (semantic segmentation) tasks (similar though, since the model architecture expects 28 features as input), or it can be trained for a different region. Both would use the same model architecture (= computational graph), but learn different weights. Splitting these two is the basis for what is known as transfer learning in ML. So for inference you can have a model architecture and load it with matching weights and biases for a number of similar prediction tasks. [For sure this is more difficult to implement than a pure black box approach and there might be no short term benefits.]

Libraries such as Tensorflow, Keras, and PyTorch all have methods that support this type of working with deep learning models. The usually long training times makes it a rather common approach to quickly start experimenting.

pebau commented 1 year ago

status: pytorch-based UDFs work, Jupyterhub almost installed (need Rob's help for completion -> Mohit will contact)

KathiSchleidt commented 1 year ago

@robknapen am I correct that if you have a model trained on 2 different datasets, you'd provide this as 2 different models (most of the info the same, but different input data, maybe different spatial validity)?

robknapen commented 1 year ago

@KathiSchleidt Yes, the models learn to represent the different datasets. When they are 'too different', it will result in distinct models. When the datasets are different but still similar, a single, more robust, model can be trained on them. So there can be exceptions :-)

KathiSchleidt commented 1 year ago

@robknapen any insight as to what impact these exceptions have on the a/p resource metadata? There, we have the following fields forseen:

Input data:URI 1..* : Link to input data/metadata, helpful for a better understanding of context and domain.
Characteristics of input data: CharacterString 1 : This field contains a textual description of the main characteristics of each input data to the resource. This field will also include e.g., description of sampling techniques, version of the data (if multiple versions are available), and, in the case of ML resources, also the percentages of training, validation and testing sets. This field may contain details on the suitability of the resource for the chosen geographic area and thematic context.

Can you use these to describe what you'd need to know?

robknapen commented 1 year ago

@KathiSchleidt I think so. In some cases I would mention an existing (trained) model (or its saved weights) as ‘input data’, and use ‘characteristics’ to explain how it was used.

(Maybe we need a better minimum length for ‘characteristics’? 1 Character doesn’t seem very helpful to me. I would prefer either 0, or enforce some longer text (200+ characters?).)

KathiSchleidt commented 1 year ago

@robknapen

shouldn't we differentiate between:
- Input Data: data the model has been trained on
- Configuration/weights: how the model has been parameterized
on Characteristics of input data, this is of type CharacterString, so free text. This has worried me, as difficult to explain the individual inputs in such a block, but my request to align the cardinality with Input Data was not taken into account

@sMorrone

Should we add an entry for model configuration/weights?
Should we align the cardinality of the input data description with that of the input data?

robknapen commented 1 year ago

@KathiSchleidt Yes, we can split it into configuration/initialisation data and input (training) data, to make the difference in purpose more clear.

sMorrone commented 1 year ago

@KathiSchleidt

we agree on adding an entry for model configuration/weights & will do asap
pertaining to the "align the cardinality of the input data description with that of the input data", current solution we have implemented (a couple of months ago) is using bulleted lists in which each entry is paired with its characteristics. Pic below refers to current online a/p md request form.

When the MD is displayed in the catalog, this solution turns out in what can be seen in pic below

@robknapen @KathiSchleidt does this work for you?

pebau commented 10 months ago

Summarizing the status of rasdaman UDFs:

trained models + datacube regions of interest can be passed for evaluation to pytorch using the UDF mechanism; the corresponding UDF package nn is deployed, it offers function predict() for this purpose.
general python UDFs can be created through a create function statement and copying the code into the rasdaman UDF space (those users who have worked on this already have a login, other prospective users please contact us to create a login).
an example model provided by WER has been deployed as a proof of concept on https://fairicube.rasdaman.com .

Let me know if you feel something missing on pytorch UDFs.

jetschny commented 9 months ago

Jivitesh is now assigned to look into the python UDF implementation (testing and verification). this will provide another UC view and can serve as validation.

jetschny commented 4 months ago

in light of the new issue which formulate the requirements for more ML models in short

https://github.com/FAIRiCUBE/FAIRiCUBE-Hub-issue-tracker/issues/57

I will close this ticket.

FAIRiCUBE / FAIRiCUBE-Hub-issue-tracker

Progress on rasdaman (Deep Learning) UDFs #2