Closed: KathiSchleidt closed this issue 4 months ago
@KathiSchleidt as of right now we are still working on the following:
We will keep you updated with our results as they come.
@ocampos16
More generally (and maybe contained in points 2&3), how can a user see what UDFs are available? Or can users only access their own UDFs?
@KathiSchleidt
More generally (and maybe contained in points 2&3), how can a user see what UDFs are available? -> There is a query in the rasdaman query language, rasql, that is specifically designed to list all available UDFs, regardless of the user. I believe that in a web environment using WCS, WCPS, or WMS would be preferred; I need to check this part with @pebau because it involves a standard. If not, then we need to think of another solution.

Or can users only access their own UDFs? -> So far any user can access all UDFs via rasql and WCPS. Is this acceptable to you?
On providing a listing of available UDFs: to my view, WCPS GetCapabilities would be my first candidate, in addition to exposing them via the processing resource metadata. Please include me on the call sorting this!
On all users being able to access existing UDFs: works for me. We should check with the UC partners just to be sure, but I'm pretty sure we won't have the same issues with models that we have with sensitive data.
ML models trained on sensitive data might need restricted access as well, for instance depending on the user agreement for the data (what derived products are allowed is often not clearly specified for ML models), or on whether the training of the model has sufficiently hidden the sensitive (input) data points (otherwise an ML expert might be able to extract them from the model, as a kind of reverse engineering).
@ocampos16 Out of curiosity (also relates to 'how to catalogue' and 'what might be restricted'): Do you intend to treat a trained model as a whole, or to split it up into the computational graph and the trained parameters?
@robknapen (chiming in here) dissecting a model is a rabbit hole from our perspective, and I can see no advantage - we would treat a model always as a black box.
@robknapen
ML models trained on sensitive data might need restricted access as well.
Accepted, at some point access control will be necessary - just not at this stage, where we have only 1 anyway :)
@robknapen turning @pebau's statement around, do you see a situation where we provide the same model with 2 sets of trained parameters?
Sure. For example, the same CNN model that we used so far can be trained for other (semantic segmentation) tasks (similar ones, though, since the model architecture expects 28 features as input), or it can be trained for a different region. Both would use the same model architecture (= computational graph) but learn different weights. Splitting these two is the basis for what is known as transfer learning in ML. So for inference you can take one model architecture and load it with matching weights and biases for a number of similar prediction tasks. [For sure this is more difficult to implement than a pure black-box approach, and there might be no short-term benefits.]
Libraries such as TensorFlow, Keras, and PyTorch all have methods that support this way of working with deep learning models. The usually long training times make it a rather common approach to quickly start experimenting.
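The architecture/weights split described above can be sketched in a few lines. This is an illustrative assumption, not the UC2 model: NumPy is used instead of PyTorch to keep the sketch dependency-light, the layer shapes and "region" weight sets are invented, and only the 28-feature input dimension is taken from the discussion. In PyTorch the same idea is served by one `nn.Module` class loaded with different `state_dict`s.

```python
import numpy as np

def tiny_model_forward(x, weights):
    """One fixed architecture (the computational graph): a single
    dense layer followed by ReLU. The same function can be paired
    with any set of trained weights of matching shapes."""
    w, b = weights["w"], weights["b"]
    return np.maximum(x @ w + b, 0.0)

# Hypothetical weight sets: same shapes (architecture fixed at 28 input
# features), different values, e.g. trained for two different regions.
rng = np.random.default_rng(0)
weights_region_a = {"w": rng.normal(size=(28, 4)), "b": np.zeros(4)}
weights_region_b = {"w": rng.normal(size=(28, 4)), "b": np.ones(4)}

x = rng.normal(size=(2, 28))          # two samples, 28 features each
out_a = tiny_model_forward(x, weights_region_a)
out_b = tiny_model_forward(x, weights_region_b)
print(out_a.shape, out_b.shape)       # same graph, two trained-parameter sets
```

Swapping `weights_region_a` for `weights_region_b` changes the predictions without touching the graph; that separation is exactly what makes transfer learning and "one model, two parameter sets" practical.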
Status: PyTorch-based UDFs work; JupyterHub is almost installed (we need Rob's help for completion -> Mohit will contact him).
@robknapen am I correct that if you have a model trained on 2 different datasets, you'd provide this as 2 different models (most of the info the same, but different input data, maybe different spatial validity)?
@KathiSchleidt Yes, the models learn to represent the different datasets. When they are 'too different', it will result in distinct models. When the datasets are different but still similar, a single, more robust, model can be trained on them. So there can be exceptions :-)
@robknapen any insight as to what impact these exceptions have on the a/p resource metadata? There, we have the following fields foreseen:
Can you use these to describe what you'd need to know?
@KathiSchleidt I think so. In some cases I would mention an existing (trained) model (or its saved weights) as ‘input data’, and use ‘characteristics’ to explain how it was used.
(Maybe we need a better minimum length for ‘characteristics’? One character doesn’t seem very helpful to me. I would prefer either 0, or enforcing some longer text (200+ characters?).)
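The suggested rule ("either empty or substantive") is easy to enforce in metadata validation. A minimal sketch, assuming the field name `characteristics` and the proposed 200-character threshold; both are taken from the comment above, while the function and constant names are invented for illustration:

```python
# Proposed validation rule for the 'characteristics' metadata field:
# the field may be omitted entirely (length 0), but if present it must
# carry a substantive description (>= 200 characters).
MIN_CHARACTERISTICS_LEN = 200  # threshold suggested in the discussion

def characteristics_valid(text: str) -> bool:
    text = text.strip()
    return len(text) == 0 or len(text) >= MIN_CHARACTERISTICS_LEN

print(characteristics_valid(""))                         # True: omitted is allowed
print(characteristics_valid("x"))                        # False: 1 char is not helpful
print(characteristics_valid("long description " * 20))   # True: substantive text
```

This rejects the one-character placeholder case while still allowing the field to be left out altogether.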
@robknapen
@sMorrone
@KathiSchleidt Yes, we can split it into configuration/initialisation data and input (training) data, to make the difference in purpose more clear.
@KathiSchleidt
When the MD is displayed in the catalog, this solution results in what can be seen in the picture below.
@robknapen @KathiSchleidt does this work for you?
Summarizing the status of rasdaman UDFs:

- The `nn` UDF is deployed; it offers the function `predict()` for this purpose.
- New UDFs are added via a `create function` statement and by copying the code into the rasdaman UDF space (those users who have worked on this already have a login; other prospective users, please contact us to create a login).

Let me know if you feel something is missing on the PyTorch UDFs.
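To make the black-box usage above concrete, here is a sketch of what the body of such a Python prediction UDF could look like. Everything here is an assumption for illustration: the `predict` signature, the tile-in/labels-out contract, and the linear stand-in for the loaded model weights are invented; only the 28-feature input stack comes from the thread, and the actual rasdaman UDF interface should be taken from the rasdaman documentation.

```python
import numpy as np

# Hypothetical shape of a Python UDF body: the server passes an array
# tile in, and the function returns an array of the same spatial shape
# holding predicted class labels. A linear scorer stands in for the
# real trained model (e.g. PyTorch weights loaded at deployment time).
_WEIGHTS = np.linspace(-1.0, 1.0, 28)   # stand-in for loaded model weights

def predict(tile: np.ndarray) -> np.ndarray:
    """tile: (height, width, 28) feature stack -> (height, width) labels."""
    scores = tile @ _WEIGHTS                 # linear score per pixel
    return (scores > 0).astype(np.uint8)     # threshold to class labels

labels = predict(np.zeros((4, 5, 28)))
print(labels.shape)   # (4, 5)
```

The caller never needs to know whether `predict()` wraps a CNN, a random forest, or a lookup table, which is the black-box treatment argued for earlier in the thread.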
Jivitesh is now assigned to look into the Python UDF implementation (testing and verification). This will provide another UC view and can serve as validation.
In light of the new issue, which briefly formulates the requirements for more ML models:
https://github.com/FAIRiCUBE/FAIRiCUBE-Hub-issue-tracker/issues/57
I will close this ticket.
What's the status on creating rasdaman UDFs? The requirements were discussed in Bremen and should be clear; if not, please ask! Details are in the UC2 presentation from Bremen.