Proposal for rasdaman jupyter notebook template

ocampos16 commented 1 year ago

I have updated the repo with the newest version of the jupyter notebook for the rasdaman ML UDF proof of concept. This will be the endpoint for the Resource metadata under the Reference subheader.

It would be nice if we agree on the structure of this notebook so that it can later be used as a template for the other ML UDFs. Maybe even use it to programmatically generate the preamble of new ML notebooks with the metadata provided by Resource metadata.

robknapen commented 1 year ago

Indeed, good to streamline it a bit and clarify what should go into the resource metadata, a reference notebook (if available) and an example (also a notebook)? And how one can automatically be created from another, otherwise it will be a lot of work to fill everything in and keep it in sync (as there might be multiple variants and versions of models).

robknapen commented 1 year ago

Maybe this should be moved to the central FAIRiCUBE Hub issue list? I suppose something similar will apply to a/p resources on EOX.

pebau commented 1 year ago

my 2 cents:

don't know what a reference notebook is. We planned to do just the example one.
I do not believe we can automate this fully, data are just too heterogeneous and need different handling.

KathiSchleidt commented 1 year ago

First off, I transferred the issue to the general FAIRiCUBE-Hub-issue-tracker as you correctly noted that it's far wider than UC2 (Hope this works!)

Comparing the information in the rasdaman ML UDF proof of concept with the corresponding A/P Resource Metadata, I see some divergence, e.g. on the Inputs:

Jupyter NB:
- sentinel2_image: [subset] of preprocessed sentinel2 image (ingested to rasdaman)
- maxes_sentinel2_image: Per band maxes of the whole sentinel2 image (ingested to rasdaman)
A/P Resource MD:
- Feature data: 7 Sentinel-2 images, R,G,B,NIR bands, representative of the Dutch growing season 2018. The data was in UTM projection and only cloud free images have been used. It covered a study area in the North-East of the country.
- Label data: The Dutch agricultural land registration data from 2018 of the study area has been used as ground truth data. It contains the farm parcel boundaries and the planted crops. The full list of crops has been reduced to 76 major types that were at least present in the region and thought to be potentially recognisable from the feature data. Still, the labels are significantly imbalanced.

Defining a base structure for the Jupyter NBs that's aligned with the A/P Resource MD would be very valuable, make it far easier to maintain alignment between the NBs and the MD describing them. @sMorrone could EPSIT provide a first proposal for this?

sMorrone commented 1 year ago

I see your point. However, I am not so much in favour of generating the preambles of new ML notebooks with the metadata provided by Resource metadata. In my opinion, this would lead to JNs containing information that is very different from the content one would expect to find there, i.e. too much detail. I would keep the two things (JN and a/p MD) separate. Maybe a good idea would be linking from the JN to the related MD. What do you think? However, it is important to start creating metadata for the actual resources in the use cases.
We could label what is already in the issue tracker as "test" and ask the UCs to start creating 'true' metadata now that they have something concrete to create metadata for. Or maybe we just close the current issues to have a 'cleaner' issue tracker?

KathiSchleidt commented 1 year ago

I like the idea of agreeing on a way of providing a link to the a/p MD resource from the JN. Saves us the duplication, all information is where it's required.

On issues, I'd prefer truly deleting the initial tests, as I'm very much hoping that we'll also close real issues. Then we can no longer differentiate if an issue was closed because it was a non-test-issue, or closed because it was resolved. While I'm aware that we can sort a lot with labels, I think it would be far easier to handle if we didn't always have to filter out test issues

KathiSchleidt commented 11 months ago

We seem to have agreement on not providing too much additional metadata within the JN, instead just providing a link to the a/p resource record (direct link to STAC JSON)
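As a rough illustration of that agreement, a notebook could open with a generated Markdown cell that only carries the link. This is a minimal sketch: the helper name `metadata_link_cell` and the example URL are hypothetical placeholders, not real FAIRiCUBE identifiers.

```python
def metadata_link_cell(title, stac_item_url):
    """Build the Markdown source of a notebook cell that points to the a/p record."""
    if not stac_item_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return (
        f"# {title}\n\n"
        f"Resource metadata (STAC JSON): [{stac_item_url}]({stac_item_url})\n"
    )


# Placeholder values for illustration only.
cell = metadata_link_cell(
    "Example ML notebook",
    "https://example.org/stac/items/example.json",
)
print(cell)
```

Keeping only the link in the notebook means the metadata lives in one place and the notebook never needs to be regenerated when the record changes.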

However, we've lost the original topic of this issue, providing clean JNs that illustrate tricky bits as templates. I've heard requests for the like from the UC over the last months, so believe this point is still valid. These would also be valuable for the KB.

Question is where to collect them? Maybe in the code section of FAIRiCUBE-Hub-issue-tracker as neutral ground? @jetschny thoughts?

robknapen commented 11 months ago

I don't think we have adjusted these ideas yet after the rasdaman UDF approach has changed from supporting C++ code to being able to run Python code. The old proof of concept (code) and (template) Notebook might no longer apply? It might be good to have the seminar about the new Python UDFs first, and then discuss how it affects what we got from the initial proof of concept and what needs to be updated?

KathiSchleidt commented 11 months ago

@pebau @robknapen is there already a date for the seminar about the new Python UDFs? Maybe in the frame of the data science seminars or UC Synergy WSs? Think that the UDF topic is interesting for all UC.

pebau commented 11 months ago

using the python UDF is quite straightforward:

- you already know how to create a UDF as such
- now, in the "create function" statement, you use "language python"
- put the python code into /opt/rasdaman/share/rasdaman/udf/mylittleudf.py
- in the UDF code, simply use the input, like:

import numpy as np
def mean(arg):
    return np.mean(arg)

Curious on your plans, maybe you can share. But then again: the Recommended Good Practice is that you first use Jupyter to load data from rasdaman and do the python processing, and only then move the code into the UDF. At the same time, this yields a nice developer documentation of the UDF.
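The practice described here could look roughly like this in a notebook cell: the function is written as it would appear in the UDF file and is first exercised on a small array. This is only a sketch; the synthetic array stands in for data loaded from rasdaman, and its shape is an assumption.

```python
import numpy as np


def mean(arg):
    """Same body as the UDF function above: average over all cells of the input."""
    return np.mean(arg)


# Synthetic stand-in for an array fetched from rasdaman (assumed 4 x 5 pixels).
sample = np.arange(20, dtype=np.float64).reshape(4, 5)

result = mean(sample)
print(result)
```

Once the function behaves as expected in the notebook, the same definition can be moved into the UDF file unchanged, and the notebook remains as developer documentation.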

Bottom line and good news: you don't need to wait for continuing.

KathiSchleidt commented 11 months ago

@pebau where is this documented outside @robknapen and your brain?

From my experience, examples/templates are very useful, have been requested by the UC

pebau commented 11 months ago

the tutorial will have material.

robknapen commented 11 months ago

@pebau Our plans are still to use the functionality for deep learning on multi-dimensional input data. So going from applying a simple numpy mean function, can you provide an example that works similarly to what we had in the C++ world? So input is a spatial data cube (via the Java Petascope connection?), and output a new spatial data cube that has predicted class values and prediction probabilities. It is fine to start with that first and later consider other output dimensionalities and tracking progress and prediction accuracies.

robknapen commented 11 months ago

Ok, posts crossed. Fine if it is in the tutorial material. I can wait for that.

pebau commented 11 months ago

@robknapen it would be tremendously helpful to have a Jupyter notebook demonstrating your exact plans - I have no clue what data to be accessed, how they will be selected, where does the model come from, etc. Many thanks!

So far the tutorial can cover exactly rasdaman UDFs as sketched here in the issue, nothing more.

robknapen commented 11 months ago

Fair enough, so best then to let our use case progress further first developing the science and methodology we need, locally code the Notebooks for the machine learning involved, and get back to you with those sometime later next year. We can then see what can still be incorporated as UDFs in rasdaman and how to do that.

pebau commented 11 months ago

sounds good, Rob - looking forward to some JN; can be rough and sketchy, no need to polish for us, just to give us an understanding of what course the ship goes :)

robknapen commented 11 months ago

Still a bit unclear to me if this issue still applies and what to do with the previous C++ based UDF and example I worked on with Otoniel. Do we discard all that?

pebau commented 11 months ago

@robknapen definitely not discard, but you can (and should) continue using it. Only once you want to do things that require direct python access you need to resort to the new path.

Would you share your big picture with us, what actually you are heading for? This would allow us to give even better support. For now it looks like you have the opportunity to run a model, and that was the goal which now is achieved. What else?

robknapen commented 11 months ago

@pebau If I recall correctly, and please correct me if further developments have happened, we left the C++ UDF proof of concept at a prototype stage that was dedicated to running the example Sentinel 2 input based Dutch crop classification model. That was an example I provided to get development going quickly, but it is not something we want to do in our use case. There we want to do species distribution modelling / estimation, and causal inference between farming activities and biodiversity changes (<= that is still our big picture). From what I remember about the C++ UDF code it will not suffice, for example because the required input data will be different and will not match the stuff that was hard-coded (like the data normalisation). We left that PoC with a few open ends that might not have been addressed because the development path switched to prefer the Python based solution.
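To illustrate the point about the normalisation, a version that takes the per-band maxima as an input (as the notebook's maxes_sentinel2_image input suggests) might look like the sketch below. The function name and the (bands, height, width) layout are assumptions, not the behaviour of the existing C++ code.

```python
import numpy as np


def normalise_by_band_max(image, band_maxes):
    """Scale each band of `image` by its maximum so values fall in [0, 1].

    image:      array of shape (bands, height, width)  (assumed layout)
    band_maxes: array of shape (bands,), one maximum per band
    """
    image = np.asarray(image, dtype=np.float32)
    band_maxes = np.asarray(band_maxes, dtype=np.float32)
    if image.ndim != 3:
        raise ValueError("expected an array of shape (bands, height, width)")
    if band_maxes.shape != (image.shape[0],):
        raise ValueError("expected one maximum per band")
    if np.any(band_maxes <= 0):
        raise ValueError("band maxima must be positive")
    # Reshape to (bands, 1, 1) so the division broadcasts over height and width.
    return image / band_maxes[:, None, None]


image = np.array([[[2.0, 4.0]], [[5.0, 10.0]]])  # 2 bands, 1 x 2 pixels
scaled = normalise_by_band_max(image, [4.0, 10.0])
print(scaled)
```

Passing the maxima in as data rather than fixing them in code is what would let the same routine serve inputs other than the original Sentinel-2 example.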

KathiSchleidt commented 11 months ago

@robknapen to my understanding, the initial C++ model you'd deployed as a UDF was a pure proof-of-concept, nothing to do with your actual UC. Thus, while I'd be all for storing this code for future reference (nice example of a C++ UDF), not anything we'll actually be using. Please provide to @sMorrone for inclusion in the KB.

@pebau in order to enable Rob to deploy actual ML routines (that require Python), getting the Python kernel running #14 is a prerequisite. I don't see how understanding details of the content of Rob's UC will accelerate that process.

robknapen commented 11 months ago

@KathiSchleidt I have already provided the resource descriptions for both the data and the model. Fine by me if you want to keep using it as a dedicated hard-coded C++ UDF showcase.

jetschny commented 8 months ago

Jivitesh is now assigned to look into the python UDF implementation (testing and verification). This will provide another UC view and can serve as validation. He will now also clarify with the rasdaman counterpart (Bang) how to create a suitable python kernel.

bangph commented 8 months ago

@jetschny this task (UDF in python / C++) is more suitable to our rasdaman technical leader: Dimitar (username: misev). I put him on the ticket, so please discuss with him instead.

jetschny commented 8 months ago

@bangph : you have been appointed by Peter to be the main data science contact point for FiC users, specifically Jivitesh. Please clarify internally if it should be Dimitar instead

misev commented 8 months ago

Python UDFs are documented in "4.18.6. Writing Python UDF Code" along with examples, and registering such a Python UDF in rasdaman is covered in "4.18.3. Creating a UDF".

I'll gladly help with any questions on details you may have. As far as I know everyone should have credentials to access the above links.

@jetschny Bang is the primary contact and he connects me when that is needed, such as in this case.

robknapen commented 8 months ago

Hi @misev , I found the documentation on the fairicube.rasdaman.com VM that we are using for the project. It mentions an interesting example, but it is rather brief and leaves out many details. Is it possible to get a walkthrough of the full example and code? Perhaps in a FAIRiCUBE webinar. Probably at least @jivitesh-sharma will be interested as well.

jetschny commented 8 months ago

I would like to propose the 20.02. 13:00 -14:00 for such a webinar, it is a FAIRiCUBE common topic seminar, should be reserved in all FiC member calendars already...

misev commented 8 months ago

Hi Rob, the code left out from that example is not relevant to the mechanism of how to write the UDF-specific parts of a UDF and register it in rasdaman. The relevant part is the function signature (parameters and result returned), while what is done with the parameters to achieve the result is code specific to a use case: I imagine you already have this use case code and are just missing the part where you connect it to rasdaman as a UDF?

Which particular details of the example are unclear?

@jetschny I will not be available in the middle two weeks of February.

robknapen commented 8 months ago

Hi Dimitar, that is correct of course. However, a walkthrough of code and some additional explanation usually helps in getting up to speed faster, and probably will save us (as novice rasdaman users) some time otherwise spent on trial and error and restarts of rasdaman on the VM.

misev commented 8 months ago

If you have the Python code ready I could look at creating the UDF initially in rasdaman, so we can make progress faster. In any case the process with Python UDFs is a lot more straightforward in comparison to the complexity of C++.

robknapen commented 8 months ago

Alright, I guess we need to work first then on getting more UC data in rasdaman and training a model based on that. And then can come back to the UDF model inference code.

pebau commented 8 months ago

@robknapen indeed, you can help us in working out the webinar:

Which particular details of the example are unclear?

Your input will be valued for crafting the tutorial.

The (single) example we have is already good enough for doing a tutorial.

robknapen commented 8 months ago

Excellent, happy to help.

I think the tutorial/webinar is not just for me :-) so an end-to-end tutorial about Python UDFs for machine learning inference using PyTorch or TensorFlow would be great. If you want to use our crop classification example for it please go ahead.

Specifically to the example in the manual my (more detailed) questions are e.g. about error propagation, logging mechanism, how to return multiple outputs (e.g. classification + probabilities), how to create a geospatial output (I assume everything works via WCPS as well), where are the files with trained weights stored (is there a standard command to get a list?), how are these versioned, what is the python environment used, how is it updated, Is the mentioned EfFormer indeed an actual Python module or just a single file? Etc.

And perhaps for FAIRiCUBE, how does it link into the catalogues?

pebau commented 8 months ago

@robknapen I enjoy how we dive into it step by step - I see already many sub-topics arising from this which allow to spawn sub-issues. Anyway, the original topic does not fit any longer here. OK to continue under UC2?

robknapen commented 8 months ago

We moved (this issue) from UC2 specific discussion to this more general repo, so I would suggest to keep it here :-) But renaming or starting separate issues as needed sounds good. Maybe best following the webinar?

misev commented 8 months ago

@robknapen that's very helpful, I'll try to cover some of those topics in the documentation as well next week.

In WCPS it is also necessary to create a UDF that invokes the rasql UDF (doc); we're working out how to semiautomate this part as it is relatively straightforward.

misev commented 7 months ago

@robknapen @jivitesh-sharma we updated the documentation with more details on error handling, logging, PYTHONPATH and Python versions, and a note about multiple return values:

Complex return types, such as tuples or objects are not supported. Multiple arrays as long as they are of the same shape (spatial domain) can be returned as one multiband numpy array.
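For the classification plus probabilities case raised earlier in this thread, the note could translate into something like the following sketch. The function name, the (classes, height, width) input layout, and the band-first output order are assumptions; the thread does not state which axis order rasdaman expects.

```python
import numpy as np


def stack_prediction(class_scores):
    """Combine predicted labels and their probabilities into one multiband array.

    class_scores: array of shape (classes, height, width) with per-class
                  probabilities (assumed model output layout).
    Returns an array of shape (2, height, width):
      band 0 = predicted class index, band 1 = probability of that class.
    """
    class_scores = np.asarray(class_scores, dtype=np.float32)
    labels = np.argmax(class_scores, axis=0)
    confidence = np.max(class_scores, axis=0)
    # Both bands share the same spatial domain, so they can be stacked. The
    # class index is cast to float32 so that both bands have one cell type.
    return np.stack([labels.astype(np.float32), confidence], axis=0)


scores = np.array([[[0.1, 0.7]], [[0.9, 0.3]]])  # 2 classes, 1 x 2 pixels
out = stack_prediction(scores)
print(out)
```

Storing the class index as a float is a trade-off made here to keep a single array; a consumer would cast band 0 back to an integer type.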

This part is more specific:

where are the files with trained weights stored (is there a standard command to get a list?), how are these versioned

These files are not managed by rasdaman. They have to be stored with system permissions that allow the rasdaman system user (that executes the Python UDF) to read the files.
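Since the files are outside rasdaman's control, UDF code may want to fail early with a clear message when the weights cannot be read. A minimal sketch using only the Python standard library follows; the helper name `require_readable` is hypothetical, and the check applies to whichever user runs the code.

```python
import os


def require_readable(path):
    """Return `path` if the current process user can read it, else raise."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"weights file not found: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"weights file not readable by this user: {path}")
    return path
```

Calling such a check before loading a model turns a permission problem into an explicit error instead of a failure deep inside the ML library.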

pebau commented 7 months ago

@jetschny as all seems to be done except that an intro webinar was desired, can we schedule that? Would you take the lead, or should we?

jetschny commented 7 months ago

webinar was scheduled already for the 20.02. 13-14:00 during the "FAIRiCUBE common topics seminar" slot. If not suitable, gladly reschedule or get back to me. Please arrange with Rob & Jivitesh what to cover (re-cap from previous activities). I cannot attend but trust all parties to inform each other and to record the session for me.

pebau commented 7 months ago

ok, so we will fit it in there - perfect. I will take care.

jetschny commented 3 months ago

as the webinar was given, content was documented, can this issue be closed?

pebau commented 3 months ago

I am not aware of anything unresolved in this issue, so closing it. Everybody feel free to reopen if a question remains, or add a new issue.

misev commented 3 months ago

@robknapen I've reimplemented the C++ UDF as a Python UDF versioned here.

You'll notice it's much more straightforward and I think it's a good basis for implementing support for further models. This UDF is deployed on our fairicube VM and available in rasdaman, I compared outputs of both the C++ and the Python one and they are equivalent.

robknapen commented 3 months ago

@misev Definitely much more pythonic :)

FAIRiCUBE / FAIRiCUBE-Hub-issue-tracker

Proposal for rasdaman jupyter notebook template #13