NOAA-OWP / DMOD

Distributed Model on Demand infrastructure for OWP's Model as a Service

Data Service upload issues #638

Open aaraney opened 3 months ago

aaraney commented 3 months ago

https://github.com/NOAA-OWP/DMOD/blame/e5aa5d50837bc751ede138ea213582a3ae173308/python/lib/core/dmod/core/dataset.py#L530

The addition of the domain parameter to the DatasetManager's add_data method in #608 causes both the websocket and REST dataset upload handlers to fail. It looks like I missed the impact of this change in my review of #608.

@robertbartel, can we weaken this requirement to an optional argument?

Code affected: see the `dataset.py` permalink above.
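For illustration, here is roughly what weakening it could look like. This is a minimal sketch: the class and signature below are stand-ins, not the actual definitions in `dmod.core.dataset`:

```python
from typing import Optional


class DataDomain:
    """Stand-in for dmod.core.meta_data.DataDomain."""


class DatasetManager:
    """Stand-in; the real abstract manager lives in dmod.core.dataset."""

    def add_data(self, dataset_name: str, dest: str, data: bytes,
                 domain: Optional[DataDomain] = None) -> bool:
        # With a default of None, existing callers (the websocket and REST
        # upload handlers) keep working; what to do when no domain is
        # supplied is the open question in this thread.
        if domain is None:
            pass  # e.g., leave the dataset's domain untouched, or synthesize a placeholder
        return True
```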

robertbartel commented 3 months ago

When there is a change to the data a dataset contains - an add or a delete - the dataset's true domain has changed. We don't yet have everything we need to automatically detect this for every supported dataset, so we require that changes to the domain be sent along with any changes to the data.

So we can't simply take that out without introducing another breakage.

aaraney commented 3 months ago

> So we can't simply take that out without introducing another breakage.

What about constructing a Generic data format / domain if a DataDomain is not provided?
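Roughly something like this (a sketch; the `GENERIC` member and the helper are assumptions for illustration, not the real `dmod.core.meta_data` definitions):

```python
from enum import Enum
from typing import Optional


class DataFormat(Enum):
    """Stand-in for the real DataFormat enum in dmod.core.meta_data."""
    GENERIC = "GENERIC"  # assumes a generic/catch-all format member


class DataDomain:
    """Stand-in for dmod.core.meta_data.DataDomain."""
    def __init__(self, data_format: DataFormat):
        self.data_format = data_format


def domain_or_generic(domain: Optional[DataDomain]) -> DataDomain:
    # Fall back to a placeholder domain in the generic format when the
    # caller did not supply one.
    return domain if domain is not None else DataDomain(DataFormat.GENERIC)
```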

robertbartel commented 3 months ago

That depends on what you mean. The dataset would get a valid data domain that way. But it wouldn't reflect the actual data. So the data service wouldn't be able to find a dataset and use it to fulfill requirements that the dataset could actually cover.

aaraney commented 3 months ago

> The dataset would get a valid data domain that way. But it wouldn't reflect the actual data. So the data service wouldn't be able to find a dataset and use it to fulfill requirements that the dataset could actually cover.

Exactly! So, should domain metadata information be opt-in if you are using the CLI tools? Part of me says yes, but I am persuadable.

As an aside, I imagine we would do something similar once we have detection capabilities developed. For example: a user uploads a file, we try to detect the domain of that data item, and we cannot. I don't think we should fail the whole operation (the file could be quite large) when the upload itself succeeded; it's the metadata detection that failed. In that case I think there is a strong argument for using a Generic DataDomain as a placeholder.
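In sketch form (the detector hook and exception type are hypothetical, and the `DataDomain`/`DataFormat` stand-ins from the sketch above are reused):

```python
class DomainDetectionError(Exception):
    """Raised when no detector can work out a data item's domain."""


def detect_domain(path: str) -> "DataDomain":
    # Hypothetical detection hook; per this thread, real detectors
    # currently exist only on the client side.
    raise DomainDetectionError(f"no detector available for {path}")


def domain_for_uploaded_item(path: str) -> "DataDomain":
    """Called *after* a successful upload: try detection, and fall back to
    a generic placeholder domain rather than failing the whole upload."""
    try:
        return detect_domain(path)
    except DomainDetectionError:
        return DataDomain(DataFormat.GENERIC)  # placeholder, as suggested above
```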

robertbartel commented 3 months ago

> So, should domain metadata information be opt-in if you are using the CLI tools?

No, not at the moment at least. A proper domain definition for a dataset is necessary, regardless of the source of creation/upload, except perhaps for special, ephemeral situations. Without an accurate (and at least reasonably precise) domain, a dataset isn't really useful. And right now the only place detectors are used is on the client side.

The idea was to provide the available tools to help detect domains in the client, but also accept explicitly provided domains for cases where automated detection isn't working yet, and then have the server-side API require that domain changes be given along with any data changes. Eventually, when we have working detectors for all formats, we can think about redesigning this, but that's the intent right now.

> I don't think we should fail the whole operation (the file could be quite large) when the upload itself succeeded

Again, I don't think that works. It prevents wasting the upload, but it also "disables" the dataset. I.e., if adding the last forcing CSV file fails, instead of a forcing dataset that covers all but one of the intended catchments, you've got something just taking up space until it is somehow fixed (and there is no mechanism for doing that right now).

Also, if we are expecting the updates to the domain to come from the client, I think we can have the failure before an upload, and thus avoid doing it at all.

robertbartel commented 3 months ago

One other wrinkle is that we not only need to detect/acquire an item's domain somewhere when adding a data item, but also ensure that the item's domain can be merged with the dataset's domain. If it can't be merged, once again, I don't think we want to disable a previously working dataset.
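For instance, a toy version of that check, run before anything is written (the single catchment-set restriction stands in for the real DataDomain's several continuous and discrete restrictions):

```python
from typing import Optional, Set


class ToyDomain:
    """Toy stand-in for DataDomain: a format name plus a catchment set."""
    def __init__(self, data_format: str, catchments: Set[str]):
        self.data_format = data_format
        self.catchments = catchments


def try_merge(current: ToyDomain, addition: ToyDomain) -> Optional[ToyDomain]:
    # Toy rule: formats must match; catchment restrictions union cleanly.
    if current.data_format != addition.data_format:
        return None
    return ToyDomain(current.data_format, current.catchments | addition.catchments)


def validate_before_add(current: ToyDomain, addition: ToyDomain) -> ToyDomain:
    """Check mergeability *before* any data is written, so a failed add can
    never leave a previously working dataset in a broken state."""
    merged = try_merge(current, addition)
    if merged is None:
        raise ValueError("item domain cannot be merged into the dataset domain")
    return merged
```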

aaraney commented 3 months ago

> Without an accurate (and at least reasonably precise) domain, a dataset isn't really useful.

I agree in the case of forcing, hydrofabric, and partition datasets, but IMO all other dataset variants don't really benefit from this metadata -- yet.

I don't think it's necessarily a bad thing, at this phase in the project, that in some areas our dataset metadata is not "complete" (for lack of a better term). My main concern really comes down to timing and unknown unknowns. To me, this is another case of being the most downstream project while all upstream projects are more or less in alpha or beta. We have little control over upstream changes to data formats (e.g. some BMI module's init config) and that is okay; we are going to have to just deal with that for the moment. I don't think we should sink a ton of time into worrying about things that we know are not nearing their final form yet. IMO, the good-citizen approach of adding metadata to a dataset at creation, or after the data has been added, is the best middle ground for where we are in relation to the projects we depend on. Forcing, hydrofabric, and partition datasets can reap the benefits of this approach, while other dataset variants don't have the same requirements.

In the future, I completely agree, updating a dataset's data domain information is a requirement when modifying a dataset. I just don't think we are there yet, and that is not a fault of our own or of the upstream projects. IMO there are just too many unknown unknowns and axes of change that we cannot control.

aaraney commented 3 months ago

> Also, if we are expecting the updates to the domain to come from the client, I think we can have the failure before an upload, and thus avoid doing it at all.

Yeah, I think that is fair if we are talking strictly about the CLI stuff. Would you expect that a service like the evaluation service would also do similar checks prior to uploading data?

aaraney commented 3 months ago

> One other wrinkle is that we not only need to detect/acquire an item's domain somewhere when adding a data item, but also ensure that the item's domain can be merged with the dataset's domain. If it can't be merged, once again, I don't think we want to disable a previously working dataset.

In principle, I agree. I do think we need a way to do this forcefully, though: a way to add data to a dataset without modifying the dataset's domain, or modifying it in such a way as to denote that changes were forcefully made.
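Something like a hypothetical `force` escape hatch, which also flags the dataset so the bypass stays visible:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Stand-in record; not the real DMOD Dataset class."""
    name: str
    domain_dirty: bool = False          # set when a forced change bypassed the domain
    audit_log: List[str] = field(default_factory=list)


def add_data(ds: DatasetRecord, item: str, domain: Optional[object] = None,
             force: bool = False) -> None:
    # Hypothetical behavior: without a domain, the add is refused unless forced.
    if domain is None and not force:
        raise ValueError("a domain update is required unless force=True")
    # ... write the data item here ...
    if domain is None:
        ds.domain_dirty = True
        ds.audit_log.append(f"forced add of {item!r} without a domain update")
```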

robertbartel commented 3 months ago

> Yeah, I think that is fair if we are talking strictly about the CLI stuff. Would you expect that a service like the evaluation service would also do similar checks prior to uploading data?

Hmm ... not sure. Perhaps not. Though maybe so: are we using classes from dmod.client in the evaluation service for connecting to the data service (I think we want to)?

Regardless, I still think that a failed attempt to add new data items to a dataset should, in general, not reduce the usability of the dataset from what it was before the "add" attempt. If there are cases when that is undesirable - it initially seems unnecessary in your example, but not problematic - then we probably need to further change the API.

> In principle, I agree. I do think we need a way to do this forcefully, though: a way to add data to a dataset without modifying the dataset's domain, or modifying it in such a way as to denote that changes were forcefully made.

I'm curious whether you mean "without immediately" here - i.e., the domain update will come, just separately - or completely without any domain update. Setting aside the "without immediately" case, while it should be safe for a dataset to contain data beyond what is described by its domain, is there a reason why we'd ever want this?

christophertubbs commented 3 months ago

> Though maybe so: are we using classes from dmod.client in the evaluation service for connecting to the data service (I think we want to)?

Nope - the evaluation service isn't connected to the rest of the dmod services because it's not clear how to integrate with or even reliably run the data service. A dmod.client may be used when everything is ready, but REST and file operations are currently what's used to load data in the eval service.

aaraney commented 3 months ago

I agree that when we add data to a dataset we need to update the dataset's domain. At this point in time, without detectors implemented for all of the major dataset types we support, this is tedious and just feels like another step. In the future, when this is more automated, it will be extremely useful and a must-have. But in the present, while we are trying to integrate with other parts of the system and dataset formats are continually evolving (e.g. changing / new BMI init config formats, or hydrofabric formats / data models), I would rather slightly neglect some of the dataset intelligence features of DMOD (for now) and keep it dead simple and focus on getting larger components connected (e.g. eval capabilities, GUI), and circle back to this when we have more confidence in running models and conducting evaluations.

robertbartel commented 3 months ago

> I would rather slightly neglect some of the dataset intelligence features of DMOD (for now) and keep it dead simple and focus on getting larger components connected

I largely agree with your assessment, but I have a problem here: this seems like regressing rather than neglecting.

Two of the core features of DMOD are the data intelligence and orchestration. These are working for at least a meaningful portion of what's needed. I don't want to break that without a good reason.

Strictly speaking, I see a data service/dataset manager API that can't maintain domains when working with datasets as breaking things. Practically speaking, I expect it would nullify the utility of the API entirely. We - and any other users - would just manage the datasets, their data, and their domains manually when we actually need datasets in place for something.

But maybe that isn't right in practice, so if you can walk me through a counter-example in detail, that might be helpful.

All that said, I don't think I fundamentally understand why this is causing a problem for the client uploads. Why can't we just make sure that's provided somehow in the communication with the async service function mediating the uploads with the dataset manager?

aaraney commented 3 months ago

> I largely agree with your assessment, but I have a problem here: this seems like regressing rather than neglecting.

I see more of where you are coming from now. I thought we had an update_domain method on DatasetManager that would allow you to, for example, create -> add_data -> update_domain. Without a way of updating a domain, I agree with you that my suggestion is more in line with a regression. At the same time, we don't have the machinery in place (at the service and client levels, that is) to update a dataset's domain when data is added.

> Two of the core features of DMOD are the data intelligence and orchestration. These are working for at least a meaningful portion of what's needed. I don't want to break that without a good reason.

> Strictly speaking, I see a data service/dataset manager API that can't maintain domains when working with datasets as breaking things. Practically speaking, I expect it would nullify the utility of the API entirely. We - and any other users - would just manage the datasets, their data, and their domains manually when we actually need datasets in place for something.

Likewise! I think we can have both, with a primary focus on orchestration. My thinking is that, in the near term, dataset domain tracking would remain in a similar state to where it is now (within the services): dataset domain information is provided at dataset creation, and we could add an update_domain method or similar to enable modifying a dataset's domain, for example, after data has been uploaded. This gives client code a runway for integrating more completely with the data service and gives us the best of both worlds. For example, the evaluation service wouldn't have to worry about including a dataset domain when results are uploaded in the first phase of integration.
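Concretely, the shape I'm imagining (names and signatures hypothetical, not the real `dmod.core.dataset.DatasetManager` API):

```python
class DatasetManagerSketch:
    """Hypothetical additions; the real manager is dmod.core.dataset.DatasetManager."""

    def create(self, name: str, domain: "DataDomain") -> None:
        """A domain is still required at creation, as today."""

    def add_data(self, name: str, dest: str, data: bytes) -> bool:
        """No longer requires a domain with every add."""
        return True

    def update_domain(self, name: str, domain: "DataDomain") -> bool:
        """New: retroactively replace/extend the dataset's domain, enabling
        create -> add_data -> update_domain as separate client-driven steps."""
        return True
```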

My concern is not whether we can do it; it's more about timing and change propagation. Going strictly off the number of times a file has shown up in a DMOD commit, python/lib/core/dmod/core/meta_data.py takes the cake, having been modified 89 times (see the git log output below). This module contains the majority of the classes that capture DMOD's concept of a Dataset and is a direct dependency of any client that interacts with the data service. If dataset domain information is opt-in or can be retroactively updated, my inclination is that we will have an easier time keeping services functioning even if we don't have the bandwidth to prioritize upgrading services to new feature additions. In the future this attitude will have to change, but in the near term my thinking is that we should make it as easy as possible to get services hooked up.

I am likely blowing this out of proportion and making a fuss about something that is not a big deal though!

git log cmd:

```shell
git log --pretty=format: --name-only 8465473b | sort | uniq -c | sort -r | head -11
   3298
     89 python/lib/core/dmod/core/meta_data.py
     87 python/lib/scheduler/dmod/scheduler/job/job.py
     86 python/services/dataservice/dmod/dataservice/service.py
     77 docker/main/ngen/Dockerfile
     74 python/lib/scheduler/dmod/scheduler/scheduler.py
     59 python/lib/communication/dmod/communication/client.py
     58 python/lib/communication/dmod/communication/maas_request.py
     56 docker/main/docker-deploy.yml
     50 python/lib/scheduler/dmod/scheduler/job/job_manager.py
     46 python/lib/core/dmod/core/dataset.py
```

`8465473b` is the current `HEAD`.

> But maybe that isn't right in practice, so if you can walk me through a counter-example in detail, that might be helpful.

One assumption I am making is that DMOD's model orchestration capabilities are more important than the data intelligence features. I think these features are 1 and 2 in level of importance; however, it is more important that DMOD be able to run a NextGen simulation, even if that simulation is staged manually.

With that in mind, consider using the CLI to stage data and start a NextGen job. The primary goal of the user is to run a NextGen job with the data they provide to DMOD. For example's sake, let's say the user does not provide forcing data and expects that the data service will provide it. In order to stage the data, the user must be familiar with DMOD's concept of a DataDomain just to upload the data. If there is an issue in merging the DataDomain, or just a user error in understanding the concept of a DataDomain, this stands in the way of starting a NextGen job via DMOD, and I don't think, at this point, it should. Sure, we could, say, bypass the data service and upload the data directly to the object store using the MinIO CLI, but that requires even more knowledge of how the data service works.

> All that said, I don't think I fundamentally understand why this is causing a problem for the client uploads. Why can't we just make sure that's provided somehow in the communication with the async service function mediating the uploads with the dataset manager?

Yeah, I don't think this fundamentally causes a problem, per se. It just makes integration more tedious and requires that a user / dev learn the concept of a DMOD DataDomain. In the future, I generally agree that it should be a requirement; I am just not sold on now being the right time, is all.

robertbartel commented 3 months ago

> I thought we had an update_domain method on DatasetManager that would allow you to, for example, create -> add_data -> update_domain.

So, ideally, I see modifications to datasets as atomic. We don't have the infrastructure for proper transactions, so the API calls have to handle things all at once, rather than, e.g., add data first and then subsequently and separately update the domain.

But on that I think it would be safe to relax requirements, at least temporarily, if it provides more convenience. I.e., we can add an update_domain function, and simply expect the client to carry out the entire "transaction." For adds, a failed/omitted domain update doesn't impact the usability of the dataset. For deletes, it would, but that's much less used in practice (and we could block it entirely for now if needed). And then we don't have to include the domain update as part of the add.
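From the caller's side, that relaxed, client-driven "transaction" might look like this (a sketch against the hypothetical `update_domain` above):

```python
def upload_with_domain(mgr, name: str, dest: str, data: bytes, new_domain) -> None:
    """Client carries out the whole 'transaction': add the data, then update
    the domain as a separate call."""
    if not mgr.add_data(name, dest, data):
        raise RuntimeError("add_data failed; dataset and its domain are unchanged")
    try:
        mgr.update_domain(name, new_domain)
    except Exception:
        # For adds, a failed/omitted domain update doesn't reduce the
        # dataset's usability; the new item just isn't discoverable until
        # the domain update is retried.
        pass
```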

robertbartel commented 1 month ago

After discussion, there are a few task items that we've come up with: