Closed soxofaan closed 1 year ago
Have you had a look at #319 (WIP)?
none of the current back-ends (or clients) implemented the user file storage "microservice" of the openEO API
I'm confused. It's present since years in JS, Web Editor and R and also present in the GEE back-end. EODC also claims that it's available.
users often want to share files with other users
Indeed, sharing was never implemented or defined due to lack of time (for implementation). I'd be happy to work on a specification, but I don't see who would implement that anytime soon as other long-existing features such as the pure existence of a user file storage are not even present yet.
load from URL
That's something that is also missing from raster cubes. For both vector and raster you could likely re-use load_result with a URL, although that requires STAC, which is not really embracing vector data yet.
load_collection originally supported vector cubes, but that was removed
Yes, but actually only because we had no definition for vector cubes yet and the behavior was undefined. The intention there was to re-add it once we have that defined, which we are getting closer to.
I'm confused. It's present since years in JS, Web Editor
My bad indeed, too many VITO-oriented assumptions apparently, I should have explored this more.
load from URL That's something that is also missing from raster cubes,
Yes, but I think it's significantly more important for vector data. Handling raster data is usually a big data problem, which is not ideal to solve with user-facing URLs. (Input) vector data will typically be small data (e.g. just one relatively small file), and handling them through URLs is straight-forward, both for client and back-end side.
Have you had a look at https://github.com/Open-EO/openeo-processes/pull/319 (WIP)?
I missed that apparently when searching for related github tickets. Some quick notes:
load_collection: this is about loading predefined vector cubes, which, as noted above, I don't expect to be a very practical solution for vector data
load_result: this assumes the user already got a vector cube in the system somehow (chicken and egg) so it doesn't fundamentally solve the problem
load_uploaded_files: is most feasible of existing options, but comes with challenges (also noted above)
In short, I think the (technically) simplest and most versatile way to have vector cube loading functionality in the geopyspark driver and aggregator is loading from URL. In VITO and related backends we already support that in the read_vector process, but we want to standardize this and make sure it fits nicely with the other load_ processes.
Several options:
load_uploaded_files: currently only relative paths to user-uploaded files are allowed. Disadvantage is that the process name explicitly contains uploaded_files, which is not ideal naming-wise for loading URLs
rename load_uploaded_files to load_files and support URLs
load_url
load_collection: this is about loading predefined vector cubes, which, as noted above, I don't expect to be a very practical solution for vector data
Why? Backends could surely provide some larger commonly used datasets through that, e.g. https://developers.google.com/earth-engine/datasets/tags/boundaries
load_result: this assumes the user already got a vector cube in the system somehow (chicken and egg) so it doesn't fundamentally solve the problem
In principle, you could load any URL, but this highly depends on the implementation. But we can surely add a "load_external" or so, which only allows loading external files from a URL. Shall I propose something?
load_external sounds like a good solution, but I'm wondering if it would differ enough (in terms of parameters and description) from the current load_uploaded_files to justify a new/other process.
I think the modalities of loading the resources can be encoded with some kind of scheme (like done in other software):
file://: user uploaded file
https://: load from URL
I think they are distinct enough and while you can implement load_external always, load_uploaded_files only works if a workspace is present. So keeping them separate is better to avoid two versions of load_external in case no user workspace is present. I'd restrict load_uploaded_files to relative paths (user workspace access, should be status quo) and load_external to (absolute) URLs, i.e. http(s), maybe ftp or so. Where I see more commonalities are load_result (with a URL) and load_external. Should load_external also be able to accept extents, bands etc?
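To make the proposed split concrete, here is a minimal sketch of how a back-end or client might route a source string: relative paths go to load_uploaded_files, absolute http(s) URLs go to a load_external-style process. The helper name `classify_source` and the `EXTERNAL_SCHEMES` set are assumptions for illustration only, not part of any openEO specification.

```python
from urllib.parse import urlsplit

# Assumption: only http(s) is accepted for external loading; the thread
# mentions "maybe ftp or so" but leaves that open.
EXTERNAL_SCHEMES = {"http", "https"}


def classify_source(source: str) -> str:
    """Return the name of the process that would handle `source` (hypothetical)."""
    parts = urlsplit(source)
    # Absolute URL with an accepted scheme and a host: external loading.
    if parts.scheme in EXTERNAL_SCHEMES and parts.netloc:
        return "load_external"
    # No scheme and not an absolute path: treat as a relative workspace path.
    if not parts.scheme and not source.startswith("/"):
        return "load_uploaded_files"
    raise ValueError(f"unsupported source: {source!r}")
```

A back-end without a user workspace would simply not expose load_uploaded_files, so the second branch would never be reachable there.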
(FYI: @jdries pointed me to https://github.com/Open-EO/openeo-api/issues/135 as well, which is closely related to this discussion)
Should load_external also be able to accept extents, bands etc?
at the moment I'm focused on loading vector data/cubes, so additional load_collection-style filters are not high prio for me.
Moreover, the reason to have filter options inside load_collection/load_result is to allow back-ends to have load-time optimizations when loading from big data sets, based on metadata that is available at the back-end.
For load_external this argument is probably not valid anymore: e.g. we're loading a relatively small number of predefined (relatively small) files, possibly without any metadata that can be leveraged to avoid loading unnecessary data.
So in a first iteration I don't think full alignment with load_collection/load_result is useful.
Hmm, I don't really understand how https://github.com/Open-EO/openeo-api/issues/135 is closely related? It's just another way of providing a user workspace for files, right?
Anyway, I'll post a PR for load_external later. Seems pretty straightforward.
I would really prefer to have the same process for loading files, whether externally or from a workspace. Separate processes would again lead to the problem where process graphs would need to adapt depending on where the file is coming from. If a user tries to load a file from an internal workspace on a backend that does not at all support uploading files, then where would the file path come from?
It's not a good idea that we design a process that supports two options, but one may not be supported. So back-ends need to change the process definition based on the API capabilities, which is prone to issues.
As the UDP is coming to the table again and again, we should probably focus on getting a solution for calling processes by parameter: https://github.com/Open-EO/openeo-api/issues/413 / https://github.com/Open-EO/openeo-processes/pull/307 - Eventually, this issue will always arise and you can't just reduce everything into one big process, e.g. all reducers. I mean the issue already exists if you want to load from various sources, e.g. from results or from a user-uploaded file.
If a user tries to load a file from an internal workspace on a backend that does not at all support uploading files, then where would the file path come from?
I don't understand the question. If you don't implement/support uploading files, you don't implement load_uploaded_files and as such don't need a path?!
The workaround where we call processes by parameter is still more complex than simply having one process. openEO is about solving complex things in the backend, so that users don't have to deal with it. With the parameterized process proposal, both the UDP implementor and user are facing more complexity. Generalizing the problem (like the reducers analogy) doesn't really help with the discussion, we need to evaluate pros and cons on a case by case basis. Sometimes there will be good reasons to split processes, in other cases not.
My question was actually trying to reply to:
So keeping them separate is better to avoid two versions of load_external in case no user workspace is present.
I don't get the problem with loading files when no user workspace is present?
But if this is now coming up on a nearly weekly basis, then it seems that we should look into a solution. We have the issue here for the reducers, it's not a generalization. It's an actual use case (outside of openEO Platform). How shall I solve it? It's simply not possible unless I do a lot of if/else, but that would also be possible for loading data, of course.
The issue is that you need to adapt the process specification and remove one of the schemas if user uploads are not supported. That's all I meant. So it's not ideal, seeing that many providers just copy&paste the processes without adapting them and as such expose something that is not actually there. But I could say that's an issue for them, indeed.
No problem to work on parameterization of process names, it is a useful thing to have for certain cases.
It's indeed not ideal if backends claim support for something because of copy pasting, but it's already very good that a correct solution is in fact available through the schemas. If we then ever have a very advanced aggregator that automatically adapts strategy based on where a backend can load data from, then that would work. But this is all fairly futuristic, for now, users will indeed simply get an error if an unsupported feature is used, and then they can contact the backend.
It's not a good idea that we design a process that supports two options, but one may not be supported.
We already do that with the format parameter: back-ends have the liberty to support different input/output formats.
Supporting different storage solutions feels like the same thing, which often is implemented in other software through a protocol/scheme prefix. This is a well established concept and it won't make the process description that much harder.
As a user I appreciate it when software or a library takes care of all the annoying file format details and storage details, and I just can use the same read or load function to load, for example, a local geotiff file or a remote netcdf file. I think openEO should cater for users that expect this level of simplicity.
I agree with @soxofaan and I would go with load_external where it's possible to load from an URL or (if available) from the user workspace. Allowing the loading from an external URL would also allow to use vector processes on one back-end and re-use the result on another one that does not support them yet.
I just can use the same read or load function to load
Means eventually combining all load_* processes into a single one?
I agree with @soxofaan and I would go with load_external where it's possible to load from an URL or (if available) from the user workspace.
Okay, but user workspace is not external. So I'd propose renaming load_uploaded_files to load_file(s?) and add URL support.
Allowing the loading from an external URL would also allow to use vector processes on one back-end and re-use the result on another one that does not support them yet.
That's already captured by load_results?!
Okay, but user workspace is not external. So I'd propose renaming load_uploaded_files to load_file(s?) and add URL support.
yes, maybe load_file(s) fits better both scenarios.
That's already captured by load_results?!
from my point of view no. Currently load_result doesn't support vector_cubes and it has been updated to support the same parameters as load_collection (spatial_extent, temporal_extent and so on). So it has been moved to be more raster specific than general purpose.
Anyway, load_result across back-ends would work only within the federation (openEO Platform) and not with other back-ends (EURAC, GEE).
Currently load_result doesn't support vector_cubes
No process really supports vector-cubes yet, we are just adding it in right now. See #319.
and it has been updated to support the same parameters as load_collection (spatial_extent, temporal_extent and so on).
spatial/temporal can be used with vector, only bands is raster specific.
Anyway, load_result across back-ends would work only within the federation (openEO Platform) and not with other back-ends (EURAC, GEE).
That's maybe a restriction of openEO Platform, but the process doesn't specify such a restriction. It allows retrieving results by URL. So if you have published your result, then you could in principle load it from everywhere, even from GEE.
Means eventually combining all load_* processes into a single one?
Ultimately that could be an option, but I don't think we should aim for that at this point.
load_collection and load_result are a bit special in the sense that the storage details are not really standardized and mostly hidden for the user. It's probably feasible to replace load_result with loading from URLs, but loading just by job_id is a nicer UI for the user I think.
So I'd propose renaming load_uploaded_files to load_file(s?) and add URL support.
:+1: load_data could also be an option (which is a bit more generic than _files)
Ultimately that could be an option
But a complicated one: How do you decide between collections, files, results? They are all just strings and you eventually will run into conflicts.
It's probably feasible to replace load_result with loading from URLs
Basically, the load_* processes are now structured as such:
load_data
The load* processes are named in a way that they reflect the "source". Data is such an overloaded and general term that it could basically include all load* processes.
But a complicated one: How do you decide between collections, files, results? They are all just strings and you eventually will run into conflicts.
not necessarily if you use scheme prefix e.g. collection:SENTINEL1_GRD, collection://openeo.vito.be/SENTINEL1_GRD, job:12312-43223-2343, job://openeo.cloud/23432543-234-23423.
But then again: I don't think merging load_collection or load_result into a generic load_ process should be high prio at the moment. It could be something to consider for a v2 or v3 of the API.
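As a rough illustration of the scheme-prefix idea, the sketch below splits references of the form `kind:id` or `kind://host/id` so that collections and job results cannot collide even though both are plain strings. `SourceRef` and `parse_source` are hypothetical names; nothing like this is defined in the openEO API.

```python
from typing import NamedTuple, Optional


class SourceRef(NamedTuple):
    kind: str            # e.g. "collection" or "job"
    host: Optional[str]  # back-end host, if one was given
    identifier: str      # collection id or job id


def parse_source(ref: str) -> SourceRef:
    """Split 'kind:id' or 'kind://host/id' into its parts (hypothetical helper)."""
    kind, sep, rest = ref.partition(":")
    if not sep or not kind or not rest:
        raise ValueError(f"missing scheme prefix: {ref!r}")
    if rest.startswith("//"):
        # Fully qualified form: the first path segment after '//' is the host.
        host, _, identifier = rest[2:].partition("/")
        if not host or not identifier:
            raise ValueError(f"expected kind://host/id: {ref!r}")
        return SourceRef(kind, host, identifier)
    # Short form: the identifier is resolved on the current back-end.
    return SourceRef(kind, None, rest)
```

With this, `parse_source("collection:SENTINEL1_GRD")` and `parse_source("job:12312-43223-2343")` yield different `kind` values, which is what would let a generic load process decide where to look.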
load_files: Read from exactly one source file directly (i.e. required metadata needs to be included) by URL or path (uploaded by user) - no filtering due to single file and usually smaller size
FYI: load_uploaded_files currently expects an array of paths, not a single file:
https://github.com/Open-EO/openeo-processes/blob/d0ce91fcd347360b907ea2d9589d7564a2c1e1e3/proposals/load_uploaded_files.json#L12-L16
Going for single file is fine for loading vector cubes I guess, but for loading raster data, it's probably best to have an option to specify multiple files.
Good point regarding the array of paths. I guess then we should allow arrays of URLs too. So, replace all "single" with "multiple". So then the difference is that load_result points to STAC catalogs and load_files points to the data files (in STAC terminology: assets) directly.
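For illustration, a process graph node with an array of sources could look roughly like the following, written here as a Python dict. This is only a sketch: "load_files" is a name floated in this thread, not an existing process, the "paths" and "format" argument names are assumed to mirror load_uploaded_files, and the file names and URL are placeholders.

```python
import json

# Hypothetical node: mixes a relative workspace path and an absolute URL
# in the same array, as discussed above.
node = {
    "process_id": "load_files",  # assumed name, not a defined process
    "arguments": {
        "paths": [
            "fields/parcels.geojson",                    # user workspace (placeholder)
            "https://example.com/data/regions.geojson",  # external URL (placeholder)
        ],
        "format": "GeoJSON",
    },
}

print(json.dumps(node, indent=2))
```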
The load* processes are named in a way that they reflect the "source". Data is such an overloaded and general term that it could basically include all load* processes.
"data" is indeed very generic. My concern is mainly that "file" has a very "static" sound to it, while URLs can be more dynamic, e.g. it can be an on-the-fly query. But that's probably just a matter of POV, the back-end usually does not have to be aware of that and can consider it to be a "file".
Update to listing of https://github.com/Open-EO/openeo-processes/issues/322#issuecomment-1463846950:
load_url (from #428)
While there are various discussions about how to conceptually define and handle vector cubes, I don't think we have already a standardized solution to load the vector data in the first place (except for inline GeoJSON).
I'll first try to list a couple of vector loading scenarios (with varying degrees of practicality and usefulness) and initial discussion of possible solutions (if any)
load_uploaded_files (proposal), but it currently only supports returning raster cubes (it originally supported vector cubes, but that was removed in #68)
read_vector: user has the ability to upload/download/construct files in their Terrascope workspace
load_result exists, but is raster cube output only at the moment, and parameter wise it is also very raster-cube-oriented.
load_collection originally supported vector cubes, but that was removed in #68