GeoNode / geonode

GeoNode is an open source platform that facilitates the creation, sharing, and collaborative use of geospatial data.
https://geonode.org/

GNIP 100: Assets #12124

Open etj opened 3 months ago

etj commented 3 months ago

GNIP 100 - Assets

Overview

We need a way to identify files (local, remote, in the cloud...) in their own right. At the moment there is no way to identify data files by themselves: they are only referenced by the field `ResourceBase.files`.

Also, the StorageManager is pluggable, but only allows for a single storage backend at once. By having different subclasses of Asset (e.g. LocalAsset, S3Asset, ...) we may have a GeoNode instance handling datafiles on different data store backends.
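To make the idea of backend-specific subclasses more concrete, here is a minimal sketch in plain Python (not Django models). The field names `title`, `location` and `bucket`, and the `backend()` method, are assumptions made for illustration only; they are not taken from the proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    """Generic piece of data; subclasses say where the data lives."""
    title: str
    # Hypothetical field: paths or keys of the files making up the asset.
    location: List[str] = field(default_factory=list)

    def backend(self) -> str:
        raise NotImplementedError


@dataclass
class LocalAsset(Asset):
    def backend(self) -> str:
        return "filesystem"


@dataclass
class S3Asset(Asset):
    # Hypothetical field: name of the bucket holding the data.
    bucket: str = ""

    def backend(self) -> str:
        return f"s3://{self.bucket}"


# One instance can hold assets that live on different backends.
assets = [
    LocalAsset(title="roads", location=["/data/roads.shp"]),
    S3Asset(title="dem", location=["tiles/dem.tif"], bucket="example-bucket"),
]
backends = [a.backend() for a in assets]
```

The point of the sketch is only that the backend is decided per asset rather than per installation.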

Proposed By

Assigned to Release

This proposal is for GeoNode 4.3 (?)

State

Motivation

Proposal

We introduce the concept of Asset as generic data, that may be linked to a ResourceBase. A LocalAsset represents data stored in the filesystem (either a single file or a directory tree).

The Asset class will replace and augment the information stored at the moment in the ResourceBase.files field.

An Asset is associated with a Resource through a Link, which also tells the URL through which the Asset will be available to the GeoNode users.
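The relation described here could be sketched as follows, again with plain dataclasses standing in for the Django models. The nullable `asset` reference on `Link` follows the proposal; the URL pattern and the helper `link_asset` are purely hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    pk: int
    title: str


@dataclass
class Asset:
    pk: int
    title: str


@dataclass
class Link:
    resource: Resource
    url: str
    # Nullable: existing Links that do not point to an Asset keep asset=None.
    asset: Optional[Asset] = None


def link_asset(resource: Resource, asset: Asset,
               base_url: str = "https://geonode.example.org") -> Link:
    """Associate an Asset with a Resource; the Link carries the public URL.

    The URL layout below is an assumption, not GeoNode's actual routing.
    """
    url = f"{base_url}/assets/{asset.pk}/download"
    return Link(resource=resource, url=url, asset=asset)


link = link_asset(Resource(pk=1, title="Roads"), Asset(pk=10, title="roads.zip"))
```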

Other usages for assets

Since the Asset object is quite simple, we could use it for other purposes as well; for instance, at the moment we use "unadvertised" ResourceBase instances for providing simple data to GeoStories (images, PDFs, ...). Instead of using such a heavy object, we could just use LocalAssets for this purpose.

Also, more Assets may be associated with an existing ResourceBase; this behavior replicates what GeoNetwork is already doing, that is having multiple data resources pointed by a single metadata record.

Permissions

In the future there could be different permissions for a Resource and its linked Assets. For the sake of simplicity, as a first step we may grant the Asset the very same permissions as the linked ResourceBases.

In the case we want to associate an Asset to more than one Resource, the Asset will be available if the user has download privileges on at least one of the associated Resources.

Implementation

[Figure: GeoNode asset diagram]

Model:

Logic:

DB migration:

API:

Authorization

A user has access to an Asset's data if and only if that Asset is associated with at least one ResourceBase for which the user has download permissions.
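The rule can be written down as a small predicate. This is a sketch over in-memory data, assuming links are given as `(resource_id, asset_id)` pairs and permissions as a mapping from user name to the resource ids the user may download; a real implementation would query the GeoNode permission system instead.

```python
from typing import Dict, Iterable, Set, Tuple


def can_download_asset(
    user: str,
    asset_id: int,
    links: Iterable[Tuple[int, int]],
    download_perms: Dict[str, Set[int]],
) -> bool:
    """True iff the asset is linked to at least one resource the user may download."""
    allowed = download_perms.get(user, set())
    return any(
        resource_id in allowed
        for resource_id, linked_asset_id in links
        if linked_asset_id == asset_id
    )


# Asset 10 is shared by resources 1 and 2; asset 11 belongs to resource 3.
links = [(1, 10), (2, 10), (3, 11)]
perms = {"alice": {2}, "bob": {3}}
```

Note that an Asset not linked to any Resource is not downloadable by anybody under this rule.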

Backwards Compatibility

Future evolution

Decoupled uploads

A user may upload an Asset without having to associate it to a Resource. Unassociated Assets may be used to automatically create ResourceBases and attach the asset to them.

Deprecate Documents

Once Assets are fully characterized, the Document object will no longer have much meaning, especially considering that users upload as a Document anything that is not published as a Layer. This means that we will be able to remove the Document class, and convert its instances into ResourceBases with an Asset handling the former document's data.

Cleanup uploaded files

Some old installations keep the uploaded data in /data.
The recent importer stores the uploaded data in .../STATIC_ROOT/uploaded, and GeoServer publishes the GeoTIFF from that directory. The final migration to Assets will store the files in .../STATIC_ROOT/assets, and GeoServer shall publish the files from there. In order to clean up such obsolete setups, a migration script could be written that:

Feedback

Update this section with relevant feedback, if any.

Voting

Project Steering Committee:

Links

Remove unused links below.

etj commented 3 months ago

Refactoring data upload procedure

The initial Asset implementation could be completely hidden to the GeoNode user, since the changes are only applied on the backend logic.

When a user uploads some data, the original data will be saved as an Asset.

Then, some heuristic will find the type of the uploaded data:

When the Resource is created, the Asset pointing to the uploaded data is linked to the ResourceBase via a Link (the new nullable asset foreign key shall be added). Examples:

We may need to split the logic:

1) Upload data and create an Asset
2) Create a Resource from an existing Asset

In this way, once we handle unassociated Assets, we may be able to run the creation of the related Resources in unattended commands.
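A rough sketch of the two separate steps, using the filesystem and plain dataclasses. The function names `upload_as_asset` and `create_resource_from_asset`, the per-asset directory named after a UUID, and the `owner` field are all assumptions made for illustration.

```python
import shutil
import tempfile
import uuid
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class LocalAsset:
    owner: str
    location: List[str]


@dataclass
class Resource:
    title: str
    assets: List[LocalAsset] = field(default_factory=list)


def upload_as_asset(src: Path, assets_root: Path, owner: str) -> LocalAsset:
    """Step 1: copy the uploaded file into its own directory and record it."""
    target_dir = assets_root / uuid.uuid4().hex
    target_dir.mkdir(parents=True)
    target = target_dir / src.name
    shutil.copy2(src, target)
    return LocalAsset(owner=owner, location=[str(target)])


def create_resource_from_asset(asset: LocalAsset, title: str) -> Resource:
    """Step 2: independent of the upload, so it could run later and unattended."""
    return Resource(title=title, assets=[asset])


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    upload = tmp_path / "roads.geojson"
    upload.write_text('{"type": "FeatureCollection", "features": []}')

    asset = upload_as_asset(upload, tmp_path / "assets", owner="alice")
    stored_ok = Path(asset.location[0]).is_file()
    resource = create_resource_from_asset(asset, title="Roads")
```

Because step 2 only needs an existing Asset, nothing in it depends on the upload request itself.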

etj commented 3 months ago

Authorization

An improvement comes for free with the Asset refactoring. At the moment downloadable files are public: if the URL of a data resource is leaked by someone who has access to the Resource, that URL can be used by anybody to download the data file.

By checking the authorizations for the URL accessing the Assets' data, we'll add protection to the published data, allowing the download only to users having access to the Resource.

gannebamm commented 2 months ago

I generally like the idea of assets and forming a resource out of multiple assets. This was also discussed beforehand in a research data infrastructure group and we thought about using the Research Object Crate (RO-Crate) concept or the Annotated Research Object (ARC) concept as our 'assets'. It was just a short discussion and we have yet to do anything in terms of how to incorporate it into the GeoNode architecture. But these parts of the mentioned Motivation are of particular interest to us as research institutes:

Motivation [...]

  • Allows the possibility to link a single ResourceBase with multiple data files (think for instance about a Document having multiple PDF files for different languages).
  • Allows the definition of a directory hierarchy as a single data asset, making it possible to publish complex data.

Here is an excerpt of the brief discussion the research infrastructure group had regarding RO-Crates. It is a bit outdated, but I think you get the gist of it:


Looking at other data portals like CKAN or OpenAgrar (based on MyCORE framework), you can describe a dataset which consists of multiple files/resources. Here are two examples:

https://demo.ckan.org/dataset/sample-dataset-1

https://www.openagrar.de/receive/openagrar_mods_00054877?lang=en

The latter example on a GeoNode instance:

https://atlas.thuenen.de/layers/geonode_data_ingest:geonode:bze_lw_standorte_verschleiert

Where the additional files are linked as documents: [image]

There is a GeoNode developer workshop creating a so-called GeoCollection object to link multiple GeoNode ResourceBase objects together: https://docs.geonode.org/en/master/devel/workshops/index.html#create-your-own-django-app

My idea is to build on top of this concept and try to implement RO-Crate as a collection object: https://www.researchobject.org/ro-crate/

RO-Crates do use a metadata JSON to describe the Crate: https://www.researchobject.org/ro-crate/1.1/root-data-entity.html In this JSON, datasets can be defined as web resources: https://www.researchobject.org/ro-crate/1.1/data-entities.html#web-based-data-entities

Most (all?) of the listed attributes of those datasets can be read by the GeoNode API for the bundled resources. Therefore, you only need to describe the ROCrate bundle itself.
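As a rough illustration of what such a bundle description could look like, the sketch below builds an RO-Crate 1.1 style metadata document whose parts are web-based data entities. The function `build_ro_crate` and the input dictionaries are hypothetical, the example URL is a placeholder, and the sketch leaves out other root properties that the specification asks for (for example description, datePublished and license).

```python
import json
from typing import Dict, List


def build_ro_crate(name: str, datasets: List[Dict[str, str]]) -> dict:
    """Describe a bundle of web resources as an RO-Crate metadata document."""
    graph = [
        {
            # Metadata file descriptor, pointing at the root data entity.
            "@type": "CreativeWork",
            "@id": "ro-crate-metadata.json",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root data entity: the bundle itself.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "hasPart": [{"@id": d["url"]} for d in datasets],
        },
    ]
    for d in datasets:
        # Web-based data entity: the @id is the absolute URL of the resource.
        graph.append({
            "@id": d["url"],
            "@type": "File",
            "name": d["title"],
            "encodingFormat": d["format"],
        })
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


crate = build_ro_crate(
    "Example bundle",
    [{"url": "https://geonode.example.org/datasets/1/download",
      "title": "Sample layer", "format": "application/zip"}],
)
crate_json = json.dumps(crate, indent=2)
```

In this sketch the per-dataset values (`title`, `format`, `url`) are the kind of attributes that would be read from the GeoNode API, so only the bundle name has to be supplied by hand.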


I do not propose using RO-Crates as base implementation for assets! I just wanted to make clear that the underlying motivation is interesting for a part of the GeoNode community.

gannebamm commented 2 months ago

@etj in the implementation ERD diagram, the link between ResourceBase (through Link) and Asset is shown as 0..1. Shouldn't this be a 0..n since multiple Assets can form one ResourceBase?

In a settings file in which the storing of original data is disabled, there will be no Asset for a Dataset. In our workflow, datasets are often ingested via PostgreSQL directly and then registered with the updatelayers command. For those, there is also no Asset per se. Do you think this will be an issue? It is marked as 0..n, so it should be ok from a database model standpoint, but is it ok from a user's perspective?

etj commented 2 months ago

@gannebamm, about cardinalities: An Asset instance is an internal representation of files or data. Each Asset is presented to the external clients as a Link. So

Datasets or other ResourceBase can have no associated Assets at all, as in the case of Datasets only related to GeoServer layers.

ridoo commented 2 months ago

@etj I like the idea of making things more flexible here. I took some time to think about the GNIP and want to make some comments, also by sprinkling in questions and personal opinions. However, I cannot foresee what components and workflows (e.g. geonode-importer) have to be touched in the end.

Technical questions

By having different subclasses of Asset (e.g. LocalAsset, S3Asset, ...) we may have a GeoNode instance handling datafiles on different data store backends.

Does this mean that each asset has its own StorageManager/-Handler where actual download is being delegated to? Does this complement or even rescind the changes you did recently to the DownloadHandler?

What if there is a SLD (or any other satellite file)? is it a separate Asset?

To me, this is definitely an asset on its own, which could also be applied to multiple resources. However, how do we differentiate XML files that shall serve as an asset from XML files that are to be interpreted as metadata files?

Backwards Compatibility API: old files array can be preserved in output

I could not find a files field in the resource API. As far as I can see, ResourceBase.files includes the local paths to the files uploaded originally. Right now, it is unclear to me if these are used somewhere (besides extracting some metadata, e.g. EXIF, during the import process).

This means that we will be able to remove the Document class, and convert its instances into ResourceBases with an Asset handling the former document's data.

So ResourceBase is going to become a first class citizen and serves as a logical brace for simple assets, right?

From the end user perspective

Opportunities

Besides those opportunities you mentioned already, I see the following:

giohappy commented 2 months ago

@ridoo thanks for your comments. You touched on several points that we also included in our discussion. Many of them will probably come in the future, since the concept of Assets could bring copernican changes to GeoNode in many ways...

Let me explain the current scope of this proposal first. Assets will provide the foundation for many use cases. The ones we're facing now are:

We're not going to cover the management of Assets from the GeoNode UI in this initial implementation. For the moment we only want to prepare the models to support present and future functionalities. We assume that the resource has been configured with assets in some way (DB operations, Django Admin, whatever).

@etj can you please confirm, correct, extend the points above? I also agree with @ridoo that the point about the fields API should be clarified.

ridoo commented 2 months ago

@giohappy @etj looking forward to some details.

You mentioned "copernican changes". To me, this sounds bigger than 'just' introducing an additional concept (here Asset). Just to stay curious: Is there more you have in mind?

giohappy commented 2 months ago

You mentioned "copernican changes". To me, this sounds bigger than 'just' introducing an additional concept (here Asset). Just to stay curious: Is there more you have in mind?

Let's say that this is the first step that could lead to more important changes in the future. We don't have a roadmap, actually, but making the relation between Catalog resource and Data source could:

Regarding the details on the points discussed in your comment, we need to wait for @etj, who is working hard these days to connect the dots and prepare more information to share :)

giohappy commented 2 months ago

@ridoo after reviewing with @etj the status of this PR (which is ready for review), we confirm that it neither changes nor adds features to GeoNode for the moment. In terms of public APIs and functionality it behaves exactly the same as before, with files and the single local storage manager replaced by Assets and their specific storage and download managers.
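One way to picture "Assets and their specific download managers" is a lookup from asset type to handler. The sketch below is not the code of the PR: the `HANDLERS` registry, the handler functions and the returned URI schemes are assumptions chosen only to show the dispatch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Type


@dataclass
class Asset:
    location: str


class LocalAsset(Asset):
    pass


class S3Asset(Asset):
    pass


def local_download(asset: Asset) -> str:
    # Hypothetical handler: serve the file straight from the filesystem.
    return f"file://{asset.location}"


def s3_download(asset: Asset) -> str:
    # Hypothetical handler: hand back an object-store reference.
    return f"s3://{asset.location}"


# Registry mapping each asset type to its download handler.
HANDLERS: Dict[Type[Asset], Callable[[Asset], str]] = {
    LocalAsset: local_download,
    S3Asset: s3_download,
}


def download_target(asset: Asset) -> str:
    """Pick the handler registered for the asset's class (or a parent class)."""
    for cls in type(asset).__mro__:
        handler = HANDLERS.get(cls)
        if handler is not None:
            return handler(asset)
    raise LookupError(f"no download handler for {type(asset).__name__}")
```

With a single `LocalAsset` handler registered, callers see the same behavior as with one local storage manager, which matches the "no functional change" scope described above.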

The next steps will be the implementation of the "primary" asset concept and the multiplicity of assets that can be assigned/downloaded to/from a resource. It will come with a new GNIP.