GeoNode / geonode

GeoNode is an open source platform that facilitates the creation, sharing, and collaborative use of geospatial data.
https://geonode.org/

GNIP 89: Architecture Design - Resource and Storage Manager Modules #7664

Closed afabiani closed 3 years ago

afabiani commented 3 years ago

GNIP 89: Architecture Design - Resource and Storage Manager Modules

Overview

Currently the architectural model of GeoNode is flat.

We have a ResourceBase class, which stores almost all the metadata fields for every generic GeoNode resource, and several concrete subclasses, such as Layer, Map, Document and GeoApp, that add a few more attributes and methods specific to the instance itself.

Whenever we need to create a new resource on GeoNode, we currently:

  1. Create the ResourceBase instance and set/update some metadata fields
  2. Set the default permissions
  3. Check for the available GIS backend and invoke a bunch of custom methods and signals
  4. Rely on several scattered signals, some from the geonode.base module, some from the geonode.<type> one and, finally, some from the geonode.<backend_gis> one, to finalize the configuration
  5. Perform a number of hardcoded checks and controls, which depend mostly on the installed modules

This "functional" approach is confusing and error prone. Moreover, it is quite difficult to update or change the backends: the developer must often deal with convoluted if-else checks in every view and template.

What we would like to achieve with this GNIP, is:

  1. Make GeoNode core more modular, clean and truly pluggable
  2. Get rid of hardcoded if-else checks
  3. Get rid of most of the pre/post <delete>/<save> signals
  4. Make the backend GIS pluggable and centralize any further checks we need when updating a resource's metadata or security permissions

Last, but not least, we would also like to make the uploader module more stable and efficient by avoiding redundant calls and signal fallbacks.

Proposed By

Alessio Fabiani <@afabiani> Giovanni Allegri <@giohappy> Ricardo Garcia Silva <@ricardogsilva> Mattia Giupponi <@mattiagiupponi>

Assigned to Release

This proposal is for GeoNode 4.0.

State

Proposal

In order to achieve our goals, we need to revisit the current GeoNode core architecture design.

[Fig. 1: the new GeoNode resource management architecture]

We envisage four main components of the new GeoNode resource management architecture (see Fig.1).

Storage Manager

The Storage Manager will organize and allocate the raw data of GeoNode resources, whatever a GeoNode resource may be. Through the Storage Manager it will be possible to read/write raw data from a pluggable (aka concrete) storage, which could be a FileSystemStorage as well as a DropboxStorage or an AmazonS3Storage. The concrete storage will be defined by the settings through a factory mechanism.

On the other side, we will benefit from a generic storage interface exposing generic methods for the R/W operations on the raw data.

class StorageManagerInterface(metaclass=ABCMeta):

    @abstractmethod
    def delete(self, name):
        pass

    @abstractmethod
    def exists(self, name):
        pass

    @abstractmethod
    def listdir(self, path):
        pass

    @abstractmethod
    def open(self, name, mode='rb'):
        pass

    @abstractmethod
    def path(self, name):
        pass

    @abstractmethod
    def save(self, name, content, max_length=None):
        pass

    @abstractmethod
    def url(self, name):
        pass

    @abstractmethod
    def size(self, name):
        pass

    @abstractmethod
    def generate_filename(self, filename):
        pass

Resource Manager

The Resource Manager exposes atomic operations for managing the GeoNode ResourceBase models. This is an internal component meant to be used primarily by the Resource Service to manage the publication of ResourceBases into GeoNode.

Like the Storage Manager, the Resource Manager will also benefit from an abstract Resource Manager Interface, a default implementation and a concrete resource manager, which will be pluggable, defined through the settings via a factory mechanism, and able to deal with the backend GIS.

According to this architectural design, almost all the logic specifically bound to the backend GIS will be moved to the concrete resource manager, ending up with a real separation of concerns.

The proposed Resource Manager Interface will expose generic methods for performing CRUD operations against a ResourceBase. According to this paradigm, a GeoNode developer should never break the contract by manually instantiating a ResourceBase, but should instead go through the resource manager methods. This is the only way to guarantee that the ResourceBase state is always coherent and aligned with the backend GIS.

The proposed Resource Manager Interface would look something like the following:

class ResourceManagerInterface(metaclass=ABCMeta):

    @abstractmethod
    def search(self, filter: dict, /, type: object = None) -> QuerySet:
        pass

    @abstractmethod
    def exists(self, uuid: str, /, instance: ResourceBase = None) -> bool:
        pass

    @abstractmethod
    def delete(self, uuid: str, /, instance: ResourceBase = None) -> int:
        pass

    @abstractmethod
    def create(self, uuid: str, /, resource_type: object = None, defaults: dict = {}) -> ResourceBase:
        pass

    @abstractmethod
    def update(self, uuid: str, /, instance: ResourceBase = None, xml_file: str = None, metadata_uploaded: bool = False,
               vals: dict = {}, regions: dict = {}, keywords: dict = {}, custom: dict = {}, notify: bool = True) -> ResourceBase:
        pass

    @abstractmethod
    def exec(self, method: str, uuid: str, /, instance: ResourceBase = None, **kwargs) -> ResourceBase:
        pass

    @abstractmethod
    def remove_permissions(self, uuid: str, /, instance: ResourceBase = None) -> bool:
        pass

    @abstractmethod
    def set_permissions(self, uuid: str, /, instance: ResourceBase = None, owner=None, permissions: dict = {}, created: bool = False) -> bool:
        pass

    @abstractmethod
    def set_workflow_permissions(self, uuid: str, /, instance: ResourceBase = None, approved: bool = False, published: bool = False) -> bool:
        pass

    @abstractmethod
    def set_thumbnail(self, uuid: str, /, instance: ResourceBase = None, overwrite: bool = True, check_bbox: bool = True) -> bool:
        pass
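To illustrate the intended calling contract, here is a deliberately simplified in-memory stand-in (not the real implementation, which would persist Django models and delegate to the backend GIS). The point is that client code always goes through the manager's methods rather than touching ResourceBase directly:

```python
class InMemoryResourceManager:
    """Toy stand-in mirroring the shape of the ResourceManagerInterface
    contract; all names and return values follow the interface above,
    the dict-based storage is purely illustrative."""

    def __init__(self):
        self._resources = {}

    def create(self, uuid, resource_type=None, defaults=None):
        resource = {"uuid": uuid, "resource_type": resource_type, **(defaults or {})}
        self._resources[uuid] = resource
        return resource

    def exists(self, uuid, instance=None):
        return uuid in self._resources

    def update(self, uuid, vals=None, **kwargs):
        self._resources[uuid].update(vals or {})
        return self._resources[uuid]

    def delete(self, uuid, instance=None):
        # Mirrors the proposed contract: 1 on success, 0 otherwise
        return 1 if self._resources.pop(uuid, None) is not None else 0


# Client code never instantiates a resource directly:
rm = InMemoryResourceManager()
rm.create("abc-123", resource_type="dataset", defaults={"title": "Roads"})
rm.update("abc-123", vals={"title": "Roads (2021)"})
```

Because every mutation funnels through these methods, a concrete manager can hook the backend GIS synchronization into each of them without callers knowing.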

The concrete resource manager will deal with the backend GIS through the factory instance, taking care of performing the generic logic against the ResourceBase and later delegating to the backend GIS to finalize it.

For instance, at __init__ time the resource manager will instantiate the pluggable concrete manager through the factory pattern, as shown below:

import importlib


class ResourceManager(ResourceManagerInterface):

    def __init__(self):
        self._concrete_resource_manager = self._get_concrete_manager()

    def _get_concrete_manager(self):
        # rm_settings exposes RESOURCE_MANAGER_CONCRETE_CLASS, the dotted
        # path of the concrete manager class, read from the Django settings
        module_name, class_name = rm_settings.RESOURCE_MANAGER_CONCRETE_CLASS.rsplit(".", 1)
        module = importlib.import_module(module_name)
        class_ = getattr(module, class_name)
        return class_()

    ...
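The dotted-path resolution above is plain importlib machinery. As a standalone sketch (the RESOURCE_MANAGER_CONCRETE_CLASS value in the comment is a hypothetical example, not the actual GeoNode setting value), the same rsplit / import_module / getattr sequence can be exercised against any importable class:

```python
import importlib


def get_concrete_class(dotted_path):
    """Resolve a 'package.module.ClassName' string to the class object,
    mirroring the factory mechanism used by the ResourceManager."""
    module_name, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# In GeoNode the dotted path would come from the settings, e.g. (hypothetical):
#   RESOURCE_MANAGER_CONCRETE_CLASS = "geonode.geoserver.manager.GeoServerResourceManager"
# Here we resolve a stdlib class just to show the mechanism end to end:
concrete_class = get_concrete_class("collections.OrderedDict")
instance = concrete_class()  # the equivalent of class_() in the factory
```

Swapping the backend GIS then amounts to pointing the setting at a different class, with no changes to the callers.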

The implementation of an operation from the interface will take care of:

  1. Performing the generic logic against the ResourceBase
  2. Performing the specific logic against the real instance

For instance, a delete operation could be implemented as shown below:

    @transaction.atomic
    def delete(self, uuid: str, /, instance: ResourceBase = None) -> int:
        _resource = instance or ResourceManager._get_instance(uuid)
        if _resource:
            try:
                self._concrete_resource_manager.delete(uuid, instance=_resource)  # backend gis delete
                _resource.get_real_instance().delete()  # ResourceBase delete
                return 1
            except Exception as e:
                logger.exception(e)
        return 0

Uploader

The Uploader module will be fully pluggable and aware of the backend GIS.

It will rely on the Resource Manager and the Storage Manager in order to create/update the ResourceBases and raw data in GeoNode.
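As a hedged sketch of that composition (the upload helper, its signature and the defaults keys are hypothetical, meant only to show how the two managers work together), an upload step could first persist the raw data and then register the resource:

```python
import uuid as uuid_lib


def upload(storage_manager, resource_manager, filename, content, resource_type=None):
    """Persist the raw data through the Storage Manager, then register the
    ResourceBase through the Resource Manager (never by instantiating
    models directly)."""
    stored_name = storage_manager.save(filename, content)
    new_uuid = str(uuid_lib.uuid4())
    return resource_manager.create(
        new_uuid,
        resource_type=resource_type,
        defaults={"files": [stored_name], "title": filename},
    )
```

The uploader itself stays backend-agnostic: both the storage and the backend GIS behavior come from whichever concrete managers the settings select.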

Harvester

Similarly to the Uploader, the Harvester will rely on the Resource Manager and the Storage Manager in order to create/update the ResourceBases and raw data in GeoNode.

We envisage a complete refactoring of the Harvester, improving the way it synchronizes and fetches resources from the remote service. This will be detailed in another GNIP.

Async States

The Resource Manager will also manage the different STATES of the ResourceBase.

This is quite an important concept. As part of this work, we also envisage moving the workflow/processing state into the ResourceBase. This will allow us to ensure that we never expose a not fully configured ResourceBase to users.

As long as every component respects the architectural contract, we can also be sure that the ResourceBases will always be in a consistent state.
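As a minimal sketch of the idea (the state names and helper below are assumptions for illustration, not the final GeoNode 4.x values), the workflow/processing state could be modeled as a small enum, with only fully processed resources ever exposed to users:

```python
import enum


class ResourceState(enum.Enum):
    """Illustrative processing states for a ResourceBase."""
    PENDING = "PENDING"      # creation started, backend GIS not yet configured
    PROCESSED = "PROCESSED"  # fully configured and safe to expose
    INVALID = "INVALID"      # something went wrong during processing


def is_exposable(state: ResourceState) -> bool:
    # A component respecting the contract only shows PROCESSED resources
    return state is ResourceState.PROCESSED
```

Storing the state on the ResourceBase itself lets views and APIs filter out half-configured resources with a single query condition instead of ad-hoc checks.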

Backwards Compatibility

This work won't be backward compatible with 3.2.x and 3.3.x versions.

However, since we won't break the ResourceBase model, it will be possible to easily upgrade older GeoNode versions to the new one.

Future evolution

This is a preliminary work that prepares the core to go towards a fully pluggable and headless GeoNode middleware.

Through a real separation of concerns, the code will be much more maintainable and extensible, allowing developers to easily create their own backend GIS and storage manager modules.

Feedback

Update this section with relevant feedback, if any.

Voting

Project Steering Committee:

Links

Remove unused links below.

francbartoli commented 3 years ago

Hi @afabiani, this is a very interesting proposal and I'm happy to see that GeoNode 4.0 is moving forward.

It would be helpful to see what rm_settings.RESOURCE_MANAGER_CONCRETE_CLASS looks like. Do you have an example?

Another question that comes up is whether by async in the diagram you are referring to the async support in Django 3.2. In case you are not, I'm wondering whether the implementation you have in mind can leave the door open for that once the backend GIS evolves in that direction.

afabiani commented 3 years ago

@francbartoli please take a look at the proposed implementation here:

Resource Manager

  1. Resource Manager Interface
  2. Default GeoNode Resource Manager
  3. GeoServer Concrete Resource Manager
  4. Resource Manager Settings (RESOURCE_MANAGER_CONCRETE_CLASS)

Storage Manager

  1. Storage Manager Interface
  2. Default Storage Manager
  3. Storage Manager Settings (aka STORAGE_MANAGER_CONCRETE_CLASS)
  4. Other Storage Manager Implementations (e.g. AWS)

GeoServer Async Uploader Refactoring

  1. Uploader Flow Diagram
  2. Replacing Signals with the RM calls

Hope this helps. Feel free to comment and review/test the DRAFT PR at your convenience.

gannebamm commented 3 years ago

@afabiani Thanks for the proposal. I am not done with reviewing it since it is complex and should be reviewed thoroughly.

@francbartoli You can take a look at the files in the PR draft here: https://github.com/GeoNode/geonode/pull/7670

manager.py for RessourceManager: https://github.com/GeoNode/geonode/pull/7670/files?file-filters%5B%5D=.py#diff-64a2aaa5de6e1fea66f83b8a29d7148d0fd47814c00dea7b5460de75b427baa6

settings.py which does provide rm_settings.RESOURCE_MANAGER_CONCRETE_CLASS: https://github.com/GeoNode/geonode/pull/7670/files?file-filters%5B%5D=.py#diff-19de244d5f733a4963ccc56a29fb73fb485f2678ea4ba2384b9fcc4c5b5e8af5

francbartoli commented 3 years ago

Thanks @gannebamm, I see. I have to deep dive into this since it is huge.

A couple of points of feedback, @afabiani:

afabiani commented 3 years ago

@francbartoli :

  1. The Harvester will be pushed in a new PR; @ricardogsilva is already working on that. For the Uploader, it would be better to keep it in this one in order to avoid having something very unstable. In the end, the Uploader changes are very few. What do you think?
  2. Of course.
francbartoli commented 3 years ago

Thanks @afabiani. For the first point: whatever you think is better for the stability and testability of the PR of course.

gannebamm commented 3 years ago

If I understand the concept properly, it should be possible to use a MinIO S3-compatible storage as the concrete storage manager, right? See: https://github.com/gannebamm/geonode/tree/storage_test

https://github.com/gannebamm/geonode/blob/storage_test/.env#L9-L14

and https://github.com/gannebamm/geonode/blob/storage_test/docker-compose-minio-storage.yml#L137-L152

It does not work currently because some of the commits in https://github.com/gannebamm/geonode/commit/4e66e6633978a3ceb784c307e06dcaf9be7f5d26 altered utils.py, which raises an import error like:

raised: ImportError("cannot import name 'sync_resources_with_guardian' from 'geonode.security.utils' (/usr/src/geonode/geonode/security/utils.py)")
...
celery4geonode   |   File "/usr/src/geonode/geonode/security/tasks.py", line 23, in <module>
celery4geonode   |     from .utils import sync_resources_with_guardian
celery4geonode   | ImportError: cannot import name 'sync_resources_with_guardian' from 'geonode.security.utils' (/usr/src/geonode/geonode/security/utils.py)
afabiani commented 3 years ago

sync_resources_with_guardian has been moved under the package geonode.geoserver.security

gannebamm commented 3 years ago

sync_resources_with_guardian has been moved under the package geonode.geoserver.security

Yeah, I have seen your commits and have merged them. But is the general idea correct? Instantiate the concrete storage via the .env file, using e.g. a MinIO S3 storage?

I would like to keep this GNIP discussion on the architectural level. Errors and changes to the PR itself shall be discussed there.

t-book commented 3 years ago

My +1 for this code cleanup. I also welcome @francbartoli's suggestion to review the PR with "more than two eyes".

@afabiani Can you say what a migration from pre-4.0 versions will look like? Will it, for example, work as a Django migration, or are other steps needed?

afabiani commented 3 years ago

@t-book the migrations will be fully automatic, with the old data ported to the new attributes through several RunSQL statements.

@mattiagiupponi is currently working on the latest ones.

gannebamm commented 3 years ago

I also like the code cleanup and the idea of pluggable stores. I would like to run a working example with a non-file-system-based store, e.g. a MinIO S3 storage, just as a proof of concept. I will try to work on that and see how easily pluggable the proposed system is. I am sorry, but that will take some time (maybe middle/end of next week). My work will be visible here: https://github.com/gannebamm/geonode/tree/storage_test

gannebamm commented 3 years ago

I got it working for documents in my repo: https://github.com/gannebamm/geonode/commit/e97e943f0f66ee03656ecc6a6a47852853426ad6

You will have to create the default bucket 'geonode' manually before uploading a document. The MinIO admin is published on localhost:9000. It seems to work OK so far. Uploading a shapefile works, too.

gannebamm commented 3 years ago

I have done the last test and was able to load the statics into a MinIO S3 storage. To actually load the frontend from S3, the nginx image needs to be edited to load from the S3 MinIO container. This is just configuration and not a bug. I haven't tested a different Resource Manager (non-GeoServer); testing that would require more work than I am able to put into this PR. So far I am very happy with the codebase.

giohappy commented 3 years ago

A PSC call about this GNIP and the new client was held last Friday. Below I summarize the outcomes.

The concern raised by @t-book about the complexity of working with React / MapStore to implement custom views and pages has been discussed thoroughly with the frontend team. In the end, we decided to slightly change the initial plan, where the React client also incorporated the legacy pages. The new client SPA will be hosted inside an HTML shell provided by Django templates, which will provide a set of primary blocks. The new theme (less, css) and the primary client configuration (menus, etc.) will be provided by the new client, since we want a single source of truth to be shared between the client and any custom / legacy page. The client will be responsible for providing this common ground to any other HTML page that will be built on top of the base template. Legacy pages and custom pages will be able to override any block from the base template. A container block will serve as the body for the page, where the new client will inflate its React-based SPA.

We think this is the sweet spot: the new client will still be designed as an SPA, and will incorporate more and more functionality from the legacy UI. However, anybody will be able to write and host custom pages with plain Django templates, and no hacks in between. We will also gain some performance, because any above-the-fold content will be rendered quickly without having to load the complete client bundle.

This new approach will be applied initially to the integration of the legacy pages, but it will also be ported to the new home and editor pages afterwards. This change will probably be merged to the master branch during the next couple of weeks.