[SIP-35] Proposal for Improving Superset’s Python Code Organization

willbarrett commented 4 years ago

[SIP] Proposal for Improving Superset’s Python Code Organization

Motivation

As I was in the weeds fixing a bunch of Pylint failures, Max and I started going back and forth on this PR: https://github.com/apache/incubator-superset/pull/8777, which we ultimately closed. The root cause of that was a lack of shared understanding on the best code structure for Superset going forward. Talking with others at Preset, I realized that the issue was larger than just a new contributor not understanding project practices. Without a shared understanding, we are lacking a cohesive approach towards refactoring Superset - we need a technical North Star for project structure.

Preset’s developers met and identified a number of pain points in the existing code base. Here they are, with a bit of color to make the meaning clear:

Lack of separation of concerns
- There are a number of modules that have grown organically over time to truly massive size. Often these modules contain utility functions, classes of multiple types, and a broad array of other functionality. These modules are very hard to work with.
- Business logic is frequently intermixed with display logic, making code reuse more difficult.
- The model layer contains a large amount of business logic which has nothing to do with data storage.
Dependency Confusion
- Circular dependency issues crop up regularly with the existing module design
- Many modules have implicit dependencies on other modules which requires a specific load order
- Lack of Composability
- There’s a substantial amount of app-specific code in shared modules
- Views are currently loaded in all contexts, which makes light-weight command line usage (for example) impossible
Deep Class Inheritance
- In some instances, particularly where there is pre-existing inheritance in Flask-AppBuilder, the inheritance chain can be over 10 classes deep. This can make understanding some classes very challenging.

Proposed Change

In order to address these concerns, we’d like to propose a few guiding principles for Python code structure going forward:

Prefer small, functionally-homogeneous modules with only classes and pure functions in module scope
Separate business logic from display logic
- Keep the view layer lightweight and delegate to business logic objects
- To support this work, we’d like to recommend a Command pattern - more on that below
Prefer thin ORM objects
- To support this work, we’d like to recommend introducing Data Access Objects or DAOs to Superset to hold complicated query logic. More on that below.
Keep __init__.py files empty
- This will help cut down on implicit dependencies
Prefer composition over inheritance
- Collaborator objects can be simpler to understand than a deeply-nested class hierarchy.
Organize modules function and back-end objects rather than products
- Prefer:
  - Dashboard
  - Chart
  - Query
  - Database
  - Models
  - api
- Avoid:
  - SQL Lab
  - Explore

New patterns to introduce:

Command:

There are multiple patterns named Command, and the one we reference here is most similar to an Alembic migration. Commands perform a single cohesive business operation, have a very small public API, and either entirely succeed or entirely fail. In practice, this requires that database transaction behavior be controlled at the Command level. When possible, commands should be modeled such that they perform a single database transaction. In the case of failure, they raise an exception or return an error code, and it is the responsibility of the caller to translate that into a correct response for a user.

Example command:

# This is obviously not a full finished command. The point
# is to illustrate the structure of a command as
# encapsulating a business operation, and illustrate
# a likely public API.

import DashboardDAO from daos.dashboard

class MissingDashboardError(CommandError):
  pass

class UpdateFailedError(CommandError):
  pass

class InvalidDashboardError(CommandError):
  pass

class UpdateDashboardCommand < Command:
  def __init__(current_user, dashboard_id, dashboard_properties):
    self.actor = current_user
    self.dashboard_id = dashboard_id
    self.dashboard_properties = dashboard_properties

  def run():
    self.dashboard = DashboardDAO.find_dashboard(actor, dashboard_id)
    if self.dashboard IS NONE:
      raise MissingDashboardError("Dashboard could not be found")

    if !self._is_dashboard_valid():
      raise InvalidDashboardError("Dashboard parameters are invalid.")

    update_result = DashboardDAO.update(dashboard, dashboard_properties)

    if update_result != True:
      raise UpdateFailedError("Dashboard could not be updated.")

  def _is_dashboard_valid():
    # Check validity of dashboard data before save

DAO (Data Access Object):

A DAO in the context of Superset would be an object that would manage the lifecycle of SQLAlchemy models. Custom queries would be implemented in DAOs and then called from Command objects. In comparison to Command objects, DAOs have relatively broad public interfaces. DAOs should not depend on other DAOs to avoid circular dependencies. If results are needed cross-DAO, that should be orchestrated in the Command. Here’s a sample simplified DAO for illustrative purposes:

# This is not a fully finished DAO. The intention
# is to illustrate the public interface of a 
# Data Access Object, and how it might interface
# with a SQLAlchemy model.

from superset import db
import DataAccessObject from daos.base
import Dashboard from models.dashboard

class DashboardDAO(DataAccessObject):

  @staticmethod
  def find_for_update(actor, dashboard_id):
    dashboard = db.session.query(Dashboard).filter(id=dashboard_id).one_or_none()
    if dashboard
      if actor in dashboard.owners:
        return dashboard
      else
        return None
    else
      return None

  @staticmethod
  def update(dashboard, dashboard_params):
    for key, value in dashboard_params.iteritems():
      setattr(dashboard, key, value)
    try:
      db.session.merge(dashboard)
      db.session.commit()
    except SQLAlchemyError as e:
      logger.error('Failed to update dashboard: ?', e.message)
      db.session.rollback()
      return False

    return True

Proposed example package structure that follows the above principles:

superset/dashboards/api.py # Contains FAB and Flask endpoints related to the public API
superset/dashboards/schemas.py # Contains marshmallow schemas used by both the api and view endpoints
superset/dashboards/boot.py # Contains initialization and configuration code for API endpoints, invoked by `create_app()`
superset/dashboards/views.py # Contains FAB and Flask endpoints related to page-refresh views.
superset/dashboards/dao.py
superset/dashboards/commands/add_user.py
superset/dashboards/commands/create.py
superset/dashboards/helpers.py
superset/charts/api.py
superset/charts/daos.py
superset/charts/boot.py
superset/charts/schemas.py
superset/charts/commands/add_to_dashboard.py
superset/charts/commands/create.py
superset/charts/helpers.py
superset/models/chart.py # Models are thin SQLAlchemy objects - they do not implement custom queries or filters. That functionality will be found in the DAO, which manipulates the underlying SQLAlchemy model.
superset/models/dashboard.py

In this design, all systems related to a specific back-end resource have been grouped under a top-level folder. __init__.py files should be left empty to enable only pulling in the portions of the system necessary for a specific entrypoint (Celery shouldn’t need api.py or views.py for instance)

New or Changed Public Interfaces

Over time, the internals of Superset will evolve towards the new structure. Public HTTP interfaces will not be likely to change as a result of the above proposal, but code will move and alter to conform. This will impact organizations that apply their own customizations to Superset.

New dependencies

None

Migration Plan and Compatibility

Introduce refactors to existing code at a manageable pace to allow organizations relying on Superset internals time to adapt.

Rejected Alternatives

Preset discussed Service Objects as an alternative to both Commands and DAOs. We felt that Commands provided easier entrypoints for the ports of our application (API endpoints, views, command line invocations, Celery tasks) than Service Objects, and that introducing DAOs as well helped further break down concerns.

We also considered structuring top-level folders by function (api, models, etc.) but found this resulted in drastically more Python modules overall without substantially simplifying the question of where code should live.

Individuals consulted in creating this SIP

@mistercrunch @craig-rueda @dpgaspar @robdiciuccio @suddjian @nytai

craig-rueda commented 4 years ago

Looks good overall, however I would caution grouping things like models and/or daos under their domain as you have done above in favor of simply grouping them by function -- i.e. a package for daos, and one for models under a parent data package. Each child "data" package can, and should be further split based on domain. These lower-level bits invariably end up being reused and should be a little more centrally located.

willbarrett commented 4 years ago

@craig-rueda I've centralized the models in the recommendation, but have left the DAOs distributed - I don't think the DAOs will be too heavily reused outside of their domain. If the future proves me wrong, it can be refactored.

john-bodley commented 4 years ago

I agree with many issues/concerns raised in this SIP though I’m not overly familiar with commands and DAOs. Are there examples of other apps (preferably Flask) which use these constructs?

Note at Airbnb we have a number of Flask apps which follow best practices, i.e., are modular, have empty __init__.py files, use Flask blueprints, etc. which are fairly easy to understand and structured in a way that result in no circular dependencies.

I’m all for refactoring the Python codebase (leaning more on blueprints), though I’m unsure whether need to restructure the logic at this time to leverage commands and DAOs. A full restructure (as opposed to a refactor) seems like a considerable amount of work and may be somewhat overkill to address the concerns raised in this SIP.

metaperl commented 4 years ago

The root cause of that was a lack of shared understanding on the best code structure for Superset going forward

What software architecture will you choose?
What is your opinion of the Clean architecture

There are a number of modules that have grown organically over time to truly massive size. Often these modules contain utility functions, classes of multiple types, and a broad array of other functionality. These modules are very hard to work with.

How do you plan to analyze this problem? Will you create a single mother ticket and a number of child tickets for each issue?
How will you refactor all the tests in the current code base?

Business logic is frequently intermixed with display logic, making code reuse more difficult.

Do you have specific examples of this problem?
Isn't there a business reason for everything you display? how easy is it to separate the two?
do you have a code reviewer? if so, who let these problems get into the codebase in the first place?

Circular dependency issues crop up regularly with the existing module design

Would Inversion of control or dependency injection help with this? If not, what will help with it?

Many modules have implicit dependencies on other modules which requires a specific load order

This does not seem like much of a problem to me. Except that PyCharm (and other tools) run a module-reorder as part of their automatic code cleanup and this might prohibit something from loading.

There’s a substantial amount of app-specific code in shared modules

I do not understand the problem: Are you saying there is code duplication? If the purpose of a module is to do something for the app, then it stands to reason it would have app-specific code, wouldnt it? I'm confused.

In some instances, particularly where there is pre-existing inheritance in Flask-AppBuilder, the inheritance chain can be over 10 classes deep. This can make understanding some classes very challenging.

It can be challenging, but IMHO the code is well-written. I had to trace throgh 7 levels of OO inheritance here: https://github.com/apache/incubator-superset/issues/8695#issuecomment-559968158

I would be curious to see how you would improve things and whether the problems is rooted in Flask itself and FAB and Superset have no choice but to bow to their Creator.

Organize modules function and back-end objects rather than products

this phrase does not express a complete thought. I think there is some sort of grammatical error here. I have no idea what you are suggesting.

Avoid: SQL Lab, Explore

One of these is a noun/product just like the things you prefer. The other is a verb.... What is unacceptable about SQL Lab?

class DashboardDAO(DataAccessObject): def find_for_update(actor, dashboard_id):

Why is this preferred over a find_for_update in the class that actor belongs to? ditto for class DashboardDAO(DataAccessObject): def update(dashboard, dashboard_params): - this looks like a method that belongs to the Dashboard class.

DiggidyDave commented 4 years ago

The empty __init__.py files alone would provide enormous value IMO. If this SIP as a whole is approved I would suggested implementing breaking it into incremental, independently-valuable chunks. (whether it should be a single large SIP or not is a different question, perhaps... is the monolithic SIP coupling controversial suggestions with uncontroversial ones?)

willbarrett commented 4 years ago

First, thanks for your comments! I appreciate the community taking the issue seriously.

@DiggidyDave I agree that empty __init__.py files will help substantially. @dpgaspar has a PR open right now that's linked to this issue that demonstrates the patterns we're suggesting. IMHO the result is very readable and easy to work with. We hope for this SIP to be a North Star that we gradually work towards. I like your suggestion of implementing valuable bits in chunks. I'd recommend being pragmatic and attacking small enough pieces that they are comprehensible. One of the areas I'm most excited to tackle is slimming down the model layer using DAOs and commands - I think models are far too thick at the moment, but I would treat this work alone as multiple coherent changes rather than one large PR due to the potential complexity involved.

@john-bodley Superset currently has examples of both the Command pattern and a pattern similar to DAOs. Alembic's individual migration classes are an example of a command and FAB's shared filters are very similar to a DAO, in that they encapsulate shared query logic outside of the model. A DAO allows us to extract query logic from model classes to a clear destination. As we refactor, we need a target to move towards. Blueprints separate functionality well at the API layer, but will not provide a clean abstraction at the model or business logic layer without a few other patterns to lean on. I agree that fully refactoring the code is a large unit of work and expect this to be a target we move towards gradually where it makes sense. Currently the model layer is very thick and we have a lot of business logic in endpoint code. To me this is a difficult problem to solve without a shared concept of where this code should go moving forward and a pattern to leverage.

suddjian commented 4 years ago

@metaperl

who let these problems get into the codebase in the first place?

Disorganization is a result of many people independently contributing to a project, focusing on shipping functional code quickly. This has worked well for Superset, but has resulted in a fair amount of tech debt. This SIP is about putting some structure in place that can help with future efforts. Blaming committers for these issues is not helpful or appropriate.

I have no idea what you are suggesting.

The idea is that backend code should be focused on the entities being acted on, rather than the specific part of Superset that is interacting with that entity. For example, SQL Lab and Explore ("products") interact with Datasources, Queries, Users ("objects"). This is proposing that the backend should be focused on providing APIs that are object-centric (POST /datasources, GET /queries, rather than APIs that are specific to any one product within Superset.

IMHO the code is well-written. I had to trace throgh 7 levels of OO inheritance

I would argue that you should not have to delve through 7 levels of inheritance to understand what a piece of code is doing. Simpler is better.

Isn't there a business reason for everything you display? how easy is it to separate the two?

You may want to take a look at principles of Clean Architecture for some information on this.

metaperl commented 4 years ago

Blaming committers for these issues is not helpful or appropriate.

@suddjian - It certainly is not helpful or appropriate. And if you could explain why you included that comment, I would appreciate it.

suddjian commented 4 years ago

@metaperl Reviewers should help ensure that code is organized, and I think that's the idea you were going for with that bullet point. This will be especially true once the Superset community arrives at a consistent plan for code organization. Reviewers can point to the plan and request that code fit the established patterns.

But wording that idea as "who let these problems get into the codebase in the first place" reads like pointing fingers at the folks who wrote/reviewed the existing code, which is why I responded to it.

ktmud commented 4 years ago

Chime in my two cents as a passerby. I really like that different modules have their independent folders and we are introducing conventions and increasing consistency. I think an intuitive code structure and consistent paradigms help a lot for future contributors to understand the code.

Sharing the approach I took in one of my older projects (a general purpose CMS). It's slightly different than what is proposed, but very similar in concept.

My python files are organized like this:

├── __init__.py
├── app.py
├── core/   # core data models, user accounts, base data entities etc
├── ext/      # 3rd-party/non-essential extensions, s3 uploads, etc
├── lib/       # db mixins, cache decorators, utilities
├── modules/   # modules as organized in the main menu (URL route)
│   ├── __init__.py  # import and compose all modules
│   ├── module_1
│   │   ├── __init__.py
│   │   ├── admin.py   # admin views, a separate Blueprint for editing objects
│   │   ├── model.py   # operation on data entities, objects for data access
│   │   └── view.py     # User facing Blueprint for "module_1", handlers parsing parameters
│   ├── module_2
│   │   ├── __init__.py
│   │   ├── admin.py
│   │   ├── model.py
│   │   └── view.py

I find the separation of core/shared functionalities and addition of a "modules" folder especially useful because it makes the whole project easier to navigate.

Loading modules and booting the app is as simple as this.

def load_module(app, name):
    """Dynamically load modules in the `modules/` folder.
    package = 'david.modules.%s' % name
    mod = __import__(package, fromlist=['admin', 'bp', 'setup', 'view'])

    # register module blueprint and setup
    register_views(app, mod)

    if hasattr(mod, 'view'):
        register_views(app, mod.view)

    return mod

def register_views(app, *view_modules):
    """Register views, including API endpoints"""
    for v in view_modules:
        if hasattr(v, 'bp'):
            app.register_blueprint(v.bp)
        if hasattr(v, 'setup'):
            v.setup(app)

This is similar to what the boot.py file may achieve.

Hope this helps.

john-bodley commented 4 years ago

I really this we need to untangle the app from the various components. This is the partially to blame for the spaghetti code, circular dependency issues we’re currently facing.

All these modules, components, etc. need to be app agnostic and should be implemented as blueprints. If there are config or similar aspects these modules need leverage one can use the current_app proxy. The loading and registering (binding) of blueprints should occur within app.py.

dpgaspar commented 4 years ago

Adding my two cents here:

Regarding blueprints: I'm inclined on delegating the actual registering to each module manager. We can still say that they occur within app.py but the specifics are encapsulated on the module manager itself.

This way:

We would reduce the centralized complexity of app.py
Handle module specifics at the module level

For example:

- superset/
    - datasets/
        - manager.py (implements `register_views`)
        - commands/
          - <CRUD> (create.py, update.py, delete.py ...)
          - <Custom> (export.py, ...)
        - api.py
        - schemas.py
        - dao.py
    - dashboards/
      ...
    - charts/
    - databases/
    - commands/
       - base.py
         ....
    - api/
        - base.py
....

So app.py would call each module's manager.register_views() with additional pre and post hooks. This kind of pattern can open up the app for external extensions, since additional modules can be independently installed (pip) and loaded in order using config, for example:

ADDONS = [
    middleware1.Middleware1Manager,
    middleware2.Middleware2Manager,
    ...
]

Just a thought, could db_engine_specs follow a similar pattern?

The blueprints are handled by FAB's BaseApi or it's child ModelRestApi that creates a blueprint with all routes registered under the same API resource namespace, api/v1/dataset/*, api/v1/databases and OpenAPI spec tag/grouping.

willbarrett commented 4 years ago

@john-bodley I didn't address your comment directly regarding blueprints. Let me do so now:

I concur 100% with the idea of untangling the app references from the rest of the code. The problem we currently face with regards to blueprints is that:

Blueprints cannot currently be nested (see this issue on Flask)
Flask-AppBuilder (FAB) currently leverages blueprints internally.

Were we to move entirely to blueprints being defined in Superset, that would have substantial implications for the extent that we can leverage FAB in endpoints without either reworking FAB internals or adding the ability to nest blueprints to Flask. As we've been building out /api/v1, we've leveraged FAB's base classes, especially for GET endpoints.

How would you recommend reconciling this problem?

I think that @dpgaspar's suggestion for adding a register_views function to the manager.py files is a possible resolution - this would provide an API similar to blueprint functionality that could be leveraged in app.py. I'd be interested in your opinion on how close this would get us to resolving this issues you're bringing up.

john-bodley commented 4 years ago

@willbarrett maybe there's a disconnect with my understanding of FAB. Does FAB prevent us from defining all the API endpoints as blueprints and housing these under a root /api directory?

willbarrett commented 4 years ago

@john-bodley My understanding here is a little fuzzy too, so I'll let @dpgaspar correct me if I'm wrong. FAB creates blueprints under the hood whenever we use it to define endpoints. ModelView and ModelViewApi classes create collections of endpoints under a blueprint, which is then registered on the application when we call appbuilder.add_view... or a similar function. If we wanted to group those items together into an API blueprint, it would not be possible since that would result in nested blueprints.

@dpgaspar how far away from the truth am I :)?

dpgaspar commented 4 years ago

@willbarrett that's a good explanation.

@john-bodley The proposed FAB model is that a blueprint is related with a BaseApi class and the blueprint's url_prefix is defined on the route_base class attribute. By default BaseApi and ModelRestApi will compose /api/<version>/<resource_name>/ for the url_prefix of the blueprint. It's possible to overlay url_prefix has long has the endpoints don't collide.

Yet, I'm curious about the need to create a big blueprint that holds all API endpoints? Note that we aggregate all API classes under a base API class, where we can impose specific behavior/config.

villebro commented 4 years ago

I welcome this effort, and the amount of circular imports that typing has exposed really shows that something needs to be done. To get started, I would almost recommend grepping for packages with TYPE_CHECKING to see the most common circular dependencies, as those are most likely the most spaghettied pieces of code that need to be untangled. If it is possible to start by refactoring those into their new respective locations, we would be well positioned to continue refactoring the rest of the code as proposed.

apache / superset