[SIP-92] Proposal for restructuring the Python code base

john-bodley commented 2 years ago

[SIP-92] Proposal for restructuring the Python code base

Motivation

Superset has evolved somewhat organically over time which is reflected—somewhat apparently—in the how the Python code—which resides solely in the top level superset folder—is organized. Initially Superset used a Model View Controlled (MVC) pattern combined with the notion of database connectors whereas now we've adopted the Data Access Object (DAO) pattern (SIP-35), which when coupled with commands and the API, helps to decouple the business layer from the persistence layer.

Due to partial refactors and years of creep the code organization is fragmented. This has negatively impacted both the code quality and developer experience.

Proposed Change

The TL;DR is the Superset application is primarily composed on a few major functional components:

APIs: v1 RESTful API (current) and API view endpoints (legacy)
CLI: Suite of command line tools
Commands: Used by both the APIs and the CLI
DAOs: Used by the commands to interface with the SQLAlchemy models
Models: Thin layer reflecting SQLAlchemy's declarative mapping and mixins
SQL Engine: Comprising of the engine specs, templating, pre-/post-processing, execution, etc.
Tasks: Asynchronous tasks/schedules
Views: Non-API endpoints, i.e., rendering HTML templates

The proposed change would be to refactor the code into more functional rather than business top level folders, which has become somewhat bloated and results in a fragmented core—something which has plagued Superset since its inception with its feature driven mindset at the cost of architectural integrity.

Below is the before/after enumeration of the current (as of 10/28/2022) top-level folders and files, where "N/A" denotes that the folder/files will no longer exist in its current form. The additional sub-sections outline more specifics.

Current	Proposed	Notes
advanced_data_type	N/A	See APIs, Commands, DAOs, and Models
annotation_layers	N/A	See APIs, Commands, DAOs, and Models
async_events	N/A	See APIs, Commands, DAOs, and Models
available_domains	N/A	See APIs, Commands, DAOs, and Models
cachekeys	N/A	See APIs, Commands, DAOs, and Models
charts	N/A	See APIs, Commands, DAOs, and Models
cli	cli	Unchanged
columns	N/A	See APIs, Commands, DAOs, and Models
commands	commands	See APIs, Commands, DAOs, and Models
common	N/A	See SQL Engine
connectors	N/A	See APIs, Commands, DAOs, and Models
css_templates	N/A	See APIs, Commands, DAOs, and Models
dao	daos	See APIs, Commands, DAOs, and Models
dashboards	N/A	See APIs, Commands, DAOs, and Models
databases	N/A	See APIs, Commands, DAOs, and Models
datasets	N/A	See APIs, Commands, DAOs, and Models
datasource	N/A	Unclear why we have both datasets and datasource DAOs
db_engine_specs	engine/specs	See SQL Engine
db_engines	N/A	Unused. See https://github.com/apache/superset/pull/20631
embedded	N/A	See APIs, Commands, DAOs, and Models
embedded_dashboard	N/A	See APIs, Commands, DAOs, and Models
examples	examples	Unchanged
explore	N/A	See APIs, Commands, DAOs, and Models
extensions	extensions	Unchanged
importexport	N/A	See APIs, Commands, DAOs, and Models
initialization	blueprints	See APIs, Commands, DAOs, and Models
key_value	N/A	See APIs, Commands, DAOs, and Models
migrations	migrations	Unchanged
models	N/A	See APIs, Commands, DAOs, and Models
queries	N/A	See APIs, Commands, DAOs, and Models
reports	N/A	See APIs, Commands, DAOs, and Models
security	security	See APIs, Commands, DAOs, and Models
sql_validators	engine/query/validators	See SQL Engine
sqllab	N/A	See APIs, Commands, DAOs, and Models
tables	N/A	See APIs, Commands, DAOs, and Models
tags	N/A	See APIs, Commands, DAOs, and Models
tasks	tasks	Unchanged
templates	blueprints/templates	See APIs, Commands, DAOs, and Models
temporary_cache	N/A	See APIs, Commands, DAOs, and Models
translations	translations	Unchanged
utils	?	Unclear
views	N/A	See APIs, Commands, DAOs, and Models
__init__.py	__init__.py	Existing logic migrated elsewhere
app.py	app.py	Unchanged
config.py	config.py	Unchanged
constants.py	N/A	See APIs, Commands, DAOs, and Modelsand SQL Engine
dataframe.py	N/A	See SQL Engine
errors.py	N/A	See APIs, Commands, DAOs, and Models
exceptions.py	N/A	See APIs, Commands, DAOs, and Models
jinja_context.py	N/A	Split into components
result_set.py	N/A	See SQL Engine
schemas.py	blueprints/api/base.py	See APIs, Commands, DAOs, and Models
sql_lab.py	N/A	See SQL Engine
sql_parse.py	?	See SQL Engine
stats_logger.py	?
superset_typing.py	N/A	Split into components
viz.py	?	Legacy visualization types

APIs, Commands, DAOs, and Models

APIs historically were mostly defined in an ad-hoc manner, i.e., in a non-RESTful way, as views which mostly reside in within the superset/views/ folder. These "legacy" APIs now coexist alongside RESTful APIs which leverage the DAO model which reside in the component specific folder, i.e., superset/datasets/. Furthermore commands are either defined within the superset/commands/ folder or component specific folder, i.e., superset/datasets/commands.

As a developer it isn't overly apparent where an API endpoint resides. The proposed solution is move to a directory structure which more clearly illustrates that the APIs, DAOs, and commands are decoupled (as illustrated below for the datasets components). The API—which leverages blueprints—is comprised both of the v1 RESTful API (current) and the legacy API. This demarcation also helps developers identify which API endpoints need to be migrated to v1.

superset/
├─ blueprints/
│  ├─ api/
|  |  ├─ legacy/         # Previously defined in superset/connectors/*/views.py, superset/views/*, etc.
|  |  ├─ v1/
|  |  |  ├─ datasets.py  # Previously superset/datasets/api.py
|  |  |  ├─ ...
│  ├─ views/             # Previously defined in superset/connectors/*/views.py, superset/views/*, etc.
├─ commands/
│  ├─ datasets/          # Previously superset/datasets/commands
|  |  ├─ ...
│  ├─ base.py
|  ├─ ...
├─ daos/
│  ├─ datasets/          # Previously superset/datasets/
|  |  ├─ ...
│  ├─ base.py            # Previously superset/dao/base.py
│  ├─ exceptions.py      # Previously superset/dao/exceptions.py
|  ├─ ...
├─ models/            # Previously superset/connectors/*/models.py, superset/datasets/models.py, etc.
|  ├─ base.py
|  ├─ ...

Note currently views are a combination of legacy API endpoints and non-API endpoints. The concept of views will remain but should only contain non-API endpoints, i.e., rendering HTML templates.

SQL Engine

Though not as well flushed out as APIs, commands, and DAOs, the actual SQL engine—responsible for preparing, executing, fetching result sets—would be colocated within the broad superset/engine/ subfolder. This would comprise of the engine specifications, templating, SQL parsing, query objects, etc.

superset/
├─ engine/
│  ├─ specs/
│  | ├─ base.py          # Previously superset/db_engine_specs/base.py
|  | ├─ ...
|  ├─ query/
|  |  ├─ validators/
|  |  |  ├─ base.py      # Previously sql_validators/base.py
|  |  |  ├─ ...
│  |  ├─ context.py      # Previously common/query_context.py
|  |  ├─ executor.py     # Previously sql_lab.py et al.
|  |  ├─ results.py      # Previously result_set.py et al.
|  |  ├─ ...
|  ├─ ...

New or Changed Public Interfaces

N/A.

New dependencies

N/A.

Migration Plan and Compatibility

The code restructure can be piecemeal. The general steps are:

Flush out the base APIs, commands, DAOs, and models
Migrate—piece by piece—each of the functional components into the appropriate sub-folders.

Rejected Alternatives

Regarding the engine directory structure I'm unsure whether the current proposal makes the most sense. An alternative could be based more on the flow/path of a query from pre-processing (including construction and SQL parsing), to execution, then fetching, and finally post-processing of the result set.

I presume that calling out the engine as a first class entity rather than treating it as a command likely makes sense. I think this is open for debate/discussion.

ktmud commented 2 years ago

I like this! Changing superset/{datasets,dashboard...}/{api,command,dao} to superset/{api,commands,daos}/{dataset,dashboard,...} would at least reduce the amount of top-level folders and make it easier to navigate for people who are just working on the APIs. And more often than not, you'd probably work on APIs utilizing multiple models rather than working on the models AND the API of a single entity type at the same time.

I also like that the core SQL engine gets its own folder. It's long overdue.

One thing I'm not sure about is can we make it faster to find all models that correspond to a db table? I assume they will all be moved to dao/..., but is the nesting under dao/{entity_type}/models.py too deep still?

Another thing that has been bothering me is---there seems to be too many boilerplates and parameters passing around between the API, the commands and the DAO. Also, what is the boundary between a command and a DAO method anyway? Currently it seems a very thick validation layer can be added to either the DAO or a command.

So I wonder if we can further simplify things down to something like this:

superset
├── api 
│   ├── legacy
│   └── v1
├── views
│   └── legacy 
│       ├── core.py
│       └── ...
├── tasks
│   └── ...
├── commands
│   ├── datasets
│   ├── dashboards
│   │   ├── create_dashboard.py
│   │   ├── get_dashboard_charts.py
│   │   └── ...
│   ├── reports
│   └── ...
├── engine
│   ├── specs
│   ├── query
│   │   ├── builder
│   │   ├── executor
│   │   └── results
├── models
│   ├── base.py
│   ├── chart.py
│   ├── dashboard.py
│   ├── dataset.py
│   ├── query.py
│   ├── report.py
│   └── ...
└── utils

A couple of things to note:

Input validation that does not require pulling data from db (e.g.) would stay with the API---ideally done by the API framework directly via custom validators (using Marshmallow or any other validation library).
commands will include any data operations (query or mutation) that require more than one model. Classes in models should try their best to not import any other models unless the dependency is strictly one directional (i.e., it's a 1-many relationship instead of many-to-many, e.g., dataset -> columns).
Validations before data writes should happen privately within the data writer. There should be no need to export the validation methods---so we can avoid passing parameters around for too many layers.

john-bodley commented 2 years ago

@ktmud thanks the the comments. I believe views—which are part of the MVC model; alongside models—are likely deemed legacy, i.e., should be migrated to RESTful API endpoints, and thus likely shouldn't be explicitly called out.

Regarding models, I'm not overly versed with the DAO pattern, though per this post DAOs and models could/should coexist and thus having a top-level superset/models/ directory seems viable. The one aspect I'm not entirely sure about is whether the existing model definitions are actually DAO-esque in nature and would need to be refactored. As an example the Dashboard model contains numerous properties/methods which reach beyond what the original intent of a Flask-SQLAlchemy model is.

ktmud commented 2 years ago

The reason I wanted to keep a views folder is exactly to call-out the legacy code so they can be more proactively cleaned up. The views folder can be removed once everything is migrated to client-side rendering.

The one aspect I'm not entirely sure about is whether the existing model definitions are actually DAO-esque in nature and would need to be refactored

Yes, I was thinking of the same thing. models should just be a thin layer of SQLAlchemy's declaritive mapping + some shared classes & mixins. We can have an explicit DAO layer, or, just stay simple and put more things in "commands" instead.

john-bodley commented 2 years ago

@ktmud I've updated the original description which explicitly calls out the models. Regarding views—per here—I alway perceived these as solely defining the legacy API as they merely contain routes hence why they're grouped under the superset/blueprints/api/legacy folder.

john-bodley commented 1 year ago

@ktmud I spoke with @hughhhh and @rusackas briefly about this and they were in agreement with you about keeping the concept of views and thus I've updated the directory schematic.

As it currently stands views contain both API endpoints as well as non-API endpoints, i.e., those which render HTML templates, and thus the non-API endpoints would be housed under the superset/blueprints/views subfolder.

ktmud commented 1 year ago

Cool. Thanks for the update!

michael-s-molina commented 1 year ago

Thanks for the SIP @john-bodley. I went through a similar process when writing SIP-61 - Improve the organization of front-end folders. When researching best practices for organizing projects, the concept of a feature-based organization appeared in many articles that described how large codebases were organized. You'll find many of the reasoning behind this model in SIP-61 and its references. The basic idea is that files related to a feature should belong together in a structured way. This allows you to easily switch between feature implementations, facilitates the use of feature flags, and also promotes better-defined dependencies. Maybe we can apply some of the concepts here too 😉

craig-rueda commented 1 year ago

Thanks for this, @john-bodley! I +1 to @ktmud 's structure. I have noticed recently that models are mixed all around the code and should be considered top-level. Ideally they would be somewhat isolated and should not depend directly on layers above, so this change would be great. I'd be glad to help out with the refactor :)

My view of what each layer "does" (open to discussion):

MVC Views / API
...
...
Commands
...
...
DAO
...
...
Models

MVC Views / API

Responsible for handling interactions with external actors (REST, views, etc)
Should perform validation from an API point of view
Translates errors emitted from mid-tier

Commands

Where the business logic lives!
Performs its own validations and potentially raises when things don't look right
Might chain to other commands
Interfaces with DAO's in order to query the DB

DAOs

Interfaces with the DB (via SQLA in our case)
Exposes low-level methods like get_by_id(), etc.
SHOULD NOT CONTAIN BUSINESS LOGIC (keeps things reusable and non-crosscutting)
Centralizes query logic, making mid-tier easy to test

Models

Basically py dataclasses :)
Utilized by upper tiers to represent persisted data
Note that quite a lot of "business" logic has crept into this layer (see: https://github.com/apache/superset/blob/master/superset/models/core.py#L379)

rusackas commented 1 year ago

Closing the issue since the vote has passed. @john-bodley please move the issue to the "Implemented/Done" column when you think that's appropriate :)

rusackas commented 9 months ago

@john-bodley is there a way to provide linting rules around this so we can (a) enforce it, and (b) track the tech debt here?

john-bodley commented 9 months ago

@rusackas I'm not sure, though I have started working with custom Pylint checkers. I suspect the "enforcement" likely happens implicitly with developers following the new patterns. That said the state of our current backend code is rather troubling as we have things strewn all over the place.

rusackas commented 1 month ago

Any updates from anyone following this on implementation status/plans?

apache / superset