Closed john-bodley closed 1 year ago
I like this! Changing superset/{datasets,dashboard...}/{api,command,dao}
to superset/{api,commands,daos}/{dataset,dashboard,...}
would at least reduce the amount of top-level folders and make it easier to navigate for people who are just working on the APIs. And more often than not, you'd probably work on APIs utilizing multiple models rather than working on the models AND the API of a single entity type at the same time.
I also like that the core SQL engine gets its own folder. It's long overdue.
One thing I'm not sure about is can we make it faster to find all models that correspond to a db table? I assume they will all be moved to dao/...
, but is the nesting under dao/{entity_type}/models.py
too deep still?
Another thing that has been bothering me is---there seems to be too many boilerplates and parameters passing around between the API, the commands and the DAO. Also, what is the boundary between a command and a DAO method anyway? Currently it seems a very thick validation layer can be added to either the DAO or a command.
So I wonder if we can further simplify things down to something like this:
superset
├── api
│ ├── legacy
│ └── v1
├── views
│ └── legacy
│ ├── core.py
│ └── ...
├── tasks
│ └── ...
├── commands
│ ├── datasets
│ ├── dashboards
│ │ ├── create_dashboard.py
│ │ ├── get_dashboard_charts.py
│ │ └── ...
│ ├── reports
│ └── ...
├── engine
│ ├── specs
│ ├── query
│ │ ├── builder
│ │ ├── executor
│ │ └── results
├── models
│ ├── base.py
│ ├── chart.py
│ ├── dashboard.py
│ ├── dataset.py
│ ├── query.py
│ ├── report.py
│ └── ...
└── utils
A couple of things to note:
commands
will include any data operations (query or mutation) that require more than one model. Classes in models
should try their best to not import any other models unless the dependency is strictly one directional (i.e., it's a 1-many relationship instead of many-to-many, e.g., dataset -> columns).@ktmud thanks the the comments. I believe views—which are part of the MVC model; alongside models—are likely deemed legacy, i.e., should be migrated to RESTful API endpoints, and thus likely shouldn't be explicitly called out.
Regarding models, I'm not overly versed with the DAO pattern, though per this post DAOs and models could/should coexist and thus having a top-level superset/models/
directory seems viable. The one aspect I'm not entirely sure about is whether the existing model definitions are actually DAO-esque in nature and would need to be refactored. As an example the Dashboard model contains numerous properties/methods which reach beyond what the original intent of a Flask-SQLAlchemy model is.
The reason I wanted to keep a views
folder is exactly to call-out the legacy code so they can be more proactively cleaned up. The views
folder can be removed once everything is migrated to client-side rendering.
The one aspect I'm not entirely sure about is whether the existing model definitions are actually DAO-esque in nature and would need to be refactored
Yes, I was thinking of the same thing. models
should just be a thin layer of SQLAlchemy's declaritive mapping + some shared classes & mixins. We can have an explicit DAO layer, or, just stay simple and put more things in "commands" instead.
@ktmud I've updated the original description which explicitly calls out the models. Regarding views—per here—I alway perceived these as solely defining the legacy API as they merely contain routes hence why they're grouped under the superset/blueprints/api/legacy
folder.
@ktmud I spoke with @hughhhh and @rusackas briefly about this and they were in agreement with you about keeping the concept of views and thus I've updated the directory schematic.
As it currently stands views contain both API endpoints as well as non-API endpoints, i.e., those which render HTML templates, and thus the non-API endpoints would be housed under the superset/blueprints/views
subfolder.
Cool. Thanks for the update!
Thanks for the SIP @john-bodley. I went through a similar process when writing SIP-61 - Improve the organization of front-end folders. When researching best practices for organizing projects, the concept of a feature-based organization appeared in many articles that described how large codebases were organized. You'll find many of the reasoning behind this model in SIP-61 and its references. The basic idea is that files related to a feature should belong together in a structured way. This allows you to easily switch between feature implementations, facilitates the use of feature flags, and also promotes better-defined dependencies. Maybe we can apply some of the concepts here too 😉
Thanks for this, @john-bodley! I +1 to @ktmud 's structure. I have noticed recently that models are mixed all around the code and should be considered top-level. Ideally they would be somewhat isolated and should not depend directly on layers above, so this change would be great. I'd be glad to help out with the refactor :)
My view of what each layer "does" (open to discussion):
MVC Views / API
...
...
Commands
...
...
DAO
...
...
Models
get_by_id()
, etc.py
dataclasses :)Closing the issue since the vote has passed. @john-bodley please move the issue to the "Implemented/Done" column when you think that's appropriate :)
@john-bodley is there a way to provide linting rules around this so we can (a) enforce it, and (b) track the tech debt here?
@rusackas I'm not sure, though I have started working with custom Pylint checkers. I suspect the "enforcement" likely happens implicitly with developers following the new patterns. That said the state of our current backend code is rather troubling as we have things strewn all over the place.
Any updates from anyone following this on implementation status/plans?
[SIP-92] Proposal for restructuring the Python code base
Motivation
Superset has evolved somewhat organically over time which is reflected—somewhat apparently—in the how the Python code—which resides solely in the top level
superset
folder—is organized. Initially Superset used a Model View Controlled (MVC) pattern combined with the notion of database connectors whereas now we've adopted the Data Access Object (DAO) pattern (SIP-35), which when coupled with commands and the API, helps to decouple the business layer from the persistence layer.Due to partial refactors and years of creep the code organization is fragmented. This has negatively impacted both the code quality and developer experience.
Proposed Change
The TL;DR is the Superset application is primarily composed on a few major functional components:
The proposed change would be to refactor the code into more functional rather than business top level folders, which has become somewhat bloated and results in a fragmented core—something which has plagued Superset since its inception with its feature driven mindset at the cost of architectural integrity.
Below is the before/after enumeration of the current (as of 10/28/2022) top-level folders and files, where "N/A" denotes that the folder/files will no longer exist in its current form. The additional sub-sections outline more specifics.
APIs, Commands, DAOs, and Models
APIs historically were mostly defined in an ad-hoc manner, i.e., in a non-RESTful way, as views which mostly reside in within the
superset/views/
folder. These "legacy" APIs now coexist alongside RESTful APIs which leverage the DAO model which reside in the component specific folder, i.e.,superset/datasets/
. Furthermore commands are either defined within thesuperset/commands/
folder or component specific folder, i.e.,superset/datasets/commands
.As a developer it isn't overly apparent where an API endpoint resides. The proposed solution is move to a directory structure which more clearly illustrates that the APIs, DAOs, and commands are decoupled (as illustrated below for the datasets components). The API—which leverages blueprints—is comprised both of the v1 RESTful API (current) and the legacy API. This demarcation also helps developers identify which API endpoints need to be migrated to v1.
Note currently views are a combination of legacy API endpoints and non-API endpoints. The concept of views will remain but should only contain non-API endpoints, i.e., rendering HTML templates.
SQL Engine
Though not as well flushed out as APIs, commands, and DAOs, the actual SQL engine—responsible for preparing, executing, fetching result sets—would be colocated within the broad
superset/engine/
subfolder. This would comprise of the engine specifications, templating, SQL parsing, query objects, etc.New or Changed Public Interfaces
N/A.
New dependencies
N/A.
Migration Plan and Compatibility
The code restructure can be piecemeal. The general steps are:
Rejected Alternatives
Regarding the engine directory structure I'm unsure whether the current proposal makes the most sense. An alternative could be based more on the flow/path of a query from pre-processing (including construction and SQL parsing), to execution, then fetching, and finally post-processing of the result set.
I presume that calling out the engine as a first class entity rather than treating it as a command likely makes sense. I think this is open for debate/discussion.