Closed willbarrett closed 4 years ago
Looks good overall, however I would caution against grouping things like models and/or DAOs under their domain as you have done above, in favor of simply grouping them by function -- i.e. a package for daos, and one for models, under a parent data package. Each child "data" package can, and should, be further split based on domain. These lower-level bits invariably end up being reused and should be a little more centrally located.
@craig-rueda I've centralized the models in the recommendation, but have left the DAOs distributed - I don't think the DAOs will be too heavily reused outside of their domain. If the future proves me wrong, it can be refactored.
I agree with many issues/concerns raised in this SIP though I’m not overly familiar with commands and DAOs. Are there examples of other apps (preferably Flask) which use these constructs?
Note at Airbnb we have a number of Flask apps which follow best practices, i.e., are modular, have empty __init__.py files, use Flask blueprints, etc., which are fairly easy to understand and structured in a way that results in no circular dependencies.
I’m all for refactoring the Python codebase (leaning more on blueprints), though I’m unsure whether we need to restructure the logic at this time to leverage commands and DAOs. A full restructure (as opposed to a refactor) seems like a considerable amount of work and may be somewhat overkill to address the concerns raised in this SIP.
The root cause of that was a lack of shared understanding on the best code structure for Superset going forward
There are a number of modules that have grown organically over time to truly massive size. Often these modules contain utility functions, classes of multiple types, and a broad array of other functionality. These modules are very hard to work with.
Business logic is frequently intermixed with display logic, making code reuse more difficult.
Circular dependency issues crop up regularly with the existing module design
Would Inversion of control or dependency injection help with this? If not, what will help with it?
Many modules have implicit dependencies on other modules which require a specific load order
This does not seem like much of a problem to me, except that PyCharm (and other tools) run a module reorder as part of their automatic code cleanup, and this might prevent something from loading.
There’s a substantial amount of app-specific code in shared modules
I do not understand the problem: are you saying there is code duplication? If the purpose of a module is to do something for the app, then it stands to reason it would have app-specific code, wouldn't it? I'm confused.
In some instances, particularly where there is pre-existing inheritance in Flask-AppBuilder, the inheritance chain can be over 10 classes deep. This can make understanding some classes very challenging.
It can be challenging, but IMHO the code is well-written. I had to trace through 7 levels of OO inheritance here: https://github.com/apache/incubator-superset/issues/8695#issuecomment-559968158
I would be curious to see how you would improve things, and whether the problem is rooted in Flask itself and FAB and Superset have no choice but to bow to their Creator.
Organize modules function and back-end objects rather than products
this phrase does not express a complete thought. I think there is some sort of grammatical error here. I have no idea what you are suggesting.
Avoid: SQL Lab, Explore
One of these is a noun/product just like the things you prefer. The other is a verb.... What is unacceptable about SQL Lab?
class DashboardDAO(DataAccessObject): def find_for_update(actor, dashboard_id):
Why is this preferred over a find_for_update in the class that actor belongs to? Ditto for
class DashboardDAO(DataAccessObject): def update(dashboard, dashboard_params):
- this looks like a method that belongs to the Dashboard class.
The empty __init__.py files alone would provide enormous value IMO. If this SIP as a whole is approved I would suggest breaking the implementation into incremental, independently valuable chunks. (Whether it should be a single large SIP or not is a different question, perhaps... is the monolithic SIP coupling controversial suggestions with uncontroversial ones?)
First, thanks for your comments! I appreciate the community taking the issue seriously.
@DiggidyDave I agree that empty __init__.py files will help substantially. @dpgaspar has a PR open right now that's linked to this issue that demonstrates the patterns we're suggesting. IMHO the result is very readable and easy to work with. We hope for this SIP to be a North Star that we gradually work towards. I like your suggestion of implementing valuable bits in chunks. I'd recommend being pragmatic and attacking small enough pieces that they are comprehensible. One of the areas I'm most excited to tackle is slimming down the model layer using DAOs and commands - I think models are far too thick at the moment, but I would treat this work alone as multiple coherent changes rather than one large PR due to the potential complexity involved.
@john-bodley Superset currently has examples of both the Command pattern and a pattern similar to DAOs. Alembic's individual migration classes are an example of a command and FAB's shared filters are very similar to a DAO, in that they encapsulate shared query logic outside of the model. A DAO allows us to extract query logic from model classes to a clear destination. As we refactor, we need a target to move towards. Blueprints separate functionality well at the API layer, but will not provide a clean abstraction at the model or business logic layer without a few other patterns to lean on. I agree that fully refactoring the code is a large unit of work and expect this to be a target we move towards gradually where it makes sense. Currently the model layer is very thick and we have a lot of business logic in endpoint code. To me this is a difficult problem to solve without a shared concept of where this code should go moving forward and a pattern to leverage.
@metaperl
who let these problems get into the codebase in the first place?
Disorganization is a result of many people independently contributing to a project, focusing on shipping functional code quickly. This has worked well for Superset, but has resulted in a fair amount of tech debt. This SIP is about putting some structure in place that can help with future efforts. Blaming committers for these issues is not helpful or appropriate.
I have no idea what you are suggesting.
The idea is that backend code should be focused on the entities being acted on, rather than the specific part of Superset that is interacting with that entity. For example, SQL Lab and Explore ("products") interact with Datasources, Queries, Users ("objects"). This is proposing that the backend should be focused on providing APIs that are object-centric (POST /datasources, GET /queries), rather than APIs that are specific to any one product within Superset.
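To make the object-centric idea concrete, here is a minimal sketch of what such a blueprint could look like. The blueprint name, URL prefix, and handlers are hypothetical illustrations, not Superset's actual endpoints:

```python
from flask import Blueprint, jsonify

# Hypothetical object-centric blueprint: routes are named for the entity
# ("datasources"), not for the product (SQL Lab, Explore) that calls them.
datasources_bp = Blueprint(
    "datasources", __name__, url_prefix="/api/v1/datasources"
)

@datasources_bp.route("/", methods=["GET"])
def list_datasources():
    # Both SQL Lab and Explore would reuse this same endpoint.
    return jsonify(result=[])

@datasources_bp.route("/", methods=["POST"])
def create_datasource():
    return jsonify(result="created"), 201
```

Any product-specific behavior then lives in the caller, while the API stays a thin, reusable layer over the object.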
IMHO the code is well-written. I had to trace through 7 levels of OO inheritance
I would argue that you should not have to delve through 7 levels of inheritance to understand what a piece of code is doing. Simpler is better.
Isn't there a business reason for everything you display? How easy is it to separate the two?
You may want to take a look at principles of Clean Architecture for some information on this.
Blaming committers for these issues is not helpful or appropriate.
@suddjian - It certainly is not helpful or appropriate. And if you could explain why you included that comment, I would appreciate it.
@metaperl Reviewers should help ensure that code is organized, and I think that's the idea you were going for with that bullet point. This will be especially true once the Superset community arrives at a consistent plan for code organization. Reviewers can point to the plan and request that code fit the established patterns.
But wording that idea as "who let these problems get into the codebase in the first place" reads like pointing fingers at the folks who wrote/reviewed the existing code, which is why I responded to it.
Chiming in with my two cents as a passerby. I really like that different modules have their independent folders and that we are introducing conventions and increasing consistency. I think an intuitive code structure and consistent paradigms help a lot for future contributors to understand the code.
Sharing the approach I took in one of my older projects (a general purpose CMS). It's slightly different than what is proposed, but very similar in concept.
My python files are organized like this:
├── __init__.py
├── app.py
├── core/ # core data models, user accounts, base data entities etc
├── ext/ # 3rd-party/non-essential extensions, s3 uploads, etc
├── lib/ # db mixins, cache decorators, utilities
├── modules/ # modules as organized in the main menu (URL route)
│ ├── __init__.py # import and compose all modules
│ ├── module_1
│ │ ├── __init__.py
│ │ ├── admin.py # admin views, a separate Blueprint for editing objects
│ │ ├── model.py # operation on data entities, objects for data access
│ │ └── view.py # User facing Blueprint for "module_1", handlers parsing parameters
│ ├── module_2
│ │ ├── __init__.py
│ │ ├── admin.py
│ │ ├── model.py
│ │ └── view.py
I find the separation of core/shared functionalities and addition of a "modules" folder especially useful because it makes the whole project easier to navigate.
Loading modules and booting the app is as simple as this:

def load_module(app, name):
    """Dynamically load modules in the `modules/` folder."""
    package = 'david.modules.%s' % name
    mod = __import__(package, fromlist=['admin', 'bp', 'setup', 'view'])
    # register the module's own blueprint and setup hook
    register_views(app, mod)
    # if the module has a user-facing `view` submodule, register it too
    if hasattr(mod, 'view'):
        register_views(app, mod.view)
    return mod

def register_views(app, *view_modules):
    """Register views, including API endpoints"""
    for v in view_modules:
        if hasattr(v, 'bp'):
            app.register_blueprint(v.bp)
        if hasattr(v, 'setup'):
            v.setup(app)
This is similar to what the boot.py file may achieve.
Hope this helps.
I really think we need to untangle the app from the various components. This is partially to blame for the spaghetti code and circular dependency issues we’re currently facing.
All these modules, components, etc. need to be app agnostic and should be implemented as blueprints. If there are config or similar aspects these modules need to leverage, one can use the current_app proxy. The loading and registering (binding) of blueprints should occur within app.py.
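A minimal sketch of the app-agnostic shape being described here, assuming a hypothetical "reports" module (the blueprint name, config key, and init_app function are illustrative):

```python
from flask import Blueprint, current_app, jsonify

# Hypothetical app-agnostic module: it never imports the app object
# directly, so there is no import cycle back to app.py.
reports_bp = Blueprint("reports", __name__, url_prefix="/reports")

@reports_bp.route("/limit")
def report_limit():
    # Config is reached through the current_app proxy at request time,
    # not via a module-level `from superset import app`.
    return jsonify(limit=current_app.config.get("REPORT_LIMIT", 10))

def init_app(app):
    # The only place that knows about the app is this binding hook,
    # which app.py calls during startup.
    app.register_blueprint(reports_bp)
```

Because the module only sees the app through `current_app` and the `init_app` hook, load order stops mattering and the module can be imported (and tested) in isolation.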
Adding my two cents here:
Regarding blueprints: I'm inclined toward delegating the actual registering to each module manager. We can still say that registration occurs within app.py, but the specifics are encapsulated in the module manager itself. This way, app.py only needs to know about each module's manager, not its internals.
For example:
- superset/
- datasets/
- manager.py (implements `register_views`)
- commands/
- <CRUD> (create.py, update.py, delete.py ...)
- <Custom> (export.py, ...)
- api.py
- schemas.py
- dao.py
- dashboards/
...
- charts/
- databases/
- commands/
- base.py
....
- api/
- base.py
....
So app.py would call each module's manager.register_views() with additional pre and post hooks. This kind of pattern can open up the app for external extensions, since additional modules can be independently installed (pip) and loaded in order using config, for example:
ADDONS = [
middleware1.Middleware1Manager,
middleware2.Middleware2Manager,
...
]
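The loading loop being described could be sketched roughly as below. Everything here is hypothetical (the manager base class, hook names, and the dict standing in for the app object); it only illustrates the ordering of pre hook, register_views, and post hook per manager:

```python
class ModuleManager:
    """Hypothetical base class for per-module managers."""

    def pre_init(self, app):
        pass  # optional hook run before view registration

    def register_views(self, app):
        raise NotImplementedError

    def post_init(self, app):
        pass  # optional hook run after view registration


class DatasetsManager(ModuleManager):
    def register_views(self, app):
        # A real manager would register blueprints/FAB views here;
        # we record a marker in the dict stand-in instead.
        app.setdefault("views", []).append("datasets")


# Config-driven, ordered list of managers, as in the ADDONS example above.
ADDONS = [DatasetsManager]


def init_app(app):
    """What app.py would do: run each manager's hooks in config order."""
    for manager_cls in ADDONS:
        manager = manager_cls()
        manager.pre_init(app)
        manager.register_views(app)
        manager.post_init(app)
    return app
```

Since managers are plain classes resolved from config, a pip-installed extension only has to expose a manager with the same hooks to participate in startup.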
Just a thought, could db_engine_specs follow a similar pattern?
The blueprints are handled by FAB's BaseApi or its child ModelRestApi, which creates a blueprint with all routes registered under the same API resource namespace (api/v1/dataset/*, api/v1/databases) and OpenAPI spec tag/grouping.
@john-bodley I didn't address your comment directly regarding blueprints. Let me do so now:
I concur 100% with the idea of untangling the app references from the rest of the code. The problem we currently face with regard to blueprints is this: were we to move entirely to blueprints being defined in Superset, that would have substantial implications for the extent to which we can leverage FAB in endpoints, without either reworking FAB internals or adding the ability to nest blueprints to Flask. As we've been building out /api/v1, we've leveraged FAB's base classes, especially for GET endpoints.
How would you recommend reconciling this problem?
I think that @dpgaspar's suggestion of adding a register_views function to the manager.py files is a possible resolution - this would provide an API similar to blueprint functionality that could be leveraged in app.py. I'd be interested in your opinion on how close this would get us to resolving the issues you're bringing up.
@willbarrett maybe there's a disconnect with my understanding of FAB. Does FAB prevent us from defining all the API endpoints as blueprints and housing these under a root /api directory?
@john-bodley My understanding here is a little fuzzy too, so I'll let @dpgaspar correct me if I'm wrong. FAB creates blueprints under the hood whenever we use it to define endpoints. ModelView and ModelViewApi classes create collections of endpoints under a blueprint, which is then registered on the application when we call appbuilder.add_view... or a similar function. If we wanted to group those items together into an API blueprint, it would not be possible, since that would result in nested blueprints.
@dpgaspar how far away from the truth am I :)?
@willbarrett that's a good explanation.
@john-bodley
The proposed FAB model is that a blueprint is associated with a BaseApi class, and the blueprint's url_prefix is defined by the route_base class attribute. By default BaseApi and ModelRestApi will compose /api/<version>/<resource_name>/ for the url_prefix of the blueprint.
It's possible to override url_prefix as long as the endpoints don't collide.
Yet, I'm curious about the need to create a big blueprint that holds all API endpoints? Note that we aggregate all API classes under a base API class, where we can impose specific behavior/config.
I welcome this effort, and the amount of circular imports that typing has exposed really shows that something needs to be done. To get started, I would almost recommend grepping for packages with TYPE_CHECKING to see the most common circular dependencies, as those are most likely the most spaghettied pieces of code that need to be untangled. If it is possible to start by refactoring those into their new respective locations, we would be well positioned to continue refactoring the rest of the code as proposed.
[SIP] Proposal for Improving Superset’s Python Code Organization
Motivation
As I was in the weeds fixing a bunch of Pylint failures, Max and I started going back and forth on this PR: https://github.com/apache/incubator-superset/pull/8777, which we ultimately closed. The root cause of that was a lack of shared understanding on the best code structure for Superset going forward. Talking with others at Preset, I realized that the issue was larger than just a new contributor not understanding project practices. Without a shared understanding, we are lacking a cohesive approach towards refactoring Superset - we need a technical North Star for project structure.
Preset’s developers met and identified a number of pain points in the existing code base. Here they are, with a bit of color to make the meaning clear:
Proposed Change
In order to address these concerns, we’d like to propose a few guiding principles for Python code structure going forward:
Keep __init__.py files empty
New patterns to introduce:
Command:
There are multiple patterns named Command, and the one we reference here is most similar to an Alembic migration. Commands perform a single cohesive business operation, have a very small public API, and either entirely succeed or entirely fail. In practice, this requires that database transaction behavior be controlled at the Command level. When possible, commands should be modeled such that they perform a single database transaction. In the case of failure, they raise an exception or return an error code, and it is the responsibility of the caller to translate that into a correct response for a user.
Example command:
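The example that originally followed is not reproduced here; a minimal hypothetical sketch of the pattern as described above (single cohesive operation, small public API, succeed entirely or raise, transaction controlled at the Command level) might look like this. The class names and the session object are illustrative, not Superset's actual API:

```python
class DashboardNotFoundError(Exception):
    """Raised by the command; the caller translates this into a user response."""


class DeleteDashboardCommand:
    # Small public API: everything the command needs is supplied up front,
    # and run() is the only entry point.
    def __init__(self, session, actor, dashboard_id):
        self._session = session
        self._actor = actor
        self._dashboard_id = dashboard_id

    def run(self):
        """Perform one cohesive business operation: succeed entirely or raise."""
        dashboard = self._session.get(self._dashboard_id)
        if dashboard is None:
            raise DashboardNotFoundError(self._dashboard_id)
        self._session.delete(self._dashboard_id)
        # Transaction behavior is controlled here, at the Command level.
        self._session.commit()
```

An API endpoint, CLI invocation, or Celery task would each construct the command, call `run()`, and map any raised exception to its own error format.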
DAO (Data Access Object):
A DAO in the context of Superset would be an object that would manage the lifecycle of SQLAlchemy models. Custom queries would be implemented in DAOs and then called from Command objects. In comparison to Command objects, DAOs have relatively broad public interfaces. DAOs should not depend on other DAOs to avoid circular dependencies. If results are needed cross-DAO, that should be orchestrated in the Command. Here’s a sample simplified DAO for illustrative purposes:
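The sample DAO that originally followed is not reproduced here; a hypothetical simplified sketch matching the description above (query logic pulled off the model, relatively broad public interface, no DAO-to-DAO dependencies) could look like this. The session object is a stand-in for a SQLAlchemy session and the method names are illustrative:

```python
class DataAccessObject:
    """Hypothetical base class: each DAO owns query logic for one model type."""

    def __init__(self, session):
        self._session = session


class DashboardDAO(DataAccessObject):
    def find_by_id(self, dashboard_id):
        # Custom queries live here instead of on the model class.
        return self._session.get(dashboard_id)

    def update(self, dashboard, dashboard_params):
        # Mutation logic also lives here; Commands call this rather than
        # reaching into the session themselves.
        dashboard.update(dashboard_params)
        self._session.commit()
```

Cross-entity operations would be orchestrated in a Command that calls several DAOs, keeping the DAOs themselves independent of each other.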
Proposed example package structure that follows the above principles:
In this design, all systems related to a specific back-end resource have been grouped under a top-level folder.
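The tree that accompanied this section is not reproduced here. Pieced together from the module layout discussed in the comments above (per-resource folders with commands, api, and dao modules), it was presumably along these lines:

```
superset/
├── datasets/
│   ├── api.py
│   ├── commands/
│   ├── dao.py
│   └── schemas.py
├── dashboards/
│   └── ...
├── charts/
│   └── ...
└── app.py
```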
__init__.py files should be left empty to enable only pulling in the portions of the system necessary for a specific entrypoint (Celery shouldn't need api.py or views.py, for instance)
New or Changed Public Interfaces
Over time, the internals of Superset will evolve towards the new structure. Public HTTP interfaces will not be likely to change as a result of the above proposal, but code will move and alter to conform. This will impact organizations that apply their own customizations to Superset.
New dependencies
None
Migration Plan and Compatibility
Introduce refactors to existing code at a manageable pace to allow organizations relying on Superset internals time to adapt.
Rejected Alternatives
Preset discussed Service Objects as an alternative to both Commands and DAOs. We felt that Commands provided easier entrypoints for the ports of our application (API endpoints, views, command line invocations, Celery tasks) than Service Objects, and that introducing DAOs as well helped further break down concerns.
We also considered structuring top-level folders by function (api, models, etc.) but found this resulted in drastically more Python modules overall without substantially simplifying the question of where code should live.
Individuals consulted in creating this SIP
@mistercrunch @craig-rueda @dpgaspar @robdiciuccio @suddjian @nytai