dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License
13.73k stars 3.17k forks source link

Scaffolding: Allow arbitrary SQL hint to be passed as parameter #31061

Closed alexkeh closed 1 year ago

alexkeh commented 1 year ago

Depending on the database schema, an EF Core scaffolding operation can take a very long time or be quick. Typically, the most time consuming query is when the EF Core provider queries for all the relevant metadata object info. For complex and expansive schemas, the operation can be slow. In those cases, the query can execute faster when provided a SQL hint. The hint text varies based on the schema characteristics and database vendor.

This feature request is to allow the scaffolding command to accept a SQL hint parameter, which the EF Core provider would then apply to the primary SQL that retrieves all the schema metadata info. For example:

dotnet ef dbcontext scaffold "Server=(localdb)\mssqllocaldb;Database=Blogging;Trusted_Connection=True;" Microsoft.EntityFrameworkCore.SqlServer -o Models –-hint ”<HINT TEXT>”

The EF Core provider would insert the hint text into the schema metadata SQL query to optimize.

Because schemas and possible optimizations can vary so much, the provider writer would have great difficulty in customizing hints in real time for an arbitrary schema and scaffolding operation. Instead, the EF Core developer would figure out what hint to use. When they execute the scaffolding command, they would enter in the hint to be applied.

This feature would be useful for EF Core users that scaffold often, such as when they have a DB First focus.

roji commented 1 year ago

Can you provide a concrete example of a scaffolding query and a hint that actually makes it perform significantly faster?

alexkeh commented 1 year ago

@roji Here's the example from the Oracle EF Core provider when scaffolding:

SELECT
/*+ NO_PUSH_PRED(v) */
    u.*,
    v.trigger_name,
    v.table_name,
    v.column_name,
    v.table_owner
FROM
    (
        SELECT
            sys_context('userenv', 'current_schema') AS schema,
            c.table_name,
            c.column_name,
            c.column_id,
            c.data_type,
            c.char_length,
            c.data_length,
            c.data_precision,
            c.data_scale,
            c.nullable,
            c.identity_column,
            c.collation,
            c.data_default,
            c.virtual_column,
            c.hidden_column
        FROM
                 user_tab_cols c
            INNER JOIN (
                SELECT DISTINCT
                    object_name AS table_name
                FROM
                    user_objects
                WHERE
                    object_type IN ( 'TABLE', 'VIEW', 'MATERIALIZED VIEW' )
            ) t ON t.table_name = c.table_name
        WHERE
                t.table_name <> '__EFMigrationsHistory'
            AND ( t.table_name IN ( ... ) )
    )                 u
    LEFT JOIN user_trigger_cols v ON u.table_name = v.table_name
                                     AND u.column_name = v.column_name
                                     AND u.schema = v.table_owner
ORDER BY
    u.column_id

The join to the “user_trigger_cols” view is the problematic part. The hint (NO_PUSH_PRED) works to speed up the query, taking 3.5 seconds to get the full result for 190 tables. Without the hint, it takes about 10 minutes. On a side note, the customer with this experience upgraded from EF Core 2 to version 6. In EF Core 2, the query took 30 seconds vs. 10 minutes with the same query on the same schema in EF Core 6.

Another Oracle user found that the /*+ use_hash(U, V) */ hint sped up the query considerably for them, while the customer that saw good results from /*+ NO_PUSH_PRED(v) */ experienced very minimal benefit with use_hash.

Since the specific hint that benefits each customer the most can be different depending on the schema characteristics, this request allows maximum flexibility by allowing a custom SQL hint to be used. Moreover, different DBs can differ in the hints they have available and syntax.

We've seen as more tables are added to the schema, the time the query takes slows down in proportion. If 1 table takes 15 seconds to scaffold, it can take 35 seconds for 2 tables, and 57 seconds for 3 tables.

The scaffolding command should apply the hint to the query that retrieves the schema info.

roji commented 1 year ago

On a side note, the customer with this experience upgraded from EF Core 2 to version 6. In EF Core 2, the query took 30 seconds vs. 10 minutes with the same query on the same schema in EF Core 6.

This is something I'd like to understand better: while there are indeed many changes between EF Core 2 and 6, at the end of the day scaffolding executes a provider-specific SQL query; so I'm not sure how the EF version by itself could have any impact on query times. Are you saying that the Oracle provider for EF Core 2 used a different SQL, and that's the source of the perf difference?

Another Oracle user found that the /+ use_hash(U, V) / hint sped up the query considerably for them, while the customer that saw good results from /+ NO_PUSH_PRED(v) / experienced very minimal benefit with use_hash.

An important question is whether you're also seeing a negative impact from specifying one of these hints. If not, simply including them in the query (in the provider code) seems like the right thing to do, and would optimize things out-of-the-box for all users.

More generally... Allowing users to inject hints via the command line has several problems:

/cc @bricelam for possibly allowing provider-specific command-line options to flow to the provider's scaffolding code (though again, I'd generally recommend against this approach)

alexkeh commented 1 year ago

Scaffolding executes multiple queries; which one would the switch affect? All of them? What if the user wants to inject one switch to one query, but another to another query?

Only the query that retrieves the schema metadata info would have the hint, not all of them. That consistently has been the most expensive query from scaffolding operations.

A query may involve multiple tables, and the hint could go into several places; how would that be managed?

Only a query hint could be added, not table hints.

Addressing the speed issue via a command-line switch is extremely un-discoverable, requiring the user to understand the perf problem, perform an analysis and come up with the precise hint that would help their case. If at all possible, the provider should do its best to have the hints that make sense by default, so that the regular user gets reasonable (even if not perfect!) performance out-of-the-box.

Yes, it’s a less than perfect proposal. We’re open to alternative possibilities to solve this performance problem.

If it's really necessary to allow users to tweak scaffolding queries to their precise database schema, then my recommendation would be for the user to replace the provider's DatabaseModelFactory and override the specific queries. This allows maximum flexibility (provided your OracleDatabaseModelFactory is factored to allow selective overriding of queries), allowing the advanced user total control over each and every query. My logic here is that if the user is advanced enough to understand the problem and come up with the right hint, they're advanced enough to also do the trivial plugging in of their tweaked OracleDatabaseModelFactory (you could have a sample in your docs to show that). A single "hint" command-line switch simply isn't enough for the general case.

The OracleDatabaseModelFactory is modelled similar to SqlServerDatabaseModelFactory and NpgSqlDatabaseModelFactory. They have not been designed for developers to selectively tweak generated SQLs. If a developer provides their own version of DatabaseModelFactory, then they will have to write code to generate ALL the SQLs involved. Is there a way to selectively override the query that an EF Core provider will use?

roji commented 1 year ago

Before answering the specific questions above, here's a more general question... What OracleDatabaseModelFactory - returning a representation of a database schema - seems like quite a general/universal need. For example, the JDBC specifications have DatabaseMetaData, which also exposes information about the tables/columns in the database. Is there some specific reason

In other words, I'm trying to understand what performance problem exists in OracleDatabaseModelFactory - requiring the aforementioned hint - that doesn't also exist in e.g. JDBC's OracleDatabaseMetaData, and why (I'm not aware of any special EF needs here)... Maybe the JDBC implementation could help arrive at a more efficient implementation, rather than requiring a hint to be injected?

Only the query that retrieves the schema metadata info would have the hint, not all of them. That consistently has been the most expensive query from scaffolding operations. [...] Only a query hint could be added, not table hints.

I understand that this would address the very specific issue you're seeing now with that particular user on Oracle, but it seems problematic as a general mechanism. Tomorrow another user could raise another problematic query, at which point we'd be stuck with just a single "--hint" command line parameter.

They have not been designed for developers to selectively tweak generated SQLs.

You can easily expose protected methods from OracleDatabaseModelFactory whose job is to return the SQL for scaffolding tables, schemas, etc.; the user could override these to provide their own SQL. You could also provide a specific hook to manage the specific hint being discussed above, if you wish.

However, I'd again recommend understanding exactly what it is that makes OracleDatabaseModelFactory require special user intervention in the form of a hint, where the same task elsewhere - exposing database schema information e.g. in JDBC - does not.

ajcvickers commented 1 year ago

We discussed this in triage, and we don't believe that this is the best way to handle poorly performing queries in model scaffolding.