doctrine / dbal

Doctrine Database Abstraction Layer
https://www.doctrine-project.org/projects/dbal.html
MIT License
9.47k stars 1.34k forks source link

RFC: Improve the APIs around handling SQL object names #4772

Open morozov opened 3 years ago

morozov commented 3 years ago

Motivation

There's a bunch of issues around handling object names that exist due to the existing design assumptions (e.g. https://github.com/doctrine/dbal/issues/4357). The assumptions are:

  1. Only the identifiers that are SQL keywords should be quoted. The list of keywords managed by the DBAL is always up to date.
  2. Object names are case-insensitive on all platforms under any circumstances.
  3. Unqualified object names cannot contain dots (.).
  4. Object names cannot contain special characters (e.g. `).

The above assumptions lead to the API where any name is represented as a string which then has to be parsed to identify its meaning. E.g.:

  1. The quotes are stripped from a quoted name but without considering the escaping rules.
  2. A dot in the middle of the name is considered to be the delimiter between the name and its qualifier.

Proposal

This design proposal attempts to replace parsing the string with representing names as domain objects.

The name interface

The corner-stone interface describing object names will be Name:

/**
 * SQL object name.
 */
interface Name
{
    /**
     * Returns object name representation as an SQL literal.
     */
    public function toLiteral(NameBuilder $builder): string;

    /**
     * Returns object name representation as an SQL identifier.
     */
    public function toIdentifier(NameBuilder $builder): string;
}

It will represent any name: qualified or unqualified; quoted or unquoted and will be used by all SQL objects like tables, views, columns, databases, etc.

Note, currently, there's no plan to expose the raw value of the name. This will make it easier to abuse it in the raw form which may be the source of all sorts of issues (e.g. SQL injections).

A name can be used in two forms in SQL, hence the methods:

  1. A literal. The name will be represented as an SQL literal primarily during schema introspection. For instance:
    SELECT * FROM columns WHERE table = <literal>
  2. An identifier. The name will be represented as an SQL identifier when the corresponding object is used an an SQL statement. For instance:
    SELECT * FROM <identifier>

The SQL Builder interface family

The string representations of the name will be built by an SQL builder that is specific to the given database platform. For the sake of simplicity, we can consider the platform and the SQL builder synonymous for now.

/**
 * SQL builder that builds SQL representing object names. Each database platform must provide an implementation
 * of this interface according to its logic of handling object names.
 */
interface NameBuilder
{
    /**
     * Builds an SQL literal of an unquoted object name.
     */
    public function buildUnquotedNameLiteral(string $name): string;

    /**
     * Builds an SQL identifier of an unquoted object name.
     */
    public function buildUnquotedNameIdentifier(string $name): string;

    /**
     * Builds an SQL literal of a quoted object name.
     */
    public function buildQuotedNameLiteral(string $name): string;

    /**
     * Builds an SQL identifier of a quoted object name.
     */
    public function buildQuotedNameIdentifier(string $name): string;
}

The need to have different methods for representing names as literals and identifiers should be clear from the above. The need to have different methods for quoted and unquoted names will be explained below.

Unqualified and qualified names

Before we get to the details of the unquoted and quoted names, we need to establish a foundation regarding the unqualified and qualified names.

An unqualified name is the "given" name object without the reference to its schema or catalog (e.g. "accounts"):

/**
 * An unqualified name represents the own name of an SQL object within its schema or catalog.
 */
abstract class UnqualifiedName implements Name
{
    protected string $name;

    public function __construct(string $name)
    {
        $this->name = $name;
    }
}

A qualified name consists of the qualifier (which can be a qualified name itself) and an unqualified name. For instance, a table named "accounts" contained in the "accounting" schema can be referenced as accounting.accounts, or even more specifically, customer1.accounting.accounts where "customer1" is the name of the database or a catalog to which the "accounting" schema belongs.

/**
 * A qualified name consists of an unqualified name and a qualifier.
 */
final class QualifiedName implements Name
{
    private Name $qualifier;
    private UnqualifiedName $name;

    public function __construct(Name $qualifier, UnqualifiedName $name)
    {
        $this->qualifier = $qualifier;
        $this->name      = $name;
    }

    public function toLiteral(NameBuilder $builder): string
    {
        throw new LogicException("A qualified name cannot be converted to an SQL literal.");
    }

    public function toIdentifier(NameBuilder $builder): string
    {
        return $this->qualifier->toIdentifier($builder) . '.' . $this->name->toIdentifier($builder);
    }
}

Note that a qualified name is a compound object, so it can be used as an identifier but there's no scenario where it would be used as a literal. Hence, the corresponding method will not be implemented and isn't supposed to be used.

Unquoted and quoted names

An unqualified name can be unquoted or quoted. There are two reasons why a name may need to be quoted in SQL:

  1. The name represents an SQL keyword. The quoting may be needed for the SQL parser on a certain database platform to interpret the token as an identifier rather than a keyword.

    The following query is syntactically invalid in MySQL:

    SELECT * FROM from

    But this one is valid:

    SELECT * FROM `from`
  2. The that is not fully upper-cased must retain its original case on the platforms like Oracle and IBM DB2.

    In the the following query, the name will be internally converted to the upper case:

    SELECT * FROM accounts -- converted internally to SELECT * FROM ACCOUNTS

    But in this one it will remain lower-cased:

    SELECT * FROM "accounts"

Proper design and implementation of the casing aspect will allow us to quote identifiers unconditionally (i.e. always) which will remove the requirement to maintain the keywords lists and improve the stability and security aspects of the API.

/**
 * An unquoted name is an unqualified name that is not quoted.
 */
final class UnquotedName extends UnqualifiedName
{
    public function toLiteral(NameBuilder $builder): string
    {
        return $builder->buildUnquotedNameLiteral($this->name);
    }

    public function toIdentifier(NameBuilder $builder): string
    {
        return $builder->buildUnquotedNameIdentifier($this->name);
    }
}

/**
 * An unquoted name is an unqualified name that is quoted.
 */
final class QuotedName extends UnqualifiedName
{
    public function toLiteral(NameBuilder $builder): string
    {
        return $builder->buildQuotedNameLiteral($this->name);
    }

    public function toIdentifier(NameBuilder $builder): string
    {
        return $builder->buildQuotedNameIdentifier($this->name);
    }
}

Unquoted name normalization

Depending on the destination database platform, an unquoted name may need to be normalized before getting quoted. Currently, this is implemented in an obscure and non-systematic way like the following:

  1. https://github.com/doctrine/dbal/blob/03a976a687a523e93f5f859f18590f9f47468e0b/src/Platforms/OraclePlatform.php#L514
  2. https://github.com/doctrine/dbal/blob/03a976a687a523e93f5f859f18590f9f47468e0b/src/Platforms/DB2Platform.php#L254

Instead, we can introduce an interface that would describe this concern and be implemented by the platforms as required:

final class DefaultUnquotedNameNormalizer implements UnquotedNameNormalizer
{
    public function normalizeUnquotedName(string $name): string
    {
        return $name;
    }
}

final class UpperCasingNameNormalizer implements UnquotedNameNormalizer
{
    public function normalizeUnquotedName(string $name): string
    {
        return strtoupper($name);
    }
}

The implementations of the NameBuilder interface will depend on UnquotedNameNormalizer.

Object sets

There is a common component missing in the Schema API that would allow implementing a set of named database objects. Currently, the Schema and the Table classes have their own implementations of the set of Tables and Columns. A unified API is needed because the object names will be considered not only depending on their value but also depending on the platform in which the objects will be created.

Breaking API changes

Once the above APIs are implemented and solidified, we need to transition the APIs under the Schema namespace to use the names expressed as objects instead of plain strings:

  1. The objects that reside immediately within a schema (e.g. tables), instead of string $name will require an UnqualifiedName $name in their constructor. The same applies to the objects referencing such objects (e.g. the referenced table of a foreign key).
  2. All other objects like columns, indexes, etc. will require an UnqualifiedName $name.

To implement that, some breaking API changes are needed around schema introspection. Specifically, in the implementation of the AbstractSchemaManager::list*(). See the "Reasoning behind the design changes" section in https://github.com/doctrine/dbal/pull/4548.

This is where we're trying squeeze a qualified name into a string:

https://github.com/doctrine/dbal/blob/cc378c34815cf3377312888a71047b7e167f7a76/src/Schema/PostgreSQLSchemaManager.php#L185

And then parse: https://github.com/doctrine/dbal/blob/dc4ad2a0d7f72e9e55dd1f9e2a341c068fd2b89f/src/Schema/AbstractAsset.php#L47

And miserably fail: https://github.com/doctrine/dbal/pull/4708.

Get rid of AbstractAsset

  1. All objects have a name but for some of them it's nullable (e.g. constraints) while for others it's not (e.g. tables). Keeping it defined in a base class is a static analysis nightmare.
  2. Some objects allow a qualified name (e.g. tables when used with PostgreSQL) but others don't (e.g. columns when defined within the table). Keeping the name defined in the base class again makes it challenging for static analysis.
  3. Generation of nested object names (e.g. when creating an index on the table) is definitely not the responsibility of an abstract asset.
greg0ire commented 3 years ago

Really nice RFC! Here are the few remarks I could come up with:

Hence, the corresponding method will not be implemented and isn't supposed to be used.

Wouldn't it be easier for static analysis to catch that if the method was really not implemented? Technically, it is.

That would require splitting Name, having either

interface A
{

    /**
     * Returns object name representation as an SQL identifier.
     */
    public function toIdentifier(NameBuilder $builder): string;
}

interface Name extends A
{
    /**
     * Returns object name representation as an SQL literal.
     */
    public function toLiteral(NameBuilder $builder): string;
}

final class QualifiedName implements A
{
   // …
}

or the more symmetrical

interface A
{

    /**
     * Returns object name representation as an SQL identifier.
     */
    public function toIdentifier(NameBuilder $builder): string;
}

interface B
{
    /**
     * Returns object name representation as an SQL literal.
     */
    public function toLiteral(NameBuilder $builder): string;
}

interface Name extends A,B
{
}

final class QualifiedName implements A
{
   // …
}

The that is not fully upper-cased

A noun seems to be missing in this sentence.

I think I overall agree with what I've read here, but it would be nice to see the impact on the docs directory to better understand what impact this will have on users. For instance, I'm unsure how much they are going to use the interfaces proposed above (and I'm hoping they won't have to do it unless required).

morozov commented 3 years ago

The that is not fully upper-cased

A noun seems to be missing in this sentence.

It should be "The name that is not fully upper-cased". Thanks!

Wouldn't it be easier for static analysis to catch that if the method was really not implemented?

It definitely would. I'll do more prototyping and will try to take this into account.

FWIW, the RFC is far from being complete. There are more things to take care of, e.g. comparing the names that are now objects for equality and using them as keys in collections of named objects (e.g. table columns or schema tables). I'll continue adding more details once I get them sorted out.

beberlei commented 3 years ago

Do you see the ORM using this in the metadata drivers to convert each field to the target platforms identifier once during parsing for example? I'd like to keep this out of "regular" runtime and cache the results. It looks like that would be feasible.

morozov commented 3 years ago

So far, I've been thinking about this in the context of schema management. If the ORM uses this API for building SQL (which would make perfect sense), it can cache the names as long as the schema remains unchanged.