edgedb / edgedb

A graph-relational database with declarative schema, built-in migration system, and a next-generation query language
https://edgedb.com
Apache License 2.0

Naming conventions for scalars and object types. #983

Closed vpetrovykh closed 4 years ago

vpetrovykh commented 4 years ago

Currently there's a convention in EdgeDB to name scalar types using lower_case with an added _t at the end. Object types are named using CamelCase.

The _t is meant to easily distinguish scalar type names and other lower-case names like functions and links. Abstract links are especially important here since they exist in the same namespace as the scalar types and can't use the same name. Although the problem of clashing names is a real one and needs to be solved, adding _t is somewhat awkward as a solution. In the general case of all scalar types it may be that "give better names" is the only generic reasonable advice.

However, there's a subset of scalar types that may have a different solution - enums. We could have a convention that the enum type names should be ALLCAPS to kind of remind what these types are. Using ALLCAPS for special constants is common practice in some programming languages and enums are conceptually similar to specialized constants.

Object type names should probably keep using CamelCase.

Whatever convention we agree on should be reflected in our own built-in libraries. Incidentally, we still have some remnants of "old-style enums" that emulate enum behavior by using a one_of constraint. The good news is that the only built-in scalars following the _t naming are the ones that are functionally enums; other built-in scalars just have plain readable names.

tailhook commented 4 years ago

What are examples of the languages which use ALL_CAPS enums?

Both Python and Rust use CamelCase for all custom types: objects, scalars, and enums. And that is fine for me.

Python has ALL_CAPS for the variants of the enum, which kinda makes sense for Python as they are essentially constants. But not Rust. Also, we aren't discussing the variants here.

vpetrovykh commented 4 years ago

I mentioned that ALLCAPS are used for special constants, like STD_LIB and STD_MODULES in our own codebase.

More importantly, we need some way to come up with names for a common case of scalars - enums - that doesn't make them visually similar to object types. In Python, for example, there's no meaningful difference between a str and some user-created type because they are all subtypes of object and behaviors can be implemented via magic methods. This is not the case for EdgeDB types: there are fundamental differences in behavior between scalars and objects, which in the past we tried to unify, but we encountered a number of problems (including gross inefficiency and really messy semantics).

tailhook commented 4 years ago

I don't understand what makes this such an important difference. In Python, there is also a lot of difference between immutable and mutable types, between subclasses of the built-in scalars and normal classes, and between namedtuples and data classes. But we can live with that.

elprans commented 4 years ago

@vpetrovykh What is the actual argument against using CapsName? In the schema you can easily distinguish scalars from non-scalars, because the former are used in properties, and the latter in links. Their appearance in casts is also unambiguous, because we prohibit object types in casts. So it seems to me that we're overthinking the dangers of scalars using the same naming convention as object types.

vpetrovykh commented 4 years ago

In Python, all the difference in type behaviors is controlled by the user. You could make your mutable type into an immutable. You could subclass an existing "scalar" (like str) or make your own with some arbitrary additional attributes, not to mention methods. This is not possible with EdgeDB types. A str cannot gain additional properties. If you wrap it into an object type, then you'll lose the ability to perform concatenation and comparison on it (and compatibility with plain str). There's really much less flexibility in transforming one type into another in EdgeDB as compared to Python.

type Color {
   required property value -> str {
      constraint one_of('RED', 'GREEN', 'BLUE')
   }
}

is very different from

scalar type color_enum extending enum<'RED', 'GREEN', 'BLUE'>;

and in turn different from

scalar type color_str extending str {
   constraint one_of('RED', 'GREEN', 'BLUE')
}

Consider the following:

db> SELECT <color_enum>'RED' < <color_enum>'BLUE';
{true}
db> SELECT <color_str>'RED' < <color_str>'BLUE';
{false}
db> SELECT <Color>'RED' < <Color>'BLUE';
QueryError: cannot cast 'std::str' to 'default::Color'
### SELECT <Color>'RED' < <Color>'BLUE';
###      

You can compare two Color objects using a <, but it will have nothing to do with the value. I just didn't want to write out a bigger example.

There's really nothing much you can do to change this difference as it's inherent in each respective type. It really helps clarity if the type name hints at what you might expect the behavior to be.
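The ordering results above can be pictured with a rough Python analogy. This is only a sketch: ColorEnum and color_str are hypothetical stand-ins rather than EdgeDB APIs, and it assumes the enum orders its values by declaration order while the constrained string orders them lexicographically.

```python
from enum import IntEnum


class ColorEnum(IntEnum):
    """Hypothetical stand-in for color_enum: ordered by declaration."""
    RED = 1
    GREEN = 2
    BLUE = 3


ALLOWED = {"RED", "GREEN", "BLUE"}


def color_str(value: str) -> str:
    """Hypothetical stand-in for color_str: a plain str limited to ALLOWED."""
    if value not in ALLOWED:
        raise ValueError(f"invalid color: {value!r}")
    return value


# Declaration order: RED comes before BLUE.
enum_result = ColorEnum.RED < ColorEnum.BLUE

# Lexicographic order: "R" sorts after "B", so this comparison is false.
str_result = color_str("RED") < color_str("BLUE")

print(enum_result, str_result)
```

The same three labels give opposite answers depending on which kind of type carries them, which is the behavior a type name could hint at.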

@elprans how likely are we to implement operator overloading to erase all these differences between types? Can we erase all differences? What about adding properties to scalars? If we can't erase the differences, what is the reason for using CamelCase where snake_case will do and is more consistent with builtin scalar naming and helps maintain type separation in an easy visual way? It's CamelCase for naming scalars that needs justification, because it boils down to "we don't have a style-guide, do as you please", which kinda defeats the point of a style guide. Unless, of course if you also rename other scalars (Int64, Str, Bytes, etc.)

tailhook commented 4 years ago

"what is the reason for using CamelCase where snake_case will do and is more consistent with builtin scalar naming and helps maintain type separation in an easy visual way?"

The reason is that enum types are very often named the same as the property:

property color -> Color;  # nice
property button_color -> color;  # artificial prefix on name
property color -> color_enum;  # artificial suffix on type
property color -> button_color;  # isn't always useful
property color -> colour;  # WTF?

Well, surely this is a bit of speculation, because there might be a useful name in a specific use case.

But generally, I think all user types might be camel cased.

"Unless, of course if you also rename other scalars (Int64, Str, Bytes, etc.)"

That distinction of the built-in or rather fundamental types is present in many languages (python, rust, typescript to name a few), and might be more important than object vs scalar type.

tailhook commented 4 years ago

By the way, @vpetrovykh, what you have shown is very similar to what you have in Python by deriving from a string, named tuple or a data class. Yes, you can control less behavior than in Python. But, for example, in TypeScript you can't control operators either, and still the naming convention for enums is the same.
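For readers less familiar with the Python side of this comparison, here is a minimal sketch. ColorStr and ColorObj are hypothetical names used only for illustration: the first derives from str and keeps string operators, while the second wraps a string in a data class and does not support them unless they are added by hand.

```python
from dataclasses import dataclass


class ColorStr(str):
    """Hypothetical str subclass; inherits concatenation and comparison."""


@dataclass(frozen=True)
class ColorObj:
    """Hypothetical wrapper holding the string in a field."""
    value: str


# The subclass still behaves like a string.
joined = ColorStr("GREEN") + "ISH"

# The wrapper defines no __add__, so concatenation raises TypeError.
try:
    ColorObj("GREEN") + "ISH"
    wrapper_supports_concat = True
except TypeError:
    wrapper_supports_concat = False

print(joined, wrapper_supports_concat)
```

Both classes would conventionally be given CamelCase names in Python even though their behavior differs.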

vpetrovykh commented 4 years ago

db> SELECT <color_str>'GREEN' ++ 'ISH';
{'GREENISH'}
db> SELECT <color_enum>'GREEN' ++ 'ISH';
QueryError: operator '++' cannot be applied to operands of type 'default::color_enum' and 'std::str'
Hint: Consider using an explicit type cast or a conversion function.
### SELECT <color_enum>'GREEN' ++ 'ISH';
###        ^

So when I see <Color>'GREEN' does it really help me? My point is that this example is very contrived. In practice you likely either have lots of different properties like bg_color and button_color, or conversely the scalar is more specific, like rgb_color or named_color, rather than just color. It might make sense to give the property a generic name, but giving a very specifically implemented type a generic name is like saying that number is naturally a float64. Sure, some languages do that, but that's ultimately bad naming since both int64 and float64 are intuitively "numbers", but one of them will cause problems sometimes.

At the end of the day no-one can stop the user from giving bad names, but I'm not sure what the value is in encouraging this practice in a style guide. The rule should be "do your future self a favor and make the name hint more clearly at what it is implementation-wise", this makes it easier to use types correctly.

Compare:
1) Can I use Color as array index?
2) Can I use color_enum as array index?
3) Can I use color_idx as array index?

1st1 commented 4 years ago

Arguments for CamelCase for enums:

1.1. naming is easy and straightforward
1.2. used in many other languages: python, rust; IOW using CamelCase isn't a weird novel idea
1.3. the 1.2. point is nice given that we will reflect our schema to rust/python/js in the near future

Arguments for snake_case for enums:

2.1. we already use it for some builtin scalar types (e.g. cal::local_time)
2.2. the convention so far was to visually separate scalar types from object types since they are extremely different

While I hear Victor's arguments re 2.2 I think they are a bit far-fetched. In the end, Elvis is right, what type is what is immediately visible in the schema.

What I'm more concerned with is our types in the schema module:

CREATE SCALAR TYPE schema::cardinality_t EXTENDING std::str {
    CREATE CONSTRAINT std::one_of ('ONE', 'MANY');
};

CREATE SCALAR TYPE schema::target_delete_action_t EXTENDING std::str {
    CREATE CONSTRAINT std::one_of ('RESTRICT', 'DELETE SOURCE', 'SET EMPTY',
                                   'SET DEFAULT', 'DEFERRED RESTRICT');
};

CREATE SCALAR TYPE schema::operator_kind_t EXTENDING std::str {
    CREATE CONSTRAINT std::one_of ('INFIX', 'POSTFIX', 'PREFIX', 'TERNARY');
};

CREATE SCALAR TYPE schema::volatility_t
    EXTENDING enum<'IMMUTABLE', 'STABLE', 'VOLATILE'>;

Renaming cardinality_t to cardinality_enum or cardinality_str won't help anyone. Those names are plain ugly and frankly, as a user of the introspection API, I don't much care about their specific types. Keeping the current name isn't an option either. Cardinality seems to be the best name possible and I don't think forcing snake_case is so necessary here. Same goes for Volatility, TargetDeleteAction, and OperatorKind.

So I'm, at least given the arguments I'm hearing now, in favor of using CamelCase for enum types or for scalar types with value-limiting constraints (enum-like).
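To illustrate point 1.3, here is a sketch of how a CamelCase enum name could carry over unchanged into Python. It is purely an assumption about what reflected code might look like, not an actual EdgeDB client API; only the name Volatility and its three values come from the schema above.

```python
import enum


class Volatility(enum.Enum):
    """Hypothetical Python reflection of the volatility enum in the schema module."""
    IMMUTABLE = "IMMUTABLE"
    STABLE = "STABLE"
    VOLATILE = "VOLATILE"


# Looking up a member by its string value, as a client might do when
# decoding a query result (assumed usage).
parsed = Volatility("STABLE")
member_names = [member.name for member in Volatility]

print(parsed, member_names)
```

With a snake_case name such as volatility_t, the reflected class would need either a renaming rule or a name that goes against Python's usual class naming.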

As for how users should name their own scalar types -- I don't even think we need a convention here. Python recommends using CamelCase for types derived from str, int, etc, but vaguely and it's not enforced. And it seems to be OK. ¯\_(ツ)_/¯

vpetrovykh commented 4 years ago

To be clear, I tend to go through the same process every time we have a suggested change: I try to categorize the reasons for change into subjective and objective ones. Ideally, the change should be driven by objective reasons, because subjective reasons tend to be more unstable over time.

So we have a database with different entities in it. It is an objective fact that they are different because these entities have different inherent functionalities (annotations, types, functions, etc.). It's also objectively true that a naming convention reduces the cognitive burden when looking at expressions because it potentially makes it unnecessary to perform one more definition look-up when some properties of the entity can be inferred from the name and general syntax. If this weren't true we wouldn't mind naming everything as, say, shortest valid alphanumeric identifiers to save keystrokes or to simply fit more info onto each line. In principle, every concept in EdgeDB is completely disambiguated by looking up its definition or some keyword or other syntax at the point of usage. Actually, it's precisely because this is the case that we don't impose special syntax restrictions on naming and instead are considering a guide.

It is objectively useful to have different naming strategies for different entities. So in terms of types we have two broad categories that can have names: scalar types and object types. Scalars have a whole bunch of operators that are only applicable to them, while object types are exclusive in their use of properties and links (especially multi links/props, which makes them different even from tuples, even though we aren't discussing tuples here). It's apparently useful to have different naming conventions for these two categories in the same way that it's useful for having different naming conventions for links and object types.

Finally, this brings me to the point of enums. They are scalars. They are defined as scalar type and are functionally like other scalars:

If in every way they function like scalars, what is the reason to use the naming convention that is identical to object types? Why does this logic not apply to other scalars, links, annotations, constraints, functions, etc.? What makes enums inherently so much more like object types than any of these other concepts? Specifically, why don't we apply this logic to differentiate constraints from functions? Links from properties?

As for subjective reasons such as "it looks nice", I want to remind and/or tell the story of renaming "int" and "float". We used to have int and float as scalar types in std, they were exactly equivalent to current int64 and float64. These original names seemed "nice" at the time and are very commonly used in other programming languages and highly recognizable. Yet we decided to opt for the "uglier", but more explicit int64 and float64. One big reason was that the nicer generic names could be ambiguous and misleading precisely because they were lacking any extra suffixes to indicate a very important detail of their implementation when there are technically different implementation options and the "obvious default" is not obvious at all. Another reason was that if we ever get int128 as a new standard data type we couldn't just silently upgrade the int to mean that one and would potentially lead users to use the less useful int64 version (much like in modern times int32 as default is not so great since 2,000,000,000+ is not so big even as a counter in some applications). The moral of the story is that we already came to the conclusion that suffixes and more explicit scalar names are a good solution. Even though some names end up looking "ugly". Why are enums different?
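The remark about int32 counters can be checked with a little arithmetic. This is a minimal sketch in which fits is a hypothetical helper and the sample counter value is made up for illustration.

```python
def fits(value: int, bits: int) -> bool:
    """Return True if value is representable as a signed integer of the given width."""
    low = -(2 ** (bits - 1))
    high = 2 ** (bits - 1) - 1
    return low <= value <= high


INT32_MAX = 2 ** 31 - 1  # 2,147,483,647
INT64_MAX = 2 ** 63 - 1

# A hypothetical counter just past three billion.
counter = 3_000_000_000

fits_int32 = fits(counter, 32)
fits_int64 = fits(counter, 64)

print(INT32_MAX, fits_int32, fits_int64)
```

A bare name like int says nothing about which of these limits applies, whereas int64 states it directly.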

Conclusion: I would argue that foo_enum is a superior scalar name to Foo for the same reasons that int64 is superior to int. At the cost of an "ugly" suffix it conveys useful information. (Yeah, I'm no longer fond of the ALLCAPS naming idea, mostly because so far we use ALLCAPS as a convention for EdgeQL keywords and a little bit because I don't want the types to scream at me.)

Are there cases when a name without a suffix would be really optimal? Sure. Can the lower-case name clash with another useful lower-case name? Sure. Is this a problem? No more than a camel-case name clashing with an object type (status vs Status comes to mind). No more than having a very compelling use-case for a property named case, check, match, policy, window, commit, distinct, group, limit, module, order. Each of these is a reserved keyword, which prevents generic common words from being used as "nice looking" property names. The names can be used only if they are quoted, leading to "ugly" paths and shapes.

vpetrovykh commented 4 years ago

After a team discussion the following naming guideline has been agreed to:

The standard library and the docs need to be updated to follow the style guide.

tailhook commented 4 years ago

And what is the naming of bigint? Is it BigInt, to_bigint, to_big_int?

1st1 commented 4 years ago

Let's keep decimal & bigint as is

aeros commented 4 years ago

Mind if I help with updating some of the type names to CamelCase? This seems like a decent intro task to work on.

1st1 commented 4 years ago

Sorry, Kyle, the issue has been resolved as part of the merged PR. Take a look at other open issues.

aeros commented 4 years ago

@1st1 Oh okay, no problem. I saw the merged PR, but I wasn't sure if that included all of the changes. Thanks for letting me know.