Rationale for `citext` requirement with Postgres

aaronrenner / phx_gen_auth

An authentication system generator for Phoenix 1.5 applications.

772 stars 55 forks source link

Rationale for `citext` requirement with Postgres #92

Closed dabaer closed 4 years ago

dabaer commented 4 years ago

I was reviewing the guides for using this project, and noticed that it currently require the citext extension when using the Postgresql DBMS. I was curious on the rationale behind this (and I do see there is a slightly similar issue that was closed as stalled) so I wanted to bring it up again for discussion.

According to the documentation for citext i see it notes the following as a potential downside:

If you declare a column as UNIQUE or PRIMARY KEY, the implicitly generated index is case-sensitive. So it's useless for case-insensitive searches, and it won't enforce uniqueness case-insensitively.

That doesn't seem to be the case for UNIQUE indexes.

I reviewed my own projects and confirmed they have been using this successfully for a very long time: create index(:account_emails, ["lower(address)"], unique: true) works as intended and prevents case variation from being entered twice.

Is there be another caveat to using this? It didn't really occur to me at the time, but not having a dependency on non-standard extensions would have swayed me if I had originally been considering citext.

Additionally the use of citext has been deprecated as of version 12 as it doesn't handle some unicode cases correctly. Even using the recommended nondeterministic collation method seems less than ideal. Could this be considered, assuming using lower also handles these unicode cases?

josevalim commented 4 years ago

Thanks for the issue. I am using the official docs as reference here.

According to the documentation for citext i see it notes the following as a potential downside:

To be clear, this downside is a downside of using lower, not citext.

Is there be another caveat to using this (lower)?

All of the caveats mentioned in the Postgres page above. Also note citext is an official extension that ships with PG.

Additionally the use of citext has been deprecated as of version 12 as it doesn't handle some unicode cases correctly.

The docs do not mention it as being deprecated. Can you please provide further reference?

Furthermore, none of lower or citext solve the unhandled unicode cases. In terms of behaviour, they are the same, as citext is simply automatically calling the lower for you.

Case folding vs case mapping is a bit complicated in Unicode and it is also language specific: so unless you are asking users to input their languages and configuring different collations per column, you will not get 100% correct behaviour according to Unicode. There are also other issues with Unicode, such as cluster composition/decomposition and normalization, where two texts look seemingly the same but they are not:

iex(3)> v1 = :unicode.characters_to_nfkd_binary("josé")
"josé"
iex(4)> v2 = "josé"
"josé"
iex(5)> v1 == v2
false

Therefore it is important to put things in context. Our goal here is to provide a basic convenience to avoid the most common pitfalls and not to address all possible cases, which would be very hard and also locale specific, hence citext.

cohawk commented 4 years ago

As a user of CockroachDB in PROD and PostgreSQL in dev, I am going to chime in here after I’ve seen a few related issue requests have come and go. And even if Citext is still actively maintained (I don’t see any mention otherwise) - it does bring up another potential point of future proof.

For the community as a whole I think for someone just starting out learning Elixir / Pheonix - installing a DB extension is an extra step that could even cause them to give up if unsuccessful.

Personally I’d rather not deal with installing DB extensions if I don’t have to. I already had my Dockerfile flow set to install PostgreSQL pgcrypto extension - but Citext I had to look into myself since I wasn’t familiar with it, AND then get security department sign off when using the generator for proof of concept service launches. Investigation, emails, meetings .... man hours.

IMO if it’s something community members can contribute to the generator and ditch the dependency - it should be worth exploring.

Btw love your work and attitudes both Aaron and Jose. Big fan of adding baked in auth to Phoenix.

josevalim commented 4 years ago

Hi @cohawk!

For the community as a whole I think for someone just starting out learning Elixir / Pheonix - installing a DB extension is an extra step that could even cause them to give up if unsuccessful.

The citext extension ships as part of PG. You don't have to install it. The only command you need to run is create extension citext and that is done as part of migrations. Therefore which complexities or point of failures am I missing?

cohawk commented 4 years ago

Hi Jose. I didn’t mean to comment after the issue was closed and you well explained the Unicode hurdle (which doesn’t apply to my internal EN only tools).

Yes, it is one command in the migrations. An additional command for the dockerfile setup (which yeah majority are not using).

I was just pointing out that one additional command could sometimes trip people up at the start, and maybe cause complexities later on (duplicating functionality for not PostgreSQL in my case).

This generator is basic convenience functionality. If it’s not intended to eventually be a full fledged feature of Phoenix - then I’m still happy to have it.

josevalim commented 4 years ago

Yes, it is one command in the migrations. An additional command for the dockerfile setup

Postgres started to bundle contrib as part of its default distribution from postgresql-10 in its official packages. That includes Ubuntu and Debian. Homebrew also includes it as well as the official Windows binaries. Which OS are you using that requires an extra command in your Dockerfile? It would be good to know to note it in the docs.

IMO if it’s something community members can contribute to the generator and ditch the dependency - it should be worth exploring.

Btw, my concern with moving in this direction is that I legitimately think we will be giving the wrong advice. Ecto has always preferred to work closely with the database and telling people to use lower, which according to Postgres docs are less efficient and more cumbersome to use than citext, would be teaching them the incorrect way to tackle the problem. So far nobody presented evidence that lower is better.

duplicating functionality for not PostgreSQL in my case

What do you mean? Note that the other supported databases, such as MSSQL and MySQL, handle this at the database level by default since their default collations are case insensitive.

cohawk commented 4 years ago

I’m probably wasting your time here, but thanks for the discussion.

Which OS are you using that requires an extra command in your Dockerfile?

Edge case scenario, requirements of everything built from local repos on Gentoo. But I thought on my dev env Ubuntu I had to specifically install the extension. Appears I mixed the 2 up since it’s not in my container code.

What do you mean? Note that the other supported databases, such as MSSQL and MySQL, handle this at the database level by default since their default collations are case insensitive.

I’m not using a supported database, so again apologize for wasting your time.

But I had meant not using lower or placing the task on either Ecto or the DB and perform the checks and conversions in the auth Elixir code.

josevalim commented 4 years ago

I’m probably wasting your time here, but thanks for the discussion.

You definitely are not. I want to hear about the corner cases. :)

I’m not using a supported database, so again apologize for wasting your time.

Oh, it is not supported in CockroachDB! Hrm... the pure Elixir solution is not that complicated either. I would define an Ecto type named MyApp.Email. It needs to implement both cast and dump functions to downcase:

defmodule MyApp.Email do
  @moduledoc """
  A custom Ecto type that always downcases emails.
  """

  use Ecto.Type

  def type, do: :string

  def cast(string) when is_binary(string), do: {:ok, String.downcase(string)}
  def cast(string), do: :error

  def dump(string) when is_binary(string), do: {:ok, String.downcase(string)}
  def dump(string), do: :error

  # The string should already be downcased when coming from the database
  def load(string) when is_binary(string), do: {:ok, string}
  def load(string), do: :error
end

And now you use MyApp.Email instead of the :string type for the e-mail field. But again, it does feel like the wrong way to go about it, especially as it is not a requirement in other DBs. :(

cohawk commented 4 years ago

And that’s much better looking that where I blasted String.downcase in my contexts and changesets.

I don’t want the database to do anything more than it’s already doing. Just mostly a generalization I’ve accumulated from working with DBAs. It might even be more performant for the DB to handle in this use case. I’ve just also always questioned the dependency (esp since it caused myself more work).

dabaer commented 4 years ago

Thank you for the response Jose. That cleared up a few misconceptions I had about both lower and citext.

To be clear, this downside is a downside of using lower, not citext.

As a downside of lower, yes. But I was mainly mentioning that the stated downside is incorrect as I understand it. I use lower() inside of a unique index, and it properly raises when attempting to insert AAAAA into the database when a record aaaaa already exists. Though I may misunderstand what they mean.

All of the caveats mentioned in the Postgres page above. Also note citext is an official extension that ships with PG.

I suppose all in all they are both an extra command that could be automatically injected with the generator, thought I'm not sure how a migration handles CREATE EXTENSION etc between databases.

The docs do not mention it as being deprecated. Can you please provide further reference?

Not deprecated per se, but there is a banner at the top of the newer docs for citext (PG >= 12) stating you should consider using nondeterministic collations instead of citext. https://www.postgresql.org/docs/12/citext.html

I think I will have to look into using nondeterministic collations instead of lower in my projects!