Closed dabaer closed 4 years ago
Thanks for the issue. I am using the official docs as reference here.
According to the documentation for citext i see it notes the following as a potential downside:
To be clear, this downside is a downside of using lower, not citext.
Is there be another caveat to using this (lower)?
All of the caveats mentioned in the Postgres page above. Also note citext is an official extension that ships with PG.
Additionally the use of citext has been deprecated as of version 12 as it doesn't handle some unicode cases correctly.
The docs do not mention it as being deprecated. Can you please provide further reference?
Furthermore, none of lower or citext solve the unhandled unicode cases. In terms of behaviour, they are the same, as citext is simply automatically calling the lower
for you.
Case folding vs case mapping is a bit complicated in Unicode and it is also language specific: so unless you are asking users to input their languages and configuring different collations per column, you will not get 100% correct behaviour according to Unicode. There are also other issues with Unicode, such as cluster composition/decomposition and normalization, where two texts look seemingly the same but they are not:
iex(3)> v1 = :unicode.characters_to_nfkd_binary("josé")
"josé"
iex(4)> v2 = "josé"
"josé"
iex(5)> v1 == v2
false
Therefore it is important to put things in context. Our goal here is to provide a basic convenience to avoid the most common pitfalls and not to address all possible cases, which would be very hard and also locale specific, hence citext.
As a user of CockroachDB in PROD and PostgreSQL in dev, I am going to chime in here after I’ve seen a few related issue requests have come and go. And even if Citext is still actively maintained (I don’t see any mention otherwise) - it does bring up another potential point of future proof.
For the community as a whole I think for someone just starting out learning Elixir / Pheonix - installing a DB extension is an extra step that could even cause them to give up if unsuccessful.
Personally I’d rather not deal with installing DB extensions if I don’t have to. I already had my Dockerfile flow set to install PostgreSQL pgcrypto extension - but Citext I had to look into myself since I wasn’t familiar with it, AND then get security department sign off when using the generator for proof of concept service launches. Investigation, emails, meetings .... man hours.
IMO if it’s something community members can contribute to the generator and ditch the dependency - it should be worth exploring.
Btw love your work and attitudes both Aaron and Jose. Big fan of adding baked in auth to Phoenix.
Hi @cohawk!
For the community as a whole I think for someone just starting out learning Elixir / Pheonix - installing a DB extension is an extra step that could even cause them to give up if unsuccessful.
The citext extension ships as part of PG. You don't have to install it. The only command you need to run is create extension citext
and that is done as part of migrations. Therefore which complexities or point of failures am I missing?
Hi Jose. I didn’t mean to comment after the issue was closed and you well explained the Unicode hurdle (which doesn’t apply to my internal EN only tools).
Yes, it is one command in the migrations. An additional command for the dockerfile setup (which yeah majority are not using).
I was just pointing out that one additional command could sometimes trip people up at the start, and maybe cause complexities later on (duplicating functionality for not PostgreSQL in my case).
This generator is basic convenience functionality. If it’s not intended to eventually be a full fledged feature of Phoenix - then I’m still happy to have it.
Yes, it is one command in the migrations. An additional command for the dockerfile setup
Postgres started to bundle contrib as part of its default distribution from postgresql-10 in its official packages. That includes Ubuntu and Debian. Homebrew also includes it as well as the official Windows binaries. Which OS are you using that requires an extra command in your Dockerfile? It would be good to know to note it in the docs.
IMO if it’s something community members can contribute to the generator and ditch the dependency - it should be worth exploring.
Btw, my concern with moving in this direction is that I legitimately think we will be giving the wrong advice. Ecto has always preferred to work closely with the database and telling people to use lower
, which according to Postgres docs are less efficient and more cumbersome to use than citext, would be teaching them the incorrect way to tackle the problem. So far nobody presented evidence that lower is better.
duplicating functionality for not PostgreSQL in my case
What do you mean? Note that the other supported databases, such as MSSQL and MySQL, handle this at the database level by default since their default collations are case insensitive.
I’m probably wasting your time here, but thanks for the discussion.
Which OS are you using that requires an extra command in your Dockerfile?
Edge case scenario, requirements of everything built from local repos on Gentoo. But I thought on my dev env Ubuntu I had to specifically install the extension. Appears I mixed the 2 up since it’s not in my container code.
What do you mean? Note that the other supported databases, such as MSSQL and MySQL, handle this at the database level by default since their default collations are case insensitive.
I’m not using a supported database, so again apologize for wasting your time.
But I had meant not using lower or placing the task on either Ecto or the DB and perform the checks and conversions in the auth Elixir code.
I’m probably wasting your time here, but thanks for the discussion.
You definitely are not. I want to hear about the corner cases. :)
I’m not using a supported database, so again apologize for wasting your time.
Oh, it is not supported in CockroachDB! Hrm... the pure Elixir solution is not that complicated either. I would define an Ecto type named MyApp.Email
. It needs to implement both cast and dump functions to downcase:
defmodule MyApp.Email do
@moduledoc """
A custom Ecto type that always downcases emails.
"""
use Ecto.Type
def type, do: :string
def cast(string) when is_binary(string), do: {:ok, String.downcase(string)}
def cast(string), do: :error
def dump(string) when is_binary(string), do: {:ok, String.downcase(string)}
def dump(string), do: :error
# The string should already be downcased when coming from the database
def load(string) when is_binary(string), do: {:ok, string}
def load(string), do: :error
end
And now you use MyApp.Email
instead of the :string
type for the e-mail field. But again, it does feel like the wrong way to go about it, especially as it is not a requirement in other DBs. :(
And that’s much better looking that where I blasted String.downcase in my contexts and changesets.
I don’t want the database to do anything more than it’s already doing. Just mostly a generalization I’ve accumulated from working with DBAs. It might even be more performant for the DB to handle in this use case. I’ve just also always questioned the dependency (esp since it caused myself more work).
Thank you for the response Jose. That cleared up a few misconceptions I had about both lower and citext.
To be clear, this downside is a downside of using lower, not citext.
As a downside of lower, yes. But I was mainly mentioning that the stated downside is incorrect as I understand it. I use lower() inside of a unique index, and it properly raises when attempting to insert AAAAA
into the database when a record aaaaa
already exists. Though I may misunderstand what they mean.
All of the caveats mentioned in the Postgres page above. Also note citext is an official extension that ships with PG.
I suppose all in all they are both an extra command that could be automatically injected with the generator, thought I'm not sure how a migration handles CREATE EXTENSION
etc between databases.
The docs do not mention it as being deprecated. Can you please provide further reference?
Not deprecated per se, but there is a banner at the top of the newer docs for citext (PG >= 12) stating you should consider using nondeterministic collations instead of citext. https://www.postgresql.org/docs/12/citext.html
I think I will have to look into using nondeterministic collations instead of lower in my projects!
I was reviewing the guides for using this project, and noticed that it currently require the
citext
extension when using the Postgresql DBMS. I was curious on the rationale behind this (and I do see there is a slightly similar issue that was closed as stalled) so I wanted to bring it up again for discussion.According to the documentation for
citext
i see it notes the following as a potential downside:That doesn't seem to be the case for UNIQUE indexes.
I reviewed my own projects and confirmed they have been using this successfully for a very long time:
create index(:account_emails, ["lower(address)"], unique: true)
works as intended and prevents case variation from being entered twice.Is there be another caveat to using this? It didn't really occur to me at the time, but not having a dependency on non-standard extensions would have swayed me if I had originally been considering
citext
.Additionally the use of
citext
has been deprecated as of version 12 as it doesn't handle some unicode cases correctly. Even using the recommended nondeterministic collation method seems less than ideal. Could this be considered, assuming usinglower
also handles these unicode cases?