Great start @zbhujwala.
A few observations (some carried over from Summer DASSL):
- Look at other .sql files for some formatting approaches (including the file prologue).
- The UPDATE statement's criterion "contains a number [0-9]" is inconsistent with the WHERE predicate used.
- Use ORDER BY trunc(random()*k+1), where k is one less than the count of rows in the appropriate table.
- Place k in variables and use the variables instead of using constants.
- The LIMIT clause is unnecessary if the range problem is fixed, but please verify.
- Check the behavior of random (I haven't carefully read that function's documentation).
- Use NULL for the "40%".
EDIT: Removed the note on credits. I will send a reworked mail to all to provide some more guidance on the matter.
I believe the LIMIT clause is unnecessary if the range problem is fixed, but please verify
I have tried omitting the LIMIT 1 clause, but I intermittently come across this error:
ERROR: more than one row returned by a subquery used as an expression...
I have therefore left LIMIT 1 in the subquery.
Yes @hunterSchloss, I mixed elements from the two formulae. To make the variable names and their usage consistent, use trunc(random()*k)+1, where k is exactly the count of rows in the table.
Of course, LIMIT 1 is required. Sorry about that.
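For concreteness, a minimal sketch of a SET sub-query that follows the corrected formula (the identifiers HumanFirstNames, Name, ID, and numOfFirstNames are borrowed from snippets later in this thread; numOfFirstNames is assumed to hold the table's row count):
FName = (SELECT Name
         FROM HumanFirstNames
         WHERE HumanFirstNames.ID = trunc(random() * numOfFirstNames) + 1
         LIMIT 1)   --LIMIT 1 guards against the sub-query returning more than one row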
(In conclusion, I am glad we have multiple reviewers. Also, I need that vacation soon.)
A few other observations:
- Change the id field to SERIAL so it is not necessary to specify a value in INSERT statements, and rows can be added/moved around without editing id values.
- I have created a gist reflecting the id field's type. Hopefully this will reduce the work to do. FYI, I will delete this gist after this PR is merged.
- Use INTEGER.
- Use COUNT(*) to be consistent with the formula shown at the beginning of this comment.
- The UPDATE query uses incorrect cases for column names: for example, fname instead of FName.
- The Student table permits NULL in all three name parts and enforces that at least one name part is NOT NULL. So the predicate must test a concatenation of the three name parts: use the concat function, not the || operator.
- What does the sub-query return if a name part (for example, FName) is NULL?
- Consider this WHERE clause in the UPDATE query: WHERE fName ~ '[0-9]'; --test if name is a hash
Here is an example of a block comment at the beginning of a query:
--Humanize every student name that is presently a hash
-- A student name is considered a hash if it is made up of only hex digits;
-- combine all name parts before testing because the schema permits NULLs in all parts.
-- Each sub-query picks one name at random from an appropriate table of names;
-- predicates such as FName LIKE '%' in the sub-queries force the sub-query to be
-- executed once for each student row (create "correlated sub-queries")
This code now reads so well. Thanks @zbhujwala for your effort.
I have some observations to share, but unfortunately cannot write them presently because I have to attend to other matters before my travel begins. I will look at this conversation again soon, and I am certain other reviewers (or even @zbhujwala himself) will have chimed in by then.
Nice work on this file @zbhujwala. All of the changes addressed so far look great.
I have some observations on the latest version of the file (the one in commit e20ffc6); some of these observations may have been noted earlier:
There seem to be some inconsistencies between comments as to the behavior of the script (what is changed and when). From what I have interpreted, the script currently replaces all three names if any of the three names contains a hexadecimal character. One or more comments state a different behavior.
I am wondering if there is a small possibility of no names being picked from the list of human names. The same behavior that (sometimes) causes an error when LIMIT is omitted seems like it could lead to this happening.
The rest of the observations are line specific:
L2: I have not really made sufficient contributions to this task to warrant my name being listed in the credits, so it can be removed.
L17: "This file works on the assumption that the Student table is populated with either hashed or non-hashed names in fName, mName, and lName" The comment is not clear as to the requirements it is stating. It seems like the comment is simply requiring that the Student table must be populated. It might also be excluding the possibility of partially-hashed names. Either way, it can be re-written better convey its message.
L44, L146, and L173: These comments can be reworded to a different tense and to give the exact name of the table that values are being inserted into. For example, "Insert 100 random first names into HumanFirstNames" rather than "Inserting 100 random first names in first names table".
L293: Like the other comments, this one can be clarified a bit.
L295/296: It seems like we are making the subquery volatile rather than non-volatile like the comment states.
L301, L306, and L311: Might want to use .ID instead of .id to match the new column name introduced in the last commit.
L315: I believe @smurthys had mentioned something about making sure each name consists of only hex characters. Currently, the expression matches if any character is a digit or one of the letters A-F. The latter is a problem because non-hashed names can clearly contain the letters A-F. Also, we want to make sure we also match a-f (lowercase), since that is what the original anonymized data had.
In addition to @afig's observations:
- What does the sub-query return if FName is NULL? Note that the WHERE clause uses the AND operator.
- Use ID instead of id; use Name instead of name.
- On the logic of testing: the idea is that the script can be safely run on the Student table anytime and it will detect and humanize only rows that need to be humanized, that is, detect rows with hashes in all name parts. The question then is how to detect if a name has hashes. The proposal is to see if all name parts are composed of only hex digits, with the rationale that no human name would pass that test. I worry I may be missing something.
For this purpose, it is necessary and sufficient to test if concat(FName, MName, LName) contains only hex digits. (concat ignores NULL values, so there is no need for COALESCE.)
Yes, the current predicate does not test if every character is a hex digit; it should. And it should test 0-9, a-f, and A-F.
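A minimal sketch of such a predicate (not from the PR; column names follow the snippets above). The anchored regular expression accepts a name only if every character is a hex digit; the case-insensitive ~* operator could be used instead of listing both letter cases:
--true only when the combined name is non-empty and made up solely of hex digits
WHERE concat(FName, MName, LName) ~ '^[0-9a-fA-F]+$'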
Some notes about commit c45fc13:
Changed regex in order to find appropriate MD5 hashed values of name fields
The current version of this file uses the regular expression FName ~* '[0-9a-f]{32}'. By using ~*, the expression becomes case insensitive and matches either a-f or A-F. The reason for using {32} is that an MD5 output string is always 32 characters long.
The WHERE clause on the UPDATE is now:
WHERE (FName ~* '[0-9a-f]{32}')
OR (MName ~* '[0-9a-f]{32}')
OR (LName ~* '[0-9a-f]{32}');
By using this expression, we are able to handle cases where one (or more) of the name fields are NULL. Dr. Murthy suggested using concat() on all name fields, but doing so would change the regex where it checks that the string length matches the corresponding hash.
Seriously, what does the sub-query return if FName is NULL? Note that the WHERE clause uses the AND operator.
By using the new WHERE clause, if a field is NULL but the other fields contain an MD5 hash, the student will be updated to have a human name. If a row has a NULL name field but the other fields contain a "human" name, the field will not be altered (this SQL procedure is idempotent).
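One quick way to check that idempotence claim (a hedged sketch, not one of the PR's test scripts): snapshot the table after one run of the script, run the script a second time, and verify nothing changed.
CREATE TEMP TABLE studentSnapshot AS SELECT * FROM Student;
--run the humanization script a second time here, then:
SELECT * FROM Student
EXCEPT
SELECT * FROM studentSnapshot;   --should return zero rows if the script is idempotent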
Changed the way subqueries in the UPDATE are made volatile
The subqueries now use SELECT DISTINCT trunc(random() * numOfLastNames + 1) * (length(LName) * 0 + 1). The purpose of the SELECT DISTINCT was to combat the problem that the WHERE clause would be unpredictable and return 0 to many rows when called.
As for the expression * (length(LName) * 0 + 1), this is another dependency to make the subquery correlated to the outer query. In effect, all this expression does is multiply the random number that is generated by 1.
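For reference, a hedged sketch of how one such SET sub-query might look under this description (the identifiers HumanLastNames, Name, ID, and numOfLastNames are taken from snippets in this thread; the actual code in commit c45fc13 may differ):
LName = (SELECT Name
         FROM HumanLastNames
         WHERE HumanLastNames.ID = (SELECT DISTINCT trunc(random() * numOfLastNames + 1)
                                           * (length(LName) * 0 + 1))   --length(LName) refers to the Student row being updated
         LIMIT 1)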
EDIT: Formatting
Thanks for addressing the issues. All changes seem like significant improvements over the previous version of the file.
I have a couple of optimizations that can be considered, which I will add either later tonight or tomorrow once I get a chance to review the file more in-depth.
I agree that fields which were NULL should remain NULL. The current solution takes this into account, which is great. We just have to make sure any future solutions maintain this behavior.
Great work on the new comments, they summarize the behavior very well.
Here are a few points to consider that can each improve the readability and/or efficiency of the script
Using the case-insensitive operator was a good way of resolving that issue, and one that I had forgotten about originally. Given that, there may be an easier way to check for a match on all characters by using operators to do the opposite of both original behaviors. Essentially, use a ! in front of Postgres' regex matching operator to check for a non-match, and see if there is a way to express the opposite regular expression. For example, the original condition was along the lines of 'full name' ~ '[0-9A-F]'; the new one could be similar to 'full name' !~* '![0-9A-F]'. The actual "negation operator" for the regular expression is different, and might require a different syntax altogether. However, the concept stands: match any character that is not 0-9, a-f, or A-F, because the !~* operator will "undo" that so that it matches only strings that contain those characters. This solution will also allow the three name parts to be concatenated, and allow the script to work on hashed name parts that do not contain exactly 32 characters (a new restriction introduced by the latest commit).
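A concrete, hedged sketch of that idea (not from the PR; in Postgres the negation goes inside the character class rather than in front of it):
--true when the combined name contains no character outside 0-9, a-f, A-F
WHERE concat(FName, MName, LName) !~ '[^0-9a-fA-F]'
  AND concat(FName, MName, LName) <> ''   --guard: an all-NULL (empty) name is not a hash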
I do not see how using a SELECT DISTINCT will guarantee that only one row is returned, since it simply removes duplicate values. If the current implementation guarantees returning only one name, then some other facility must be providing that guarantee, unless I am mistaken and the multiple names that were returned were all the same name. However, even that would not explain how it fixes the situation where no names were returned. Additionally, the current location of the DISTINCT does not appear to have any effect, since the subquery already guaranteed a scalar value.
Was there an issue with the previous LIKE '%'? The current workaround to the bug in Postgres makes it harder to see that it does not have any side-effects.
I will leave the rest of the discussion to others, but a quick note on the OR expression in the final WHERE clause:
Consider testing the string COALESCE(FName, LName, MName) instead of testing each name part in an OR expression, because SQL does not do Boolean short-circuiting. Also, this expression can still be tested for 32 characters.
I am a bit apprehensive about testing for 32 characters, because that test assumes MD5, but it is acceptable for now.
EDIT: My suggestion to test COALESCE(FName, LName, MName) has merit, but requires some guarantees from other parts of the system, and I don't think those guarantees are in place yet. So, for now, please ignore pretty much the entirety of this comment. We can revisit this if necessary after this PR is merged.
Was there an issue with the previous LIKE '%'? The current workaround to the bug in Postgres makes it harder to see that it does not have any side-effects.
A couple of issues with LIKE '%' were:
- A name field could be NULL, and therefore LIKE '%' would not be able to pick up those rows.
- The WHERE clause needed a subquery to guarantee only one row would be returned (if a subquery was not used, the random generator would be unpredictable and return 0 to many rows). The subquery must yet again utilize a method to become volatile, hence trunc(random() * numOfNames + 1).
@zbhujwala thank you for your effort.
I am having difficulty justifying a correlated sub-sub-query as a solution. Also, addressing the issues discussed in this conversation requires a holistic approach backed by well-planned unit tests. I worry that by now this PR has taken too much of your time.
In the interest of getting this conversation to converge and this PR towards merging, I propose changing the file to look like that in commit c45fc13, but change the sub-queries in the SET clause to look as follows:
FName = (SELECT Name
FROM HumanFirstNames
WHERE FName LIKE '%'
AND HumanFirstNames.ID = trunc(random() * numOfFirstNames + 1)
LIMIT 1
),
--similarly set values of MName and LName
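For clarity, a hedged sketch of the full UPDATE under this proposal, combining the SET sub-query above with the WHERE clause quoted from commit c45fc13 earlier in this thread (the actual commit may differ):
UPDATE Student
SET FName = (SELECT Name
             FROM HumanFirstNames
             WHERE FName LIKE '%'
               AND HumanFirstNames.ID = trunc(random() * numOfFirstNames + 1)
             LIMIT 1)
    --similarly set values of MName and LName
WHERE (FName ~* '[0-9a-f]{32}')
   OR (MName ~* '[0-9a-f]{32}')
   OR (LName ~* '[0-9a-f]{32}');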
This will leave the code with some known issues and thus likely unstable. Leave a clear note in the next commit that this branch is unstable so no one will use the branch in a build.
After the commit with the proposed change is made, all reviewers are welcome to comment with suggestions on how to address issues and documenting other issues. Also, anyone is welcome to make new commits to the branch provided the commits are verified and converge to the desired final state. If the branch has not made any progress by the time I return from vacation, I will see if I can continue the work.
@hunterSchloss: Please consider dismissing your earlier approval of this PR.
I just pushed commits ca676c8 and 59f59d6, which address the issues identified in the script humanizeStudent.sql. They also add a test directory containing a couple of test scripts and a README file.
FYI, the script humanizeStudent.sql has changed quite a bit and requires a thorough review: almost as if this is the first version of the file.
@smurthys Thank you for the commits and the testing scripts.
The new version of humanizeStudentData.sql works well, and I have not come across any noticeable errors from testing it on my instance of Postgres.
I like the interesting technique of limiting NULL values from being generated for a name field by running the function up to three times for each field (although I am not sure running a function three times for each name field to be humanized is efficient, given how many queries are executed; but if it is the best way to deal with the situation, it will have to remain).
The dynamic SQL portion also seems like a great way to reduce the overall size of the SQL file, compared to maintaining a separate function for each name field, if we were to implement it.
My one minor concern after running humanizeStudentData.sql is the consistency of whether a given name field ends up as NULL or as an empty string.
I know the script states that currentName will remain either NULL or an empty string after the script is executed if the original field contained either of those two values, but shouldn't the script change all NULLs to empty strings (or vice versa) in its output? It would seem helpful if a norm were chosen so that other .sql files can better anticipate what a name field contains.
@zbhujwala Thanks for the review.
The COALESCE function evaluates only the arguments it needs to:
COALESCE only evaluates the arguments that are needed to determine the result; that is, arguments to the right of the first non-null argument are not evaluated.
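A small, hedged illustration of this lazy evaluation (the function name here is made up for the example and is not part of the PR):
CREATE FUNCTION noisy_default() RETURNS text AS $$
BEGIN
   RAISE NOTICE 'noisy_default() was evaluated';
   RETURN 'N/A';
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT COALESCE(Name, noisy_default())
FROM (VALUES ('Alice'), (NULL)) AS t(Name);
--the NOTICE appears only once, for the NULL row, because the second argument
--is evaluated only when the first argument is NULL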
On replacing NULL with empty string or vice versa: humanization is an optional process, and indeed the process will never be run in a production system. Thus, if consistency is the reason to replace one value with another, the issue needs to be addressed independent of humanization and across the entire application.
BTW, data designers have debated this topic extensively for decades, and the simple pragmatic conclusion invariably is that NULL and empty string (in this case) mean different things. If an application wishes to replace NULL with empty string in a field, it should not at all permit NULL in that field.
Thanks @afig for your review and the recognition.
Commit d48f5ba fixes the issues @afig identified as well as a few other issues I spotted.
- Renamed humanizeStudentsData.sql to humanizeStudentData.sql.
- Changed the UPDATE to use fewer CASE statements.
- If fName is hashed, the other two name fields are hashed (accomplished by adding a WHERE on line 294).