Great start @zbhujwala.
A few observations (some carried over from Summer DASSL):
- Look at other .sql files for some formatting approaches (including the file prologue).
- The UPDATE statement's criterion "contains a number [0-9]" is inconsistent with the WHERE predicate used.
- Use ORDER BY trunc(random()*k+1), where k is one less than the count of rows in the appropriate table.
- Place k in variables and use the variables instead of using constants.
- The LIMIT clause is unnecessary if the range problem is fixed, but please verify.
- Check the behavior of random (I haven't carefully read that function's documentation).
- Use NULL for the "40%".
EDIT: Removed the note on credits. I will send a reworked mail to all to provide some more guidance on the matter.
I believe the LIMIT clause is unnecessary if the range problem is fixed, but please verify
I have tried omitting the LIMIT 1 clause, but I intermittently come across this error:
ERROR: more than one row returned by a subquery used as an expression...
I have therefore left LIMIT 1 in the subquery.
Yes @hunterSchloss, I mixed elements from the two formulae. To make the variable names and their usage consistent, use trunc(random()*k)+1, where k is exactly the count of rows in the table.
Of course, LIMIT 1 is required. Sorry about that.
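For concreteness, a minimal sketch of a SET sub-query that follows the corrected formula (the identifiers HumanFirstNames, Name, ID, and numOfFirstNames are borrowed from snippets later in this thread; numOfFirstNames is assumed to hold the table's row count):
FName = (SELECT Name
         FROM HumanFirstNames
         WHERE HumanFirstNames.ID = trunc(random() * numOfFirstNames) + 1
         LIMIT 1)   --LIMIT 1 guards against the sub-query returning more than one row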
(In conclusion, I am glad we have multiple reviewers. Also, I need that vacation soon.)
A few other observations:
- Change the id field to SERIAL so it is not necessary to specify a value in INSERT statements, and rows can be added/moved around without editing id values.
- I have created a gist reflecting the id field's type. Hopefully this will reduce the work to do. FYI, I will delete this gist after this PR is merged.
- Use INTEGER.
- Use COUNT(*) to be consistent with the formula shown at the beginning of this comment.
- The UPDATE query uses incorrect cases for column names: for example, fname instead of FName.
- The Student table permits NULL in all three name parts and enforces that at least one name part is NOT NULL. So the predicate must test a concatenation of the three name parts: use the concat function, not the || operator.
- What does the sub-query return if a name part (for example, FName) is NULL?
- Consider this WHERE clause in the UPDATE query: WHERE fName ~ '[0-9]'; --test if name is a hash
Here is an example of a block comment at the beginning of a query:
--Humanize every student name that is presently a hash
-- A student name is considered a hash if it is made up of only hex digits;
-- combine all name parts before testing because the schema permits NULLs in all parts.
-- Each sub-query picks one name at random from an appropriate table of names;
-- predicates such as FName LIKE '%' in the sub-queries force the sub-query to be
-- executed once for each student row (create "correlated sub-queries")
This code now reads so well. Thanks @zbhujwala for your effort.
I have some observations to share, but unfortunately cannot write them presently because I have to attend to other matters before my travel begins. I will look at this conversation again soon, and I am certain other reviewers (or even @zbhujwala himself) will have chimed in by then.
Nice work on this file @zbhujwala. All of the changes addressed so far look great.
I have some observations on the latest version of the file (the one in commit e20ffc6); some of these observations may have been noted earlier:
There seem to be some inconsistencies between comments as to the behavior of the script (what is changed and when). From what I have interpreted, the script currently replaces all three names if any of the three names contains a hexadecimal character. One or more comments state a different behavior.
I am wondering if there is a small possibility of no names being picked from the list of human names. The same behavior that (sometimes) causes an error when LIMIT is omitted seems like it could lead to this happening.
The rest of the observations are line specific:
L2: I have not really made sufficient contributions to this task to warrant my name being listed in the credits, so it can be removed.
L17: "This file works on the assumption that the Student table is populated with either hashed or non-hashed names in fName, mName, and lName" The comment is not clear as to the requirements it is stating. It seems like the comment is simply requiring that the Student table must be populated. It might also be excluding the possibility of partially-hashed names. Either way, it can be re-written better convey its message.
L44, L146, and L173: These comments can be reworded to a different tense and to give the exact name of the table that values are being inserted into. For example, "Insert 100 random first names into HumanFirstNames" rather than "Inserting 100 random first names in first names table".
L293: Like the other comments, this one can be clarified a bit.
L295/296: It seems like we are making the subquery volatile rather than non-volatile like the comment states.
L301, L306, and L311: Might want to use .ID instead of .id to match the new column name introduced in the last commit.
L315: I believe @smurthys had mentioned something about making sure each name consists of only hex characters. Currently, the expression matches if any character is a digit or one of the letters A-F. The latter is a problem because non-hashed names can clearly contain the letters A-F. Also, we want to make sure we also match a-f (lowercase), since that is what the original anonymized data had.
In addition to @afig's observations:
- What does the sub-query return if FName is NULL? Note that the WHERE clause uses the AND operator.
- Use ID instead of id; use Name instead of name.
- On the logic of testing: the idea is that the script can be safely run on the Student table anytime and it will detect and humanize only rows that need to be humanized, that is, detect rows with hashes in all name parts. The question then is how to detect if a name has hashes. The proposal is to see if all name parts are composed of only hex digits, with the rationale that no human name would pass that test. I worry I may be missing something.
For this purpose, it is necessary and sufficient to test if concat(FName, MName, LName) contains only hex digits. (concat ignores NULL values, so there is no need for COALESCE.)
Yes, the current predicate does not test if every character is a hex digit; it should. And it should test 0-9, a-f, and A-F.
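A minimal sketch of such a predicate (not from the PR; column names follow the snippets above). The anchored regular expression accepts a name only if every character is a hex digit; the case-insensitive ~* operator could be used instead of listing both letter cases:
--true only when the combined name is non-empty and made up solely of hex digits
WHERE concat(FName, MName, LName) ~ '^[0-9a-fA-F]+$'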
Some notes about commit c45fc13:
Changed regex in order to find appropriate MD5 hashed values of name fields
The current version of this file uses the regular expression FName ~* '[0-9a-f]{32}'. By using ~*, the expression becomes case insensitive and matches either a-f or A-F. The reason for using {32} is that an MD5 output string is always 32 characters long.
The WHERE clause on the UPDATE is now:
WHERE (FName ~* '[0-9a-f]{32}')
OR (MName ~* '[0-9a-f]{32}')
OR (LName ~* '[0-9a-f]{32}');
By using this expression, we are able to handle cases where one (or more) of the name fields are NULL. Dr. Murthy suggested using concat() on all name fields, but doing so would change the regex where it checks that the string length matches the corresponding hash.
Seriously, what does the sub-query return if FName is NULL? Note that the WHERE clause uses the AND operator.
By using the new WHERE clause, if a field is NULL but the other fields contain an MD5 hash, the student will be updated to have a human name. If a row has a NULL name field but the other fields contain a "human" name, the field will not be altered (this SQL procedure is idempotent).
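One quick way to check that idempotence claim (a hedged sketch, not one of the PR's test scripts): snapshot the table after one run of the script, run the script a second time, and verify nothing changed.
CREATE TEMP TABLE studentSnapshot AS SELECT * FROM Student;
--run the humanization script a second time here, then:
SELECT * FROM Student
EXCEPT
SELECT * FROM studentSnapshot;   --should return zero rows if the script is idempotent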
Changed the way subqueries in the UPDATE are made volatile
The subqueries now use SELECT DISTINCT trunc(random() * numOfLastNames + 1) * (length(LName) * 0 + 1). The purpose of the SELECT DISTINCT was to combat the problem that the WHERE clause would be unpredictable and return 0 to many rows when called.
As for the expression * (length(LName) * 0 + 1), this is another dependency to make the subquery correlated to the outer query. In effect, all this expression does is multiply the random number that is generated by 1.
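For reference, a hedged sketch of how one such SET sub-query might look under this description (the identifiers HumanLastNames, Name, ID, and numOfLastNames are taken from snippets in this thread; the actual code in commit c45fc13 may differ):
LName = (SELECT Name
         FROM HumanLastNames
         WHERE HumanLastNames.ID = (SELECT DISTINCT trunc(random() * numOfLastNames + 1)
                                           * (length(LName) * 0 + 1))   --length(LName) refers to the Student row being updated
         LIMIT 1)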
EDIT: Formatting
Thanks for addressing the issues. All changes seem like significant improvements over the previous version of the file.
I have a couple of optimizations that can be considered, which I will add either later tonight or tomorrow once I get a chance to review the file more in-depth.
I agree that fields which were NULL should remain NULL. The current solution takes this into account, which is great. We just have to make sure any future solutions maintain this behavior.
Great work on the new comments, they summarize the behavior very well.
Here are a few points to consider that can each improve the readability and/or efficiency of the script
Using the case-insensitive operator was a good way of resolving that issue, and one that I had forgotten about originally. Given that, there may be an easier way to check for a match on all characters by using operators to do the opposite of both original behaviors. Essentially, use a ! in front of Postgres' regex matching operator to check for a non-match, and see if there is a way to express the opposite regular expression. For example, the original condition was along the lines of 'full name' ~ '[0-9A-F]'; the new one could be similar to 'full name' !~* '![0-9A-F]'. The actual "negation operator" for the regular expression is different, and might require a different syntax altogether. However, the concept stands: match any character that is not 0-9, a-f, or A-F, because the !~* operator will "undo" that so that it matches only strings that contain those characters. This solution will also allow the three name parts to be concatenated, and allow the script to work on hashed name parts that do not contain exactly 32 characters (a new restriction introduced by the latest commit).
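A concrete, hedged sketch of that idea (not from the PR; in Postgres the negation goes inside the character class rather than in front of it):
--true when the combined name contains no character outside 0-9, a-f, A-F
WHERE concat(FName, MName, LName) !~ '[^0-9a-fA-F]'
  AND concat(FName, MName, LName) <> ''   --guard: an all-NULL (empty) name is not a hash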
I do not see how using a SELECT DISTINCT will guarantee that only one row is returned, since it simply removes duplicate values. If the current implementation guarantees returning only one name, then some other facility must be providing that guarantee, unless I am mistaken and the multiple names that were returned were all the same name. However, even that would not explain how it fixes the situation where no names were returned. Additionally, the current location of the DISTINCT does not appear to have any effect, since the subquery already guaranteed a scalar value.
Was there an issue with the previous LIKE '%'? The current workaround to the bug in Postgres makes it harder to see that it does not have any side-effects.
I will leave the rest of the discussion to others, but a quick note on the OR expression in the final WHERE clause:
Consider testing the string COALESCE(FName, LName, MName) instead of testing each name part in an OR expression, because SQL does not do Boolean short-circuiting. Also, this expression can still be tested for 32 characters.
I am a bit apprehensive about testing for 32 characters, because that test assumes MD5, but it is acceptable for now.
EDIT: My suggestion to test COALESCE(FName, LName, MName) has merit, but requires some guarantees from other parts of the system, and I don't think those guarantees are in place yet. So, for now, please ignore pretty much the entirety of this comment. We can revisit this if necessary after this PR is merged.
Was there an issue with the previous LIKE '%'? The current workaround to the bug in Postgres makes it harder to see that it does not have any side-effects.
A couple of issues with LIKE '%' were:
- A name field could be NULL, and therefore LIKE '%' would not be able to pick up those rows.
- The WHERE clause needed a subquery to guarantee only one row would be returned (if a subquery was not used, the random generator would be unpredictable and return 0 to many rows). The subquery must yet again utilize a method to become volatile, hence trunc(random() * numOfNames + 1).
@zbhujwala thank you for your effort.
I am having difficulty justifying a correlated sub-sub-query as a solution. Also, addressing the issues discussed in this conversation requires a holistic approach backed by well-planned unit tests. I worry that by now this PR has taken too much of your time.
In the interest of getting this conversation to converge and this PR towards merging, I propose changing the file to look like that in commit c45fc13, but change the sub-queries in the SET clause to look as follows:
FName = (SELECT Name
FROM HumanFirstNames
WHERE FName LIKE '%'
AND HumanFirstNames.ID = trunc(random() * numOfFirstNames + 1)
LIMIT 1
),
--similarly set values of MName and LName
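For clarity, a hedged sketch of the full UPDATE under this proposal, combining the SET sub-query above with the WHERE clause quoted from commit c45fc13 earlier in this thread (the actual commit may differ):
UPDATE Student
SET FName = (SELECT Name
             FROM HumanFirstNames
             WHERE FName LIKE '%'
               AND HumanFirstNames.ID = trunc(random() * numOfFirstNames + 1)
             LIMIT 1)
    --similarly set values of MName and LName
WHERE (FName ~* '[0-9a-f]{32}')
   OR (MName ~* '[0-9a-f]{32}')
   OR (LName ~* '[0-9a-f]{32}');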
This will leave the code with some known issues and thus likely unstable. Leave a clear note in the next commit that this branch is unstable so no one will use the branch in a build.
After the commit with the proposed change is made, all reviewers are welcome to comment with suggestions on how to address issues and documenting other issues. Also, anyone is welcome to make new commits to the branch provided the commits are verified and converge to the desired final state. If the branch has not made any progress by the time I return from vacation, I will see if I can continue the work.
@hunterSchloss: Please consider dismissing your earlier approval of this PR.
I just pushed commits ca676c8 and 59f59d6, which address the issues identified in the script humanizeStudent.sql. They also add a test directory containing a couple of test scripts and a README file.
FYI, the script humanizeStudent.sql has changed quite a bit and requires a thorough review: almost as if this is the first version of the file.
@smurthys Thank you for the commits and the testing scripts.
The new version of humanizeStudentData.sql works well, and I have not come across any noticeable errors from testing it on my instance of Postgres.
I like the interesting technique of limiting NULL values from being generated for a name field by running the function up to three times for each field (although I am not sure running a function three times for each name field to be humanized is efficient, given how many queries are executed; but if it is the best way to deal with the situation, it will have to remain).
The dynamic SQL portion also seems like a great way to reduce the overall size of the SQL file, compared to maintaining a separate function for each name field, if we were to implement it.
My one minor concern after running humanizeStudentData.sql is the consistency of whether a given name field ends up as NULL or as an empty string.
I know the script states that currentName will remain either NULL or an empty string after the script is executed if the original field contained either of those two values, but shouldn't the script change all NULLs to empty strings (or vice versa) in its output? It would seem helpful if a norm were chosen so that other .sql files can better anticipate what a name field contains.
@zbhujwala Thanks for the review.
The COALESCE function evaluates only the arguments it needs to:
COALESCE only evaluates the arguments that are needed to determine the result; that is, arguments to the right of the first non-null argument are not evaluated.
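A small, hedged illustration of this lazy evaluation (the function name here is made up for the example and is not part of the PR):
CREATE FUNCTION noisy_default() RETURNS text AS $$
BEGIN
   RAISE NOTICE 'noisy_default() was evaluated';
   RETURN 'N/A';
END;
$$ LANGUAGE plpgsql VOLATILE;

SELECT COALESCE(Name, noisy_default())
FROM (VALUES ('Alice'), (NULL)) AS t(Name);
--the NOTICE appears only once, for the NULL row, because the second argument
--is evaluated only when the first argument is NULL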
On replacing NULL with empty string or vice versa: humanization is an optional process, and indeed the process will never be run in a production system. Thus, if consistency is the reason to replace one value with another, the issue needs to be addressed independent of humanization and across the entire application.
BTW, data designers have debated this topic extensively for decades, and the simple pragmatic conclusion invariably is that NULL and empty string (in this case) mean different things. If an application wishes to replace NULL with empty string in a field, it should not at all permit NULL in that field.
Thanks @afig for your review and the recognition.
Commit d48f5ba fixes the issues @afig identified as well as a few other issues I spotted.
- Renamed humanizeStudentsData.sql to humanizeStudentData.sql.
- Changed the UPDATE to use fewer CASE statements.
- If fName is hashed, the other two name fields are hashed (accomplished by adding a WHERE on line 294).