DASSL / Gradebook

Open-source product to provide a practical means for instructors to record student attendance and assessment

Added humanizeStudentData.sql #38

Closed zbhujwala closed 7 years ago

zbhujwala commented 7 years ago
smurthys commented 7 years ago

Great start @zbhujwala.

A few observations (some carried over from Summer DASSL):

EDIT: Removed the note on credits. I will send a reworked mail to all to provide some more guidance on the matter.

zbhujwala commented 7 years ago

I believe the LIMIT clause is unnecessary if the range problem is fixed, but please verify

I have tried omitting the LIMIT 1 clause, but I intermittently come across the error: ERROR: more than one row returned by a subquery used as an expression...

I have therefore left LIMIT 1 in the subquery.

smurthys commented 7 years ago

Yes @hunterSchloss, I mixed elements from the two formulae. To make the variable names and their usage consistent, use trunc(random()*k)+1 where k is exactly the count of rows in the table.

Of course, LIMIT 1 is required. Sorry about that.
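
A minimal sketch of the suggested technique (assuming a HumanLastNames table whose IDs run consecutively from 1 to numOfLastNames, the row count; table and column names are illustrative):

```sql
-- Sketch only: trunc(random() * numOfLastNames) + 1 yields an integer
-- in [1, numOfLastNames], so exactly one row matches when IDs are
-- consecutive with no gaps. LIMIT 1 still guards the single-row
-- requirement when the sub-query is used as an expression.
SELECT Name
FROM HumanLastNames
WHERE ID = trunc(random() * numOfLastNames) + 1
LIMIT 1;
```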

(In conclusion, I am glad we have multiple reviewers. Also, I need that vacation soon.)

A few other observations:

Here is an example of a block comment at the beginning of a query:

-- Humanize every student name that is presently a hash.
-- A student name is considered a hash if it is made up of only hex digits;
-- combine all name parts before testing because the schema permits NULLs in all parts.
-- Each sub-query picks one name at random from an appropriate table of names;
-- predicates such as FName LIKE '%' in the sub-queries force the sub-query to be
-- executed once for each student row (creating "correlated sub-queries")
smurthys commented 7 years ago

This code now reads so well. Thanks @zbhujwala for your effort.

I have some observations to share, but unfortunately cannot write them presently because I have to attend to other matters before my travel begins. I will look at this conversation again soon, and I am certain other reviewers (or even @zbhujwala himself) will have chimed in by then.

afig commented 7 years ago

Nice work on this file @zbhujwala. All of the changes addressed so far look great.

I have some observations on the latest version of the file (the one in commit e20ffc6); some of these observations may have been noted earlier:

The rest of the observations are line specific:

smurthys commented 7 years ago

In addition to @afig's observations:

On the logic of testing: The idea is the script can be safely run on Student table anytime and it will detect and humanize only rows that need to be humanized, that is, detect rows with hashes in all name parts. The question then is how to detect if a name has hashes. The proposal is to see if all name parts are composed of only hex digits with the rationale that no human name would pass that test. I worry I may be missing something.

For this purpose, it is necessary and sufficient to test if concat(FName, MName, LName) contains only hex digits. (concat ignores NULL values, so no need to COALESCE).

Yes, the current predicate does not test if every character is a hex digit; it should. And, it should test 0-9, a-f, and A-F.
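
A minimal sketch of that detection test, assuming a Student table with FName, MName, and LName columns (StudentID is a hypothetical column name):

```sql
-- Sketch only: concat() ignores NULL arguments, so no COALESCE is
-- needed. The anchored pattern requires every character to be a hex
-- digit, and ~* makes the match case-insensitive (covers a-f and A-F).
SELECT StudentID
FROM Student
WHERE concat(FName, MName, LName) ~* '^[0-9a-f]+$';
```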

zbhujwala commented 7 years ago

Some notes about commit c45fc13:

Changed regex in order to find appropriate MD5 hashed values of name fields

The current version of this file uses the regular expression FName ~* '[0-9a-f]{32}'. By using ~*, the expression becomes case-insensitive and matches either a-f or A-F. The reason for using {32} is that an MD5 output string is always 32 characters long.

The WHERE clause on the UPDATE is now

WHERE (FName ~* '[0-9a-f]{32}')
        OR (MName ~* '[0-9a-f]{32}')
        OR (LName ~* '[0-9a-f]{32}');

By using this expression, we are able to handle cases where one (or more) of the name fields is NULL. Dr. Murthy suggested using concat() on all name fields, but doing so would break the part of the regex that checks whether the string length matches the hash length.

Seriously, what does the sub-query return if FName is NULL? Note that the WHERE clause uses the AND operator.

With the new WHERE clause, if one field is NULL but another field contains an MD5 hash, the student row will be updated to have a human name. If a name field is NULL but the other fields contain "human" names, the row is not altered (this SQL procedure is idempotent).

Changed the way subqueries in the UPDATE are made volatile

The subqueries now use SELECT DISTINCT trunc(random() * numOfLastNames + 1) * (length(LName) * 0 + 1). The SELECT DISTINCT combats the problem that the WHERE clause was unpredictable, returning anywhere from zero to many rows when called. As for the expression * (length(LName) * 0 + 1), this is another dependency that makes the subquery correlated with the outer query; in effect, all this expression does is multiply the generated random number by 1.
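
The no-op dependency can be sketched as follows (a simplified fragment; table and column names are illustrative, and note that length(LName) is NULL when LName is NULL, which makes the whole ID expression NULL for such rows):

```sql
-- Sketch only: length(LName) * 0 + 1 evaluates to 1 for any non-NULL
-- LName, but the reference to the outer column LName correlates the
-- sub-query with the outer UPDATE, so it is re-evaluated per row.
LName = (SELECT Name
         FROM HumanLastNames
         WHERE ID = trunc(random() * numOfLastNames + 1)
                    * (length(LName) * 0 + 1)
        )
```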

EDIT: Formatting

afig commented 7 years ago

Thanks for addressing the issues. All of the changes seem like significant improvements over the previous version of the file.

I have a couple of optimizations that can be considered, which I will add either later tonight or tomorrow once I get a chance to review the file more in-depth.

afig commented 7 years ago

I agree that fields which were NULL should remain NULL. The current solution takes this into account, which is great. We just have to make sure any future solutions maintain this behavior.

Great work on the new comments, they summarize the behavior very well.

Here are a few points to consider that can each improve the readability and/or efficiency of the script:

smurthys commented 7 years ago

I will leave the rest of the discussion to others, but a quick note on the OR expression in the final WHERE clause:

Consider testing the string COALESCE(FName, LName, MName) instead of testing each name part in an OR expression, because SQL does not do Boolean short-circuiting. Also, this expression can still be tested for 32 characters.

I am a bit apprehensive about testing for 32 characters, because that test assumes MD5, but it is acceptable for now.

EDIT: My suggestion to test COALESCE(FName, LName, MName) has merit, but requires some guarantees from other parts of the system and I don't think those guarantees are in place yet. So, for now, please ignore pretty much the entirety of this comment. We can revisit this if necessary after this PR is merged.

zbhujwala commented 7 years ago

Was there an issue with the previous LIKE '%'? The current workaround to the bug in Postgres makes it harder to see that it does not have any side-effects.

A couple issues with LIKE '%' were:

smurthys commented 7 years ago

@zbhujwala thank you for your effort.

I am having difficulty justifying a correlated sub-sub-query as a solution. Also, addressing the issues discussed in this conversation requires a holistic approach backed by well-planned unit tests. I worry that by now this PR has taken too much of your time.

In the interest of getting this conversation to converge and this PR towards merging, I propose changing the file to look like that in commit c45fc13, but change the sub-queries in the SET clause to look as follows:

FName = (SELECT Name
         FROM HumanFirstNames
         WHERE FName LIKE '%'
         AND HumanFirstNames.ID = trunc(random() * numOfFirstNames + 1)
         LIMIT 1
        ),
--similarly set values of MName and LName

This will leave the code with some known issues, and thus the branch will likely be unstable. Leave a clear note in the next commit that this branch is unstable so that no one uses it in a build.

After the commit with the proposed change is made, all reviewers are welcome to comment with suggestions on how to address issues and documenting other issues. Also, anyone is welcome to make new commits to the branch provided the commits are verified and converge to the desired final state. If the branch has not made any progress by the time I return from vacation, I will see if I can continue the work.

@hunterSchloss: Please consider dismissing your earlier approval of this PR.

smurthys commented 7 years ago

I just pushed commits ca676c8 and 59f59d6, which address the issues identified in the script humanizeStudentData.sql. They also add a test directory containing a couple of test scripts and a README file.

FYI, the script humanizeStudentData.sql has changed quite a bit and requires a thorough review: almost as if this is the first version of the file.

zbhujwala commented 7 years ago

@smurthys Thank you for the commits and the testing scripts.

The new version of humanizeStudentData.sql works well and I have not come across any noticeable errors from testing it on my instance of Postgres.

I like the interesting technique for limiting NULL values from being generated for a name field by running the function up to three times per field. (I am not sure that running a function three times for each name field to be humanized is efficient, given how many queries are executed, but if it is the best way to handle the situation it will have to remain.)

The dynamic SQL portion also seems like a great way to compact the overall size of the SQL file, rather than maintaining a separate function for each name field, if we were to implement it.

My one minor concern after running humanizeStudentData.sql is the consistency of a name field being NULL versus an empty string. I know the script states that currentName will remain either NULL or an empty string after execution if the original field contained either of those values, but shouldn't the script normalize all NULLs to empty strings (or vice versa) in its output? It would be helpful to choose a norm so that other .sql files can better predict what a name field contains.

smurthys commented 7 years ago

@zbhujwala Thanks for the review.

The COALESCE function evaluates only the arguments it needs to:

COALESCE only evaluates the arguments that are needed to determine the result; that is, arguments to the right of the first non-null argument are not evaluated.
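
For illustration only (expensive_lookup and the ID column are hypothetical, not part of the script):

```sql
-- expensive_lookup(ID) is evaluated only for rows where FName IS NULL;
-- for all other rows, COALESCE stops at its first argument.
SELECT COALESCE(FName, expensive_lookup(ID))
FROM Student;
```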

On replacing NULL with empty string or vice versa: Humanization is an optional process, and indeed the process will never be run in a production system. Thus, if consistency is the reason to replace one value with another, the issue needs to be addressed independent of humanization and across the entire application.

BTW, data designers have debated this topic extensively for decades and the simple pragmatic conclusion invariably is that NULL and empty string (in this case) mean different things. If an application wishes to replace NULL with empty string in a field, it should not at all permit NULL in that field.

smurthys commented 7 years ago

Thanks @afig for your review and the recognition.

Commit d48f5ba fixes the issues @afig identified as well as a few other issues I spotted.