Open dlebauer opened 4 years ago
@dlebauer et al.: If you want normalized white space to be a database invariant for any particular columns, I strongly recommend implementing this at the database level with a trigger function. Then, no matter from what pathway the value of a column is inserted or updated, one can be assured that the white space adheres to some normal form.
For an example of how this might be done, you can look at the PL/pgSQL trigger function normalize_name_whitespace()
which is called whenever a cultivar is inserted or updated: it takes the value for name
given in the insert or update statement and strips leading and trailing whitespace and converts any internal sequence of whitespace characters (including tabs and newlines) to a single space character.
A major motivation for this was to give the uniqueness constraint unique_name_per_species
more teeth: it doesn't do much good to ensure names are unique if we can still have what are essentially duplicates that vary only in the distribution of white space.
Hi @dlebauer, Can I work on this?
@shivanshu1086 that would be great - step 1:
@gsrohde - based on your comment above, my understanding is that if you try to put a cultivars.csv with a white space added, the white space should automatically be stripped when it is inserted. (is that correct?). I am pretty sure that this didn't happen, which is why I wrote up this issue. please let me know if the following sounds reasonable:
@shivanshu1086 you could start by creating a reproducible error - see if you can use curl to insert this file cultivars_whitespace_test.csv into the database. If the strip whitespace function is working, it should fail because it violates a uniqueness constraint. If it works, you can query the cultivars table to ensure that the whitespace was stripped (e.g. the record should have the trailing whitespace removed and be converted from DP_075
to DP_075
in the database).
If that works, you can also check that the uniqueness constraint works by trying to upload this file cultivars_uniqueness_test.csv and expect that it will fail with an error that says it violates uniqueness constraint.
All of that just confirms the current behavior.
Next step would be to follow the pattern here: https://github.com/PecanProject/bety/blob/266dea4cd0e40446c0cdba09e1fd8318a3e341b2/db/migrate/20150202220519_add_uniqueness_constraints.rb#L52
to implement normalize_name_whitespace() on additional fields. If you can submit a pull request that applies this pattern to the name field of the treatments table as a first step, then during the pull request review we can discuss internally which fields and tables we want to apply this function to.
also, I've removed the "good first issue" tag since it was originally concieved of as working in the python / Flask API rather than as a database trigger function.
@dlebauer I tested this out on the bety7 machine with
update cultivars set name = ' Caddo lots of spaces more spaces ' where id = 7000000503;
(I probably don't have access to whatever machine you use.) The resulting row looked like this
7000000503 | 2588 | Caddo lots of spaces more spaces | | | 2020-03-02 20:03:57.120819 | 2020-03-02 20:19:39.908616 |
You might want to try doing \d cultivars
on whatever database you are using. Check for the three relevant triggers and constraints:
Indexes:
...
"unique_name_per_species" UNIQUE CONSTRAINT, btree (name, specie_id)
...
Check constraints:
"normalized_names" CHECK (is_whitespace_normalized(name::text))
...
Triggers:
normalize_cultivar_names BEFORE INSERT OR UPDATE ON cultivars FOR EACH ROW EXECUTE PROCEDURE normalize_name_whitespace()
...
ok @dlebauer , working on that!
Hey @dlebauer @gsrohde , I have run these scripts by making a separate request for the cultivars_whitespace_test.csv.
When I am trying a curl request I am getting this error... Can anyone help me out with this?
I have done with that error @dlebauer, Please have a look at what I have got with that.
That works, it has successfully inserted the cultivars_whitespace_test.csv into the database.
Then I tried to check the uniqueness constrains by adding the rows into cultivars_whitespace_test.csv from cultivars_uniqueness_test.csv. Then I came up with this.
This is what we get from the database:
@gsrohde the database I was using has those triggers installed.
Very mysterious. One could try debugging it by putting output statements in the function called by the trigger function, making sure it gets called upon a qualifying event and making sure it does what it is supposed to do.
@dlebauer please check #37 that can be helpful for checking the whitespaces for the database.
remove leading and trailing whitespace,
eg. this:
ABC DEF ,Lactuca sativa,,
should becomeABC DEF,Lactuca sativa,,
for all fields