PecanProject / BETYdb-YABA

Yet Another BETYdb API (for metadata upload)
BSD 3-Clause "New" or "Revised" License
5 stars 18 forks source link

Strip whitespace from beginning and end of all records #31

Open dlebauer opened 4 years ago

dlebauer commented 4 years ago

remove leading and trailing whitespace,

eg. this: ABC DEF ,Lactuca sativa,, should become ABC DEF,Lactuca sativa,,

for all fields

gsrohde commented 4 years ago

@dlebauer et al.: If you want normalized white space to be a database invariant for any particular columns, I strongly recommend implementing this at the database level with a trigger function. Then, no matter from what pathway the value of a column is inserted or updated, one can be assured that the white space adheres to some normal form.

For an example of how this might be done, you can look at the PL/pgSQL trigger function normalize_name_whitespace() which is called whenever a cultivar is inserted or updated: it takes the value for name given in the insert or update statement and strips leading and trailing whitespace and converts any internal sequence of whitespace characters (including tabs and newlines) to a single space character.

A major motivation for this was to give the uniqueness constraint unique_name_per_species more teeth: it doesn't do much good to ensure names are unique if we can still have what are essentially duplicates that vary only in the distribution of white space.

shivanshu1086 commented 4 years ago

Hi @dlebauer, Can I work on this?

dlebauer commented 4 years ago

@shivanshu1086 that would be great - step 1:

@gsrohde - based on your comment above, my understanding is that if you try to put a cultivars.csv with a white space added, the white space should automatically be stripped when it is inserted. (is that correct?). I am pretty sure that this didn't happen, which is why I wrote up this issue. please let me know if the following sounds reasonable:

@shivanshu1086 you could start by creating a reproducible error - see if you can use curl to insert this file cultivars_whitespace_test.csv into the database. If the strip whitespace function is working, it should fail because it violates a uniqueness constraint. If it works, you can query the cultivars table to ensure that the whitespace was stripped (e.g. the record should have the trailing whitespace removed and be converted from DP_075 to DP_075 in the database).

If that works, you can also check that the uniqueness constraint works by trying to upload this file cultivars_uniqueness_test.csv and expect that it will fail with an error that says it violates uniqueness constraint.

All of that just confirms the current behavior.

Next step would be to follow the pattern here: https://github.com/PecanProject/bety/blob/266dea4cd0e40446c0cdba09e1fd8318a3e341b2/db/migrate/20150202220519_add_uniqueness_constraints.rb#L52 image

to implement normalize_name_whitespace() on additional fields. If you can submit a pull request that applies this pattern to the name field of the treatments table as a first step, then during the pull request review we can discuss internally which fields and tables we want to apply this function to.

also, I've removed the "good first issue" tag since it was originally concieved of as working in the python / Flask API rather than as a database trigger function.

gsrohde commented 4 years ago

@dlebauer I tested this out on the bety7 machine with

update cultivars set name = '           Caddo  lots   of spaces        more spaces        ' where id = 7000000503;

(I probably don't have access to whatever machine you use.) The resulting row looked like this

7000000503 |      2588 | Caddo lots of spaces more spaces   |         |       | 2020-03-02 20:03:57.120819 | 2020-03-02 20:19:39.908616 |

You might want to try doing \d cultivars on whatever database you are using. Check for the three relevant triggers and constraints:

Indexes:
...
    "unique_name_per_species" UNIQUE CONSTRAINT, btree (name, specie_id)
...
Check constraints:
    "normalized_names" CHECK (is_whitespace_normalized(name::text))
...
Triggers:
    normalize_cultivar_names BEFORE INSERT OR UPDATE ON cultivars FOR EACH ROW EXECUTE PROCEDURE normalize_name_whitespace()
...
shivanshu1086 commented 4 years ago

ok @dlebauer , working on that!

shivanshu1086 commented 4 years ago

Hey @dlebauer @gsrohde , I have run these scripts by making a separate request for the cultivars_whitespace_test.csv.

Screenshot 2020-03-04 at 3 50 25 PM

When I am trying a curl request I am getting this error... Can anyone help me out with this?

shivanshu1086 commented 4 years ago

I have done with that error @dlebauer, Please have a look at what I have got with that.

Screenshot 2020-03-04 at 5 13 58 PM

That works, it has successfully inserted the cultivars_whitespace_test.csv into the database.

Then I tried to check the uniqueness constrains by adding the rows into cultivars_whitespace_test.csv from cultivars_uniqueness_test.csv. Then I came up with this.

Screenshot 2020-03-04 at 5 40 08 PM
shivanshu1086 commented 4 years ago

This is what we get from the database:

Screenshot 2020-03-04 at 7 12 15 PM
dlebauer commented 4 years ago

@gsrohde the database I was using has those triggers installed.

Very mysterious. One could try debugging it by putting output statements in the function called by the trigger function, making sure it gets called upon a qualifying event and making sure it does what it is supposed to do.

shivanshu1086 commented 4 years ago

@dlebauer please check #37 that can be helpful for checking the whitespaces for the database.