IndEcol / IE_data_commons

Code and documentation for a commons of structured industrial ecology data
MIT License
23 stars 2 forks source link

How to deal with case-sensitive values? #11

Closed nheeren closed 5 years ago

nheeren commented 6 years ago

The input tables contain often redundant entries like "carbon-17" and "Carbon-17".

How do you want to deal with that? mySQL seems to be case sensitive, but you have to kind of force it to: https://stackoverflow.com/a/5629121/2075003

I think I would suggest to always force all comparisons to lowercase. However, that would require basically a redo of the database. What do you think?

For now I will manually fix all data input tables... -- see notes in #10 😠

stefanpauliuk commented 6 years ago

Lowercase is the default but for abbreviations and proper names we need upper cases (USA, PVC, etc.).

The substance list in 1_UPI_USLCI_Aluminum_cold_rolling_at_plant.xlsx was extracted from the US LCI database, and here as well as in other datasets, it is the responsibility of the data providers to make sure that the classification and classification items used and provided are correct (i.e., all of the same dimension and no redundancies or spelling mistakes).

In the case of 1_UPI_USLCI_Aluminum_cold_rolling_at_plant.xlsx, I was not careful enough when extracting the list. Hence, you just reject those data templates and I will need to fix the classifications in them ;)

nheeren commented 6 years ago

(Also referring to your email).

The problem is that we will keep running in annoying issues. Pandas, Python, and SQL use functions that are sometimes case sensitive and sometimes not.

For instance, it took me almost an hour to figure out why my check for existing classification_items would pass, but the database would reject the upload since it did not find the corresponding foreign key. To make things worse, normal SQL SELECT statements are not case-sensitive.

Therefore, I was thinking it might be simplest to force everything lowercase. Then there is a clear policy. I don't think there can ever be cases where the uppercase really matters. Would be good to have a feedback from someone with a lot of database experience. @konstantinstadler maybe?

konstantinstadler commented 6 years ago

Not much I can add here. From a user perspective I agree with @stefanpauliuk that if sometimes feels strange to add everything lowercase (e.g. ISO3 country abbreviations). Are you planning that users (data uploaders) directly interact with the database in SQL or do they always have to go through some script. In the latter I would recommend a "translating" function which converts everything to lowercase. This also gives you the possibility to further extend that functionality to other problems: U$ -> USD, convert countrynames to ISO3 or whatever, ... The translation could also be a standalone testing script which user are recommended to run and it logs all changes to the users - (s)he can than decide if the changes are reasonable

nheeren commented 5 years ago

We kept case-sensitivity intact. This will require attention by data providers... May need to reopen later...