chembl / GLaDOS

Web Interface for ChEMBL @ EMBL-EBI
https://www.ebi.ac.uk/chembl/
Other
46 stars 6 forks source link

ChEMBL relational database ER diagram #1306

Open bioada opened 1 year ago

bioada commented 1 year ago

Hello,

Unfortunately, the ER diagram does not follow the Normal Form rules and has many errors. It means the resultsets from the JOIN queries are not reliable.

Best, Saed

bioada commented 1 year ago

Before I explain any specific normalization issue with the ChEMBL ER diagram, I define a few terms for clarification.

Normalization: A process of organizing the data in database to avoid data redundancy and insertion, update and deletion anomaly.

First Normal Form (1NF): There should be only one value in an intersection between a row (entity/record/tuple) and a column (attribute/field)

2NF: 1NF and no partial dependencies. In a table with a combined primary keys, the non-prime attributes should have dependency to all keys in the primary keys.

3NF: 2NF and no transitive dependencies. There should not be a dependency between non-prime attributes

4NF: 3NF and no multivalued dependencies. There should not be more than one many-to-many relationship in a table.

5NF: 4NF and by joining the normalized tables we should not lose or gain rows

bioada commented 1 year ago

Uniqueness of Rows: Before even starting the process of normalization, we need to make sure every row in a table is unique. This uniqueness MUST be based on a set attributes (columns/fields) and NOT based on a proxy attribute such as unique integer or string keys. Unfortunately, this is a mistake in many relational databases. Let me explain it with an example.

We have a table called Winner with two columns (FirstName and LastName) and the following records: Michael, Jordan Mohammad, Ali Michael, Jordan Tim, Clark Milos, Vesna Michael, Jordan

Now, to make it unique for add another column called WinnerID as integer and we make it the primary key.

1, Michael, Jordan 2, Mohammad, Ali 3, Michael, Jordan 4, Tim, Clark 5, Milos, Vesna 6, Michael, Jordan

Now, if the winner number is 6 and we call the winner "Michael Jordan", the question is which "Michael Jordan"? If you say we send the WinnerID to every individual, it means you have a hidden knowledge about every person uniqueness but your table does not know.

WinnerID is a proxy unique key and is nothing to do with a real person and should not be created before defining a set of non-proxy attributes as a unique key (candidate key).

bioada commented 1 year ago

Problem:No unique key(s)

Table: target_dictionary

Proxy Columns: tid and chembl_id are proxy unique keys and cannot be used for confirming uniqueness of rows in the table.

A combined key of the following columns still is not unique: target_type pref_name
tax_id
organism
species_group_flag

Conclusion: target_dictionary does not have unique rows and cannot be included in an ER diagram (a relational database)

Unfortunately, there are many more tables in the ChEMBL database that have the same above issue.

When we proved that all records in target_dictionary table are unique, then we are allowed to use "tid" or "chembl_id" as a proxy primary key. anywhere in the ER diagram.

bioada commented 1 year ago

Why there is a table in a relational database only with one column? table: actiivity_smid column: smid