carnegiequarry / digitalquarryproject

The Digital Quarry Project brings the fossils of Dinosaur National Monument's Carnegie Quarry online for researchers and the public.
http://www.carnegiequarry.com/modern-quarry/digital-quarry-all-modern-fossils.php#anchor
4 stars 5 forks source link

Fossil data should be normalized for DBMS #13

Open taraneier opened 9 years ago

taraneier commented 9 years ago

Taxa, fossil relationships, currentLocation and probably bone should be normalized into own tables with references back to fossil table. Depending on the data available measurements could/should be normalized as well.

taraneier commented 9 years ago

Here's my proposal for normalizing the database. I got off into the weeds with geologic information with an eye toward reusing this software for other quarries.

One thought I had on fossil groups is whether they are thought to represent an individual animal, or are there other groupings?

Either way, we can rework the data later as we learn more. This isn't set in cement, just a plaster jacket (har har).

db.pdf

skinjester commented 9 years ago

Hi Tara, I don't know very much about databases but have looked up the term "normalization" in that context. Would love for you to describe this a bit further. Can you speak a bit about how normalizing content into their own tables is a good design strategy?

taraneier commented 9 years ago

I should point out that this change will not change any user-facing display except for the addition of phylopic and / or bone on phylopic.

As I understand it the current data on the site is only a tenth of what the quarry has and the plan is to scale to include it all.

Normalized tables allow for greater data consistency, scalability, and storage efficiency. This link is a bit academic but it explains the basics. http://www.studytonight.com/dbms/database-normalization.php

Thea and I briefly discussed refining the software to the point it could be used for datasets from other sites and projects. Standard database practices would be needed at that point. Keys and indicies allow for faster queries as well--important as the data grows.

I have nearly finished the script to convert the data to the new form. It will be easy to tweak if you have suggestions for changes. I can post that later today after some testing.

skinjester commented 9 years ago

Thanks Tara. I think I get it. I can see the conceptual value of organizing (modeling?) data this way. Am I correct in assuming that normalized designs are also more flexible/more efficient to iterate over?

taraneier commented 9 years ago

Modelling is the correct term—the M in MVC. And yes your assumptions are correct..

I expect to automate all the steps needed to normalize the data. The following statements create the taxon table, then inserts the unique taxon values from the original table.

CREATE TABLE IF NOT EXISTS dino.taxon ( id INT NOT NULL AUTO_INCREMENT COMMENT '', name VARCHAR(45) NULL COMMENT '', phylopic VARCHAR(45) NULL COMMENT '', PRIMARY KEY (id) COMMENT '') ENGINE = InnoDB;

INSERT INTO taxon (name) SELECT DISTINCT(taxon) FROM modernQuarryFossilData;

On Oct 25, 2015, at 8:58 AM, Gary Boodhoo notifications@github.com wrote:

Thanks Tara. I think I get it. I can see the conceptual value of organizing (modeling?) data this way. Am I correct in assuming that normalized designs are also more flexible/more efficient to iterate over?

— Reply to this email directly or view it on GitHub.