PecanProject / bety

Web-interface to the Biofuel Ecophysiological Traits and Yields Database (used by PEcAn and TERRA REF)
https://www.betydb.org
BSD 3-Clause "New" or "Revised" License
16 stars 38 forks source link

species value constraints: needed cleanup and decisions #247

Open gsrohde opened 9 years ago

gsrohde commented 9 years ago

Constraints

spcd

Current min and max are 0 and 1007.

genus

We would like to require genus ~ '^([A-Z][-a-z]*)?$'. Use

SELECT * FROM species WHERE NOT genus ~ '^([A-Z][-a-z]*)?$';

to view outliers.

species

We will use the following restriction: species should be zero or more space-or-hyphen-separated groups of capital letters followed by a period, sequences of two or more letters possibly followed by a period, ampersands, and times symbols. The check looks like this:

CHECK (species ~ '^(([A-Z]\.|[a-zA-Z]{2,}\.?|&|\u00d7)( |-|$))*$')

Note that the multiplication sign (×) is the preferred symbol to use for naming hybrids (see http://en.wikipedia.org/wiki/Multiplication_sign) and not the letter "x". Since standardizing the spelling (including symbols used and whitespace) of names will help prevent duplicates, we should use one or the other and not use them interchangeably when storing names in the database. And we may as well use the preferred symbol. Since users will often use the letter "x" in place of the preferred symbol, a trigger function can be used to automatically correct this during INSERTs and UPDATEs.

scientificname

Generally, we would like the scientificname to be the full binomial name of the species, and so usually it should match genus || ' ' || species, possibly followed by some qualification (var. ... or ssp. ..., for example). The query

SELECT scientificname, genus, species FROM species WHERE scientificname !~ FORMAT('^%s %s', genus, species) AND species != '';

shows 192 cases where this match fails for some reason other than that the species has been left blank. (Note that the species is blank in all but 1729 of the 50006 rows of the table!) Sometimes this is due to misspellings, but often the genus names are totally different. (Perhaps some of these are synonyms.)

It might be thought that at the very least we can require that the scientificname column value should textually include both the genus value and the species value, but even this fails in case like scientificname = "P. balsamifera x P. simonii", genus = "Populus", species = "balsamifera x simonii".

UPDATE 9/15/2015

This query:

SELECT scientificname, genus, species FROM species WHERE strpos(scientificname, genus) = 0 or strpos(scientificname, species) = 0;

shows there are 141 rows where scientificname does not textually include both genus and species. In some cases this is because the genus is abbreviated in scientificname. This should perhaps be allowed. Even so, the query

SELECT scientificname, genus, species FROM species WHERE (strpos(scientificname, genus) = 0 or strpos(scientificname, species) = 0) and strpos(scientificname, '.') = 0;

shows there are 91 violations for other reasons.

Other columns

Constraints on Kingdom, Division, Class, and Family names should probably be similar to that for genus, but we won't concern ourselves with any of the columns to the right of the updated_at column for now.

dlebauer commented 9 years ago

I've updated the task description (added notes below the bullets) to clarify where needed

gsrohde commented 9 years ago

@dlebauer With the updated genus constraint (allowing hyphens), there are two violations (see updated query in "Details" section above). One has "Psychotria grandis" as the genus, and is probably a duplicate. The other has "NULL" (the string, not a real NULL) as the genus and "palm, other" as the common name. If you clean these up, I can add this constraint to the current batch of value constraints.

Note that there are 165 rows where genus = '', but we're allowing that.

dlebauer commented 9 years ago

I've fixed these. You are certainly welcome to go ahead and apply the simple fixes - removing a text string "NULL" seems pretty clear; the Genus also has a clear resolution (I suspect moving the genus name to the genus column, setting scientificname as Psychotria grandis-duplicate but you are in a better position to know how to do this).

If you have any doubt about whether to make a change, feel free to act first, ask later.

On Thu, Feb 26, 2015 at 11:02 AM, gsrohde notifications@github.com wrote:

@dlebauer https://github.com/dlebauer With the updated genus constraint (allowing hyphens), there are two violations. One has "Psychotria grandis" as the genus, and is probably a duplicate. The other has "NULL" (the string, not a real NULL) as the genus and "palm, other" as the common name. If you clean these up, I can add this constraint to the current batch of value constraints.

— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/247#issuecomment-76218081.

gsrohde commented 9 years ago

I think we can be more restrictive about allowing spaces an periods in species column values be requiring that they only occur as part of ' var. ', ' ssp. ', ' x ', 'L.', etc. The following selected only 42 of the 1729 rows having non-empty species column values:

ebi_production=# select genus, scientificname, species from species where species !~ '^([a-z-]*| var. | ssp. | x | L.($| ))*$';
      genus      |              scientificname              |                 species                  
-----------------+------------------------------------------+------------------------------------------
 Clusia          | Clusia                                   | cf.
 Hymenolobium    | Hymenolobium cf.                         | cf.
 Inga            | Inga cf.                                 | cf.
 Lyonia          | Lyonia cf.                               | cf.
 Microlepia      | Microlepia cf.                           | cf.
 Pandanus        | Pandanus cf.                             | cf.
 Populus         | Populus balsamifera ssp. trichocarpa     | balsimifera spp. trichocarpa
 Celtis          | Celtis                                   | laevigata var reticulata
 Argyrodendron   | Argyrodendron sp.                        | sp.
 Populus         | tremula x tremuloides                    | P. tremula x 'P.tremuloides
 Populus         | populus spp                              | populus spp
 Populus         | p. maximowiczii x p. berolinensis        | maximowiczii x p. berolinensis
 Populus         | Populus maximowiczii Henry x trichocarpa | Populus maximowiczii Henry x trichocarpa
 Cyperus         | Cyperus longus L                         | longus L
 Trollius        |                                          | Trollius europaeus
 Populus         | Populus balsamifera L. ssp. trichocarpa  | Populus balsamifera L. ssp. trichocarpa
 Populus         | Populus trichocarpa Torr. & A. Gray      | trichocarpa Torr. & A. Gray
 Sesleria        |                                          | Sesleria varia
                 | Sesleria                                 | Sesleria varia
 Potentilla      | Potentilla Potentilla aurea              | Potentilla aurea
 Salix           | Salix Polaris                            | Polaris
 Potentilla      | Potentilla Potentilla pusilla            | Potentilla pusilla
 Ranunculus      | Ranunculus Ranunculus nemorosus          | Ranunculus nemorosus
 Primula         | Primula Primula glutinosa                | Primula glutinosa
 Geum            | Geum Geum montanum L.                    | Geum montanum L.
 Geum            | Geum Geum reptans                        | Geum reptans
 Pedicularis     | Pedicularis Pedicularis asplenifolia     | Pedicularis asplenifolia
 Leucanthemopsis | Leucanthemopsis Leucanthemopsis alpina   | Leucanthemopsis alpina
 Campanula       | Campanula Campanula barbata              | Campanula barbata
 Gentiana        | Gentiana kochiana                        | Kochiana
 Senecio         | Senecio incanus                          | Incanus
 Artemisia       | Artemisia genipi                         | Genipi
 Agave           | Agave salmiana                           | Salmiana
                 | Mixed Species succession                 | Mixed Species succession
 Mixed           | Mixed Species                            | Mixed Species succession
 Citrus          | Citrus x sinensis                        | x sinensis
                 | EBI Farm Glycine Max                     | EBI Farm Glycine Max
 Salix           | Salix clone                              | Salix clone
 Potentilla      | Potentilla Potentilla aurea              | Potentilla aurea
 Arabis          | Arabis caerulea                          | Caerulea
 Veronica        | Veronica bellidioides                    | Bellidioides
 Cerastium       | Cerastium uniflorum                      | Uniflorum
(42 rows)

Most of these are obvious errors: missing periods, capitalization, or inclusion of the genus name in the species column.

gsrohde commented 9 years ago

@dlebauer Shall I go ahead and fix what seem to me like obvious errors (including repeated genus name in the scientificname column)?

dlebauer commented 9 years ago

Please do. On Thu, Feb 26, 2015 at 3:50 PM gsrohde notifications@github.com wrote:

@dlebauer https://github.com/dlebauer Shall I go ahead and fix what seem to me like obvious errors (including repeated genus name in the scientificname column)?

— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/247#issuecomment-76281591.

dlebauer commented 9 years ago

from @gsrohde

Box Sync/EBI-modeling/BETYdb/database integrity/informational queries/species_cleanup/out.28715.html

See live view of gist here

The rows with the id (first column) marked in red are rows that can be deleted without losing any information. (Running the generated script “delete_duplicate_species.sql” against ebi_production will carry out the deletions—both of the species rows themselves and any referencing rows in pfts_species.)

This is a multiple-pass script. Once the obvious deletions are done, the script should be run again in “consolidate” mode. This marks rows that can serve as representitive rows for a group of duplicates and also generates a script “consolidate_species.sql” which will update traits/yields/cultivars/pfts_species references to other rows to point to the “representitive row” and then delete those other rows.

These two passes will eliminate the vast majority of duplicates, but running the script a third time will show the remaining cases that will have to be dealt with manually. Most of these cases are cases where the genus or species or AcceptedSymbol columns have different values in the different rows.

If you’re interested, you can read in more depth about how the script works by running ./xsltproc.sh –i (information) in the Box Sync/EBI-modeling/BETYdb/database integrity/informational queries/species_cleanup/ directory.

dlebauer commented 9 years ago

@gsrohde

gsrohde commented 9 years ago

@dlebauer The deletes have been done and the script has been run in consolidation mode with this output: https://gist.github.com/gsrohde/9c989afe686a54b1c62b

The dark green rows will be kept and the others in the same group will be merged with them by updating references. Check over to see if there should be any exceptions.

gsrohde commented 9 years ago

@dlebauer Here is what remains after doing the "easy" deletions and consolidations:

https://gist.github.com/gsrohde/adc365ef114790b55cf9

rendered as html

There are 19 groups of duplicates (down from 209).

dlebauer commented 9 years ago

I've updated all of these so that the duplicates match. Generally, these were cases of synonyms, and I kept the "Accepted" synonym chosen by USDA plants.

gsrohde commented 9 years ago

@dlebauer

Two last groups to fix: https://gist.github.com/gsrohde/175081eaa95e563c8ce4#file-out-25948-html

But even after this, we're still not quite ready to make scientificname unique because there are 162 rows where scientificname is the empty string. So either we need to clean these up too, or we could, instead, implement a partial index on scientificname that would require non-empty names to be unique.

dlebauer commented 9 years ago

Partial index works

On Tuesday, March 17, 2015, gsrohde notifications@github.com wrote:

@dlebauer https://github.com/dlebauer

Two last groups to fix: https://gist.github.com/gsrohde/175081eaa95e563c8ce4#file-out-25948-html

But even after this, we're still not quite ready to make scientificname unique because there are 162 rows where scientificname is the empty string. So either we need to clean these up too, or we could, instead, implement a partial index on scientificname that would require non-empty names to be unique.

— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/247#issuecomment-82609281.

gsrohde commented 9 years ago

All constraints that are checked off are part of pull request #301. Remaining work will be deferred to another migration.