Open gsrohde opened 9 years ago
I've updated the task description (added notes below the bullets) to clarify where needed
@dlebauer With the updated genus constraint (allowing hyphens), there are two violations (see updated query in "Details" section above). One has "Psychotria grandis" as the genus, and is probably a duplicate. The other has "NULL" (the string, not a real NULL) as the genus and "palm, other" as the common name. If you clean these up, I can add this constraint to the current batch of value constraints.
Note that there are 165 rows where genus = '', but we're allowing that.
I've fixed these. You are certainly welcome to go ahead and apply the
simple fixes - removing a text string "NULL" seems pretty clear; the Genus
also has a clear resolution (I suspect moving the genus name to the genus
column, setting scientificname as Psychotria grandis-duplicate
but you
are in a better position to know how to do this).
If you have any doubt about whether to make a change, feel free to act first, ask later.
On Thu, Feb 26, 2015 at 11:02 AM, gsrohde notifications@github.com wrote:
@dlebauer https://github.com/dlebauer With the updated genus constraint (allowing hyphens), there are two violations. One has "Psychotria grandis" as the genus, and is probably a duplicate. The other has "NULL" (the string, not a real NULL) as the genus and "palm, other" as the common name. If you clean these up, I can add this constraint to the current batch of value constraints.
— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/247#issuecomment-76218081.
I think we can be more restrictive about allowing spaces an periods in species column values be requiring that they only occur as part of ' var. ', ' ssp. ', ' x ', 'L.', etc. The following selected only 42 of the 1729 rows having non-empty species column values:
ebi_production=# select genus, scientificname, species from species where species !~ '^([a-z-]*| var. | ssp. | x | L.($| ))*$';
genus | scientificname | species
-----------------+------------------------------------------+------------------------------------------
Clusia | Clusia | cf.
Hymenolobium | Hymenolobium cf. | cf.
Inga | Inga cf. | cf.
Lyonia | Lyonia cf. | cf.
Microlepia | Microlepia cf. | cf.
Pandanus | Pandanus cf. | cf.
Populus | Populus balsamifera ssp. trichocarpa | balsimifera spp. trichocarpa
Celtis | Celtis | laevigata var reticulata
Argyrodendron | Argyrodendron sp. | sp.
Populus | tremula x tremuloides | P. tremula x 'P.tremuloides
Populus | populus spp | populus spp
Populus | p. maximowiczii x p. berolinensis | maximowiczii x p. berolinensis
Populus | Populus maximowiczii Henry x trichocarpa | Populus maximowiczii Henry x trichocarpa
Cyperus | Cyperus longus L | longus L
Trollius | | Trollius europaeus
Populus | Populus balsamifera L. ssp. trichocarpa | Populus balsamifera L. ssp. trichocarpa
Populus | Populus trichocarpa Torr. & A. Gray | trichocarpa Torr. & A. Gray
Sesleria | | Sesleria varia
| Sesleria | Sesleria varia
Potentilla | Potentilla Potentilla aurea | Potentilla aurea
Salix | Salix Polaris | Polaris
Potentilla | Potentilla Potentilla pusilla | Potentilla pusilla
Ranunculus | Ranunculus Ranunculus nemorosus | Ranunculus nemorosus
Primula | Primula Primula glutinosa | Primula glutinosa
Geum | Geum Geum montanum L. | Geum montanum L.
Geum | Geum Geum reptans | Geum reptans
Pedicularis | Pedicularis Pedicularis asplenifolia | Pedicularis asplenifolia
Leucanthemopsis | Leucanthemopsis Leucanthemopsis alpina | Leucanthemopsis alpina
Campanula | Campanula Campanula barbata | Campanula barbata
Gentiana | Gentiana kochiana | Kochiana
Senecio | Senecio incanus | Incanus
Artemisia | Artemisia genipi | Genipi
Agave | Agave salmiana | Salmiana
| Mixed Species succession | Mixed Species succession
Mixed | Mixed Species | Mixed Species succession
Citrus | Citrus x sinensis | x sinensis
| EBI Farm Glycine Max | EBI Farm Glycine Max
Salix | Salix clone | Salix clone
Potentilla | Potentilla Potentilla aurea | Potentilla aurea
Arabis | Arabis caerulea | Caerulea
Veronica | Veronica bellidioides | Bellidioides
Cerastium | Cerastium uniflorum | Uniflorum
(42 rows)
Most of these are obvious errors: missing periods, capitalization, or inclusion of the genus name in the species column.
@dlebauer Shall I go ahead and fix what seem to me like obvious errors (including repeated genus name in the scientificname column)?
Please do. On Thu, Feb 26, 2015 at 3:50 PM gsrohde notifications@github.com wrote:
@dlebauer https://github.com/dlebauer Shall I go ahead and fix what seem to me like obvious errors (including repeated genus name in the scientificname column)?
— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/247#issuecomment-76281591.
from @gsrohde
Box Sync/EBI-modeling/BETYdb/database integrity/informational queries/species_cleanup/out.28715.html
The rows with the id (first column) marked in red are rows that can be deleted without losing any information. (Running the generated script “delete_duplicate_species.sql” against ebi_production will carry out the deletions—both of the species rows themselves and any referencing rows in pfts_species.)
This is a multiple-pass script. Once the obvious deletions are done, the script should be run again in “consolidate” mode. This marks rows that can serve as representitive rows for a group of duplicates and also generates a script “consolidate_species.sql” which will update traits/yields/cultivars/pfts_species references to other rows to point to the “representitive row” and then delete those other rows.
These two passes will eliminate the vast majority of duplicates, but running the script a third time will show the remaining cases that will have to be dealt with manually. Most of these cases are cases where the genus or species or AcceptedSymbol columns have different values in the different rows.
If you’re interested, you can read in more depth about how the script works by running ./xsltproc.sh –i (information) in the Box Sync/EBI-modeling/BETYdb/database integrity/informational queries/species_cleanup/ directory.
@gsrohde
delete_duplicate_species.sql
and consolodate_species.sql
@dlebauer The deletes have been done and the script has been run in consolidation mode with this output: https://gist.github.com/gsrohde/9c989afe686a54b1c62b
The dark green rows will be kept and the others in the same group will be merged with them by updating references. Check over to see if there should be any exceptions.
@dlebauer Here is what remains after doing the "easy" deletions and consolidations:
https://gist.github.com/gsrohde/adc365ef114790b55cf9
There are 19 groups of duplicates (down from 209).
I've updated all of these so that the duplicates match. Generally, these were cases of synonyms, and I kept the "Accepted" synonym chosen by USDA plants.
@dlebauer
Two last groups to fix: https://gist.github.com/gsrohde/175081eaa95e563c8ce4#file-out-25948-html
But even after this, we're still not quite ready to make scientificname
unique because there are 162 rows where scientificname is the empty string. So either we need to clean these up too, or we could, instead, implement a partial index on scientificname that would require non-empty names to be unique.
Partial index works
On Tuesday, March 17, 2015, gsrohde notifications@github.com wrote:
@dlebauer https://github.com/dlebauer
Two last groups to fix: https://gist.github.com/gsrohde/175081eaa95e563c8ce4#file-out-25948-html
But even after this, we're still not quite ready to make scientificname unique because there are 162 rows where scientificname is the empty string. So either we need to clean these up too, or we could, instead, implement a partial index on scientificname that would require non-empty names to be unique.
— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/247#issuecomment-82609281.
All constraints that are checked off are part of pull request #301. Remaining work will be deferred to another migration.
Constraints
spcd
genus
species
scientificname
db/migrate/PROPOSED_CONSTRAINTS
. Included in this is a constraint ensuring uniqueness of scientificname except in the case where it is empty. (See discussion of "partial index" below.)]Cleanups
delete_duplicate_species.sql
consolodate_species.sql
Details
spcd
Current min and max are 0 and 1007.
genus
We would like to require
genus ~ '^([A-Z][-a-z]*)?$'
. Useto view outliers.
species
We will use the following restriction: species should be zero or more space-or-hyphen-separated groups of capital letters followed by a period, sequences of two or more letters possibly followed by a period, ampersands, and times symbols. The check looks like this:
Note that the multiplication sign (×) is the preferred symbol to use for naming hybrids (see http://en.wikipedia.org/wiki/Multiplication_sign) and not the letter "x". Since standardizing the spelling (including symbols used and whitespace) of names will help prevent duplicates, we should use one or the other and not use them interchangeably when storing names in the database. And we may as well use the preferred symbol. Since users will often use the letter "x" in place of the preferred symbol, a trigger function can be used to automatically correct this during INSERTs and UPDATEs.
scientificname
Generally, we would like the scientificname to be the full binomial name of the species, and so usually it should match
genus || ' ' || species
, possibly followed by some qualification (var. ...
orssp. ...
, for example). The queryshows 192 cases where this match fails for some reason other than that the species has been left blank. (Note that the species is blank in all but 1729 of the 50006 rows of the table!) Sometimes this is due to misspellings, but often the genus names are totally different. (Perhaps some of these are synonyms.)
It might be thought that at the very least we can require that the scientificname column value should textually include both the genus value and the species value, but even this fails in case like
scientificname = "P. balsamifera x P. simonii", genus = "Populus", species = "balsamifera x simonii"
.UPDATE 9/15/2015
This query:
shows there are 141 rows where
scientificname
does not textually include bothgenus
andspecies
. In some cases this is because the genus is abbreviated inscientificname
. This should perhaps be allowed. Even so, the queryshows there are 91 violations for other reasons.
Other columns
Constraints on Kingdom, Division, Class, and Family names should probably be similar to that for genus, but we won't concern ourselves with any of the columns to the right of the
updated_at
column for now.