gaurav / bettertaxonomy

A script for matching against multiple taxonomic sources
1 stars 1 forks source link

bettertaxonomy

BetterTaxonomy: improved name matching against multiple taxonomic databases.

Requirements

You need to install requests. Try pip install requests or easy_install requests.

Usage

  1. Without an internal database.
$ python bettertaxonomy.py example/test_names.txt -f latin -c example/sources.ini
Configuration loaded from example/sources.ini, 5 match lists configured.
latin,matched_scname,matched_acname,matched_url,matched_source,count,class
Aëronautes,Aëronautes,,./data/LookupClassification.txt#331,Internal database,Aëronautes,Aves
A.,A.,,./data/LookupClassification.txt#76,Internal database,-1,Aves
Arvicolinae,,,,,1234567891234567891234567890,Mammlia
Eutamias minimus,"Eutamias minimus (Bachman, 1839)","Tamias minimus Bachman, 1839",http://gbif.org/species/119924203,(GBIF:Catalogue of Life)  7ddf754f-d193-4cc9-b351-99906754a03b,10,Mammalia
Panthera tigris,"Panthera tigris Linnaeus, 1758",,http://gbif.org/species/103371328,"(GBIF:Mammal Species of the World, 3rd edition) Syst. Nat. , 10th ed. vol. 1 p. 41 672aca30-f1b5-43d3-8a2b-c1606125fa1b",20,Mammalia
Felis tigris,,,,,30,Mammalia
tiger,,,,,10,Mammalia
Felix tigris,,,,,10,Mammalia
oopsie,,,,,5,Mammalia

 - Processed on 06/25/14 on file example/test_names.txt in 0:00:05.042998 time.
 - Rows with names processed: 9 (1.78465 rows/second, 0.56033 seconds/row)
 - 4 names (44.44%) were matched against the following sources:
        Catalogue of Life (GB): 1 (25.00%)
        File-based database (./data/LookupClassification.txt): 2 (50.00%)
    Mammal Species of the World, 3rd edition (GB): 1 (25.00%)
 - Names that could not be matched against any checklist: 5 (55.56%)
  1. With an internal database (which is already matched last):
rgnt2-118-209-dhcp:bettertaxonomy vaidyagi$ python bettertaxonomy.py example/test_names.txt -f latin -c example/sources.ini -i example/internal.txt 
Configuration loaded from example/sources.ini, 5 match lists configured.
latin,matched_scname,matched_acname,matched_url,matched_source,count,class
Aëronautes,Aëronautes,,./data/LookupClassification.txt#331,Internal database,Aëronautes,Aves
A.,A.,,./data/LookupClassification.txt#76,Internal database,-1,Aves
Arvicolinae,Arvicolinae,,example/internal.txt#1268,internal,1234567891234567891234567890,Mammlia
Eutamias minimus,"Eutamias minimus (Bachman, 1839)","Tamias minimus Bachman, 1839",http://gbif.org/species/119924203,(GBIF:Catalogue of Life)  7ddf754f-d193-4cc9-b351-99906754a03b,10,Mammalia
Panthera tigris,"Panthera tigris Linnaeus, 1758",,http://gbif.org/species/103371328,"(GBIF:Mammal Species of the World, 3rd edition) Syst. Nat. , 10th ed. vol. 1 p. 41 672aca30-f1b5-43d3-8a2b-c1606125fa1b",20,Mammalia
Felis tigris,,,,,30,Mammalia
tiger,,,,,10,Mammalia
Felix tigris,,,,,10,Mammalia
oopsie,,,,,5,Mammalia

 - Processed on 06/25/14 on file example/test_names.txt in 0:00:09.053734 time.
 - Rows with names processed: 9 (0.99406 rows/second, 1.00597 seconds/row)
 - 5 names (55.56%) were matched against the following sources:
    Catalogue of Life (GB): 1 (20.00%)
    File-based database (./data/LookupClassification.txt): 2 (40.00%)
    Mammal Species of the World, 3rd edition (GB): 1 (20.00%)
    internal: 1 (20.00%)
 - Names that could not be matched against any checklist: 4 (44.44%)

Once the internal database has been updated, new searches will match against it.

$ python bettertaxonomy.py example/test_names.txt -f latin -c example/sources.ini -i example/internal.txt 
Configuration loaded from example/sources.ini, 5 match lists configured.
latin,matched_scname,matched_acname,matched_url,matched_source,count,class
Aëronautes,Aëronautes,,./data/LookupClassification.txt#331,Internal database,Aëronautes,Aves
A.,A.,,./data/LookupClassification.txt#76,Internal database,-1,Aves
Arvicolinae,Arvicolinae,,example/internal.txt#1268,internal,1234567891234567891234567890,Mammlia
Eutamias minimus,"Eutamias minimus (Bachman, 1839)","Tamias minimus Bachman, 1839",http://gbif.org/species/119924203,(GBIF:Catalogue of Life)  7ddf754f-d193-4cc9-b351-99906754a03b,10,Mammalia
Panthera tigris,"Panthera tigris Linnaeus, 1758",,http://gbif.org/species/103371328,"(GBIF:Mammal Species of the World, 3rd edition) Syst. Nat. , 10th ed. vol. 1 p. 41 672aca30-f1b5-43d3-8a2b-c1606125fa1b",20,Mammalia
Felis tigris,Felis tigris,,example/internal.txt#16321,internal,30,Mammalia
tiger,tiger,,example/internal.txt#16322,internal,10,Mammalia
Felix tigris,Felix tigris,,example/internal.txt#16323,internal,10,Mammalia
oopsie,oopsie,,example/internal.txt#16324,internal,5,Mammalia

 - Processed on 06/25/14 on file example/test_names.txt in 0:00:04.929067 time.
 - Rows with names processed: 9 (1.82590 rows/second, 0.54767 seconds/row)
 - 9 names (100.00%) were matched against the following sources:
    Catalogue of Life (GB): 1 (11.11%)
    File-based database (./data/LookupClassification.txt): 2 (22.22%)
    Mammal Species of the World, 3rd edition (GB): 1 (11.11%)
    internal: 5 (55.56%)
 - Names that could not be matched against any checklist: 0 (0.00%)

Configuration file

To use BetterTaxonomy, you need to set up a configuration file. An example file is provided in the distribution. This tells the script which resources to query for species names.

Each configuration file contains a matchers section, which contains a set of Matcher Lists. Each Matcher List is activated by a particular condition in the format column-name ~ value; if the column name for a particular row matches the value, that Matcher List is used to process that row. If no condition can be matched, the Matcher List marked default is used. The example matchers section looks like the following:

[matchers]
class ~ mammalia = msw3, itis, col, paleodb, ncbi, file-example, taxrefine
class ~ aves = avibase, itis, col, paleodb, ncbi, file-example, taxrefine
class ~ reptilia = itis, col, reptile_database, paleodb, ncbi, file-example, taxrefine
class ~ amphibia = amphibiaweb, itis, col, paleodb, ncbi, file-example, taxrefine
phylum ~ chordata = fishbase, itis, col, paleodb, ncbi, file-example, taxrefine
default = file-example, taxrefine

Each Matcher List consists of a series of matchers. Each matcher must be described in its own matcher section; for example, file-example is described in the section matcher:file-example.

[matcher:file-example]
name = File-based database
file = ./data/LookupClassification.txt
scientificName_column = scientificName

The type of a matcher is determined by the settings it contains: gbif_id indicates a GBIF matcher, while file indicates a File matcher.

File matcher

A file matcher matches a name in any file format that Python's CSV module can accept. It accepts the following properties:

An example of a file matcher is as follows:

[matcher:file-example]
name = File-based database
file = ./data/LookupClassification.txt
scientificName_column = scientificName

GBIF matcher

A GBIF matcher matches a name against a particular checklist on GBIF. It accepts the following properties:

An example of a GBIF matcher is as follows:

[matcher:msw3]
name = Mammal Species of the World, 3rd edition
gbif_id = 672aca30-f1b5-43d3-8a2b-c1606125fa1b

GNA matcher

A GNA matcher uses the Global Names Architecture Resolver to match names against one or more checklists imported into GNA.

An example of a GNA matcher is as follows:

[matcher:msw3_itis_col_paleodb_ncbi]                                            
name = MSW3, ITIS, CoL, PaleoDB, NCBI                                           
gna_id = 174, 3, 1, 172, 4