ContentMine / phylotree

A repository for ami-phylotree development

Validating scientific names after `ami-phylo` parse. #44

Open petermr opened 8 years ago

petermr commented 8 years ago

The genus and species fields can be checked against local taxdump/genus.txt and taxdump/species.txt.

(a) write a generic TaxdumpLookup; (b) integrate it into ami-phylo
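A minimal sketch of what step (a) could look like. The layout of taxdump/genus.txt and taxdump/species.txt is not shown in this issue, so the sketch is filled in memory rather than parsed from those files. The class name `SimpleTaxonLookup` and its `add` method are hypothetical; the three query method names follow the tests posted later in this thread. This is not the actual TaxdumpLookup code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for a taxdump-backed lookup; NOT the real TaxdumpLookup. */
final class SimpleTaxonLookup {
    // genus -> sorted species epithets known for that genus
    private final Map<String, SortedSet<String>> speciesByGenus = new TreeMap<>();

    /** Registers one binomial, e.g. add("Mus", "musculus"). Hypothetical helper. */
    void add(String genus, String species) {
        speciesByGenus.computeIfAbsent(genus, g -> new TreeSet<String>()).add(species);
    }

    /** True if the genus has at least one registered species. Null-safe. */
    boolean isValidGenus(String genus) {
        return genus != null && speciesByGenus.containsKey(genus);
    }

    /** True only if this exact genus/species pair was registered. Null-safe. */
    boolean isValidBinomial(String genus, String species) {
        if (genus == null || species == null) {
            return false;
        }
        SortedSet<String> known = speciesByGenus.get(genus);
        return known != null && known.contains(species);
    }

    /** Species epithets for a genus in alphabetical order; empty if unknown. */
    List<String> lookupSpeciesList(String genus) {
        SortedSet<String> known = (genus == null) ? null : speciesByGenus.get(genus);
        if (known == null) {
            return Collections.<String>emptyList();
        }
        return new ArrayList<String>(known);
    }

    public static void main(String[] args) {
        SimpleTaxonLookup lookup = new SimpleTaxonLookup();
        lookup.add("Mus", "musculus");
        lookup.add("Zyzzyzus", "warreni");
        lookup.add("Zyzzyzus", "calderi");
        System.out.println(lookup.isValidGenus("Mus"));
        System.out.println(lookup.isValidBinomial("Mickey", "mouse"));
        System.out.println(lookup.lookupSpeciesList("Zyzzyzus"));
    }
}
```

The null checks matter for integration: OTUs whose label failed to parse have no genus or species attribute, and a `TreeMap` would throw on a null key.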

petermr commented 8 years ago
org.xmlcml.ami2.lookups.TaxdumpLookup:
    @Test
    public void testGenus() throws Exception {
        TaxdumpLookup taxdumpLookup = new TaxdumpLookup();
        Assert.assertTrue("Mus", taxdumpLookup.isValidGenus("Mus"));
    }

    @Test
    public void testInvalidGenus() throws Exception {
        TaxdumpLookup taxdumpLookup = new TaxdumpLookup();
        Assert.assertFalse("Mickey", taxdumpLookup.isValidGenus("Mickey"));
    }

    @Test
    public void testBinomial() throws Exception {
        TaxdumpLookup taxdumpLookup = new TaxdumpLookup();
        Assert.assertTrue("Mus musculus", taxdumpLookup.isValidBinomial("Mus", "musculus"));
    }

    @Test
    public void testInvalidBinomial() throws Exception {
        TaxdumpLookup taxdumpLookup = new TaxdumpLookup();
        Assert.assertFalse("Mickey mouse", taxdumpLookup.isValidBinomial("Mickey", "mouse"));
    }

    @Test
    public void testSpeciesForGenus() throws Exception {
        TaxdumpLookup taxdumpLookup = new TaxdumpLookup();
        List<String> speciesList = taxdumpLookup.lookupSpeciesList("Zyzzyzus");
        Assert.assertEquals("Zyzzyzus", "[calderi, warreni]", speciesList.toString());
    }
petermr commented 8 years ago

Current progress:

added TaxdumpLookup to editTestLabels()

        TaxdumpLookup taxdumpLookup = new TaxdumpLookup();

                String genus = otu.getAttributeValue("genus", PhyloConstants.CM_PHYLO_NS);
                String species = otu.getAttributeValue("species", PhyloConstants.CM_PHYLO_NS);
                LOG.debug("genus>"+genus+": "+taxdumpLookup.isValidGenus(genus));
                LOG.debug("binomial>"+genus+" "+species+": "+taxdumpLookup.isValidBinomial(genus, species));
validated: Bacillus subtilis 168 (NC_00964) => Bacillus subtilis 168 NC_00964
genus>Bacillus: true
binomial>Bacillus subtilis: true
failed validate: null
failed validate: null
validated: Proprinogenum modestus DSM 2376T (AJ307978) => Proprinogenum modestus DSM 2376T AJB07978; [3__B]
genus>Proprinogenum: false
binomial>Proprinogenum modestus: false
validated: Clostridium botulinum serotype e (M94261) => Clostridium botulinum serotype e M94261
genus>Clostridium: true
binomial>Clostridium botulinum: true
validated: Streptococcus gordonii CH1 (NC_OO9785) => Streptococcus gordonii CH1 NC_009785; [O__0, O__0]
genus>Streptococcus: true
binomial>Streptococcus gordonii: true
validated: Jonquetella anthropi E3_33 (EU840722) => Jonquetella anthropi E3_33 EUB40722; [8__B]
genus>Jonquetella: true
binomial>Jonquetella anthropi: true
validated: Pseudomonas aeruginosa PAO1 (NC_OO2516) => Pseudomonas aeruginosa PAO1 NC_002516; [O__0, O__0]
genus>Pseudomonas: true
binomial>Pseudomonas aeruginosa: true
validated: Thermotoga maritime MSBBT (NC_O00853) => Thermotoga maritime MSBBT NC_000853; [O__0]
genus>Thermotoga: true
binomial>Thermotoga maritime: false
failed validate: null
validated: Mycobacterium tuberculosis H37Ra (NC_0O9525) => Mycobacterium tuberculosis H37Ra NC_009525; [O__0]
genus>Mycobacterium: true
binomial>Mycobacterium tuberculosis: true
validated: Ochrobactrum anthropi ATCC 49188T (NC_0O9667) => Ochrobactrum anthropi ATCC 49188T NC_009667; [O__0]
genus>Ochrobactrum: true
binomial>Ochrobactrum anthropi: true
validated: Fusobacterium nucleatum DSM 20482 (AJ307974) => Fusobacterium nucleatum DSM 20482 AJB07974; [3__B]
genus>Fusobacterium: true
binomial>Fusobacterium nucleatum: true
validated: Caulobacter crescentus CB15 (NC_0O2696) => Caulobacter crescentus CB15 NC_002696; [O__0]
genus>Caulobacter: true
binomial>Caulobacter crescentus: true
failed validate: null
validated: Borrelia burgdorferi B31T (NC_O01218) => Borrelia burgdorferi B31T NC_001218; [O__0]
genus>Borrelia: true
binomial>Borrelia burgdorferi: true
validated: Chlorobium tepidum TLST (NC_OO2932) => Chlorobium tepidum TLST NC_002932; [O__0, O__0]
genus>Chlorobium: true
binomial>Chlorobium tepidum: true
validated: Finegoldia magna ATCC 29328 (NC_010376) => Finegoldia magna ATCC 29328 NC_010376
genus>Finegoldia: true
binomial>Finegoldia magna: true
validated: Bordetella pertussis Tohama (NC_0O2929) => Bordetella pertussis Tohama NC_002929; [O__0]
genus>Bordetella: true
binomial>Bordetella pertussis: true
validated: Neisseria gonorrhoeae FA1090 (NC_002946) => Neisseria gonorrhoeae FA1090 NC_002946
genus>Neisseria: true
binomial>Neisseria gonorrhoeae: true
validated: Pyramidobacter piscolens W5455T (EU379932) => Pyramidobacter piscolens W5455T EUB79932; [3__B]
genus>Pyramidobacter: true
binomial>Pyramidobacter piscolens: true
validated: Haemophilus influenzae RdKW20 (U32697) => Haemophilus influenzae RdKW20 U32697
genus>Haemophilus: true
binomial>Haemophilus influenzae: true
failed validate: null
validated: Synergistes jonesii ATCC 49833T (EU840723) => Synergistes jonesii ATCC 49833T EUB40723; [8__B]
genus>Synergistes: true
binomial>Synergistes jonesii: true
validated: Optiutus terrae PBQO-1T (NC_010571) => Optiutus terrae PBQO-1T NC_010571
genus>Optiutus: false
binomial>Optiutus terrae: false
validated: Porphyromonas gingivalis W83 (AEO15924) => Porphyromonas gingivalis W83 AEO15924
genus>Porphyromonas: true
binomial>Porphyromonas gingivalis: true
validated: Bacteroides fragilis ATCC 252857 (NC_OO3228) => Bacteroides fragilis ATCC 252857 NC_003228; [O__0, O__0]
genus>Bacteroides: true
binomial>Bacteroides fragilis: true
failed validate: null
validated: Bifidobacterium longum NCC2705 (NC_0O4307) => Bifidobacterium longum NCC2705 NC_004307; [O__0]
genus>Bifidobacterium: true
binomial>Bifidobacterium longum: true
validated: Rhodopirellula baltica SH 1T (NC_005027) => Rhodopirellula baltica SH 1T NC_005027
genus>Rhodopirellula: true
binomial>Rhodopirellula baltica: true
validated: Mycoplasma pneumoniae M129 (NC_00O912) => Mycoplasma pneumoniae M129 NC_000912; [O__0]
genus>Mycoplasma: true
binomial>Mycoplasma pneumoniae: true

Most validate. The failures:

binomial>Proprinogenum modestus: false. The correct name is Propionigenium modestum, so this is a serious author-side typo!

binomial>Thermotoga maritime: false. "maritime" is Tesseract's (wrong) guess for "maritima".

genus>Optiutus: false. "Optiutus" is a misprint for "Opitutus".

rossmounce commented 8 years ago

excellent :)

petermr commented 8 years ago

Latest correction results:

syntax OK: Bacillus subtilis 168 (NC_00964) => Bacillus subtilis 168 NC_00964
incorrect syntax: Lactoba
incorrect syntax: Desulfo
syntax OK: Proprinogenum modestus DSM 2376T (AJ307978) => Proprinogenum modestus DSM 2376T AJB07978; [3__B]
***corrected to: Propionigenium modestum
syntax OK: Clostridium botulinum serotype e (M94261) => Clostridium botulinum serotype e M94261
syntax OK: Streptococcus gordonii CH1 (NC_OO9785) => Streptococcus gordonii CH1 NC_009785; [O__0, O__0]
syntax OK: Jonquetella anthropi E3_33 (EU840722) => Jonquetella anthropi E3_33 EUB40722; [8__B]
syntax OK: Pseudomonas aeruginosa PAO1 (NC_OO2516) => Pseudomonas aeruginosa PAO1 NC_002516; [O__0, O__0]
syntax OK: Thermotoga maritime MSBBT (NC_O00853) => Thermotoga maritime MSBBT NC_000853; [O__0]
***corrected to: Thermotoga maritima
incorrect syntax: lsynechococcus elongatusl PCC 6301 (NC_0O6576)
syntax OK: Mycobacterium tuberculosis H37Ra (NC_0O9525) => Mycobacterium tuberculosis H37Ra NC_009525; [O__0]
syntax OK: Ochrobactrum anthropi ATCC 49188T (NC_0O9667) => Ochrobactrum anthropi ATCC 49188T NC_009667; [O__0]
syntax OK: Fusobacterium nucleatum DSM 20482 (AJ307974) => Fusobacterium nucleatum DSM 20482 AJB07974; [3__B]
syntax OK: Caulobacter crescentus CB15 (NC_0O2696) => Caulobacter crescentus CB15 NC_002696; [O__0]
incorrect syntax: Es
syntax OK: Borrelia burgdorferi B31T (NC_O01218) => Borrelia burgdorferi B31T NC_001218; [O__0]
syntax OK: Chlorobium tepidum TLST (NC_OO2932) => Chlorobium tepidum TLST NC_002932; [O__0, O__0]
syntax OK: Finegoldia magna ATCC 29328 (NC_010376) => Finegoldia magna ATCC 29328 NC_010376
syntax OK: Bordetella pertussis Tohama (NC_0O2929) => Bordetella pertussis Tohama NC_002929; [O__0]
syntax OK: Neisseria gonorrhoeae FA1090 (NC_002946) => Neisseria gonorrhoeae FA1090 NC_002946
syntax OK: Pyramidobacter piscolens W5455T (EU379932) => Pyramidobacter piscolens W5455T EUB79932; [3__B]
syntax OK: Haemophilus influenzae RdKW20 (U32697) => Haemophilus influenzae RdKW20 U32697
incorrect syntax: 
syntax OK: Synergistes jonesii ATCC 49833T (EU840723) => Synergistes jonesii ATCC 49833T EUB40723; [8__B]
syntax OK: Optiutus terrae PBQO-1T (NC_010571) => Optiutus terrae PBQO-1T NC_010571
***corrected to: Opitutus terrae
syntax OK: Porphyromonas gingivalis W83 (AEO15924) => Porphyromonas gingivalis W83 AEO15924
syntax OK: Bacteroides fragilis ATCC 252857 (NC_OO3228) => Bacteroides fragilis ATCC 252857 NC_003228; [O__0, O__0]
incorrect syntax: tAquifex aeolicusl VF5 (NC_000918)
syntax OK: Bifidobacterium longum NCC2705 (NC_0O4307) => Bifidobacterium longum NCC2705 NC_004307; [O__0]
syntax OK: Rhodopirellula baltica SH 1T (NC_005027) => Rhodopirellula baltica SH 1T NC_005027
syntax OK: Mycoplasma pneumoniae M129 (NC_00O912) => Mycoplasma pneumoniae M129 NC_000912; [O__0]

Note the 3 corrections (***). The first has author-side errors in both the genus and the species! The correction uses the DamerauLevenshteinAlgorithm: it iterates through the known genera and species, computes the edit distance to each, and picks the name with the smallest distance. It currently assumes only one "best match" for the genus and one "best match" for the species. This might miss some pathological optima, and could be revisited if necessary.
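For readers unfamiliar with the approach, here is a rough sketch of nearest-name matching by edit distance. It uses the optimal string alignment variant of Damerau-Levenshtein (insert, delete, substitute, adjacent transposition), which may differ from what DamerauLevenshteinAlgorithm does internally. The class `NameCorrector` and its methods are hypothetical, and the candidate list in `main` is a tiny hand-made sample, not the taxdump.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of edit-distance correction; not the ami-phylo code. */
final class NameCorrector {

    /** Optimal string alignment distance (restricted Damerau-Levenshtein). */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // adjacent transposition, e.g. "ti" <-> "it"
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    /** Returns the candidate with the smallest distance; on ties the first one wins. */
    static String closest(String query, List<String> candidates) {
        String bestName = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int dist = distance(query, candidate);
            if (dist < bestDistance) {
                bestDistance = dist;
                bestName = candidate;
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        // Hand-made sample of genera; the real list would come from the taxdump.
        List<String> genera = Arrays.asList("Propionigenium", "Propionibacterium", "Opitutus");
        System.out.println(closest("Optiutus", genera));      // Opitutus
        System.out.println(closest("Proprinogenum", genera)); // Propionigenium
    }
}
```

Keeping only the first minimum is the "one best match" simplification in miniature: if two names tie, the second is never considered.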