ContentMine / phylotree

A repository for ami-phylotree development

Tidying NCBI taxdump #43

Open petermr opened 9 years ago

petermr commented 9 years ago

The NCBI taxdump contains ca 600K records, looking like:

9   |   Buchnera aphidicola |       |   scientific name |
9   |   Buchnera aphidicola Munson et al. 1991  |       |   synonym |
10  |   "Cellvibrio" Winogradsky 1929   |       |   synonym |
10  |   Cellvibrio  |       |   scientific name |
10  |   Cellvibrio (ex Winogradsky 1929) Blackall et al. 1986 emend. Humphry et al. 2003    |       |   synonym |
11  |   "Cellvibrio gilvus" Hulcher and King 1958   |       |   authority   |
11  |   Cellvibrio gilvus   |       |   equivalent name |
11  |   [Cellvibrio] gilvus |       |   scientific name |
13  |   Dictyoglomus    |       |   scientific name |
13  |   Dictyoglomus Saiki et al. 1985  |       |   authority   |
14  |   ATCC 35947  |       |   type material   |
14  |   DSM 3960    |       |   type material   |
14  |   Dictyoglomus thermophilum   |       |   scientific name |
14  |   Dictyoglomus thermophilum Saiki et al. 1985 |       |   authority   |
14  |   strain H-6-12   |       |   type material   |
16  |   Methyliphilus   |       |   equivalent name |
16  |   Methylophilus   |       |   scientific name |
16  |   Methylophilus Jenkins et al. 1987   |       |   synonym |
16  |   Methylotrophus  |       |   misspelling |
17  |   ATCC 53528  |       |   type material   |
17  |   DSM 46235   |       |   type material   |
17  |   LMG 6787    |       |   type material   |
17  |   Methyliphilus methylitrophus    |       |   equivalent name |
17  |   Methyliphilus methylotrophus    |       |   equivalent name |
17  |   Methylophilus methylitrophus    |       |   equivalent name |
17  |   Methylophilus methylotrophus    |       |   scientific name |
17  |   Methylophilus methylotrophus Jenkins et al. 1987    |       |   authority   |
17  |   Methylophilus sp. CBMB147   |       |   includes    |
17  |   Methylotrophus methylophilus    |       |   synonym |
17  |   NCIB 10515  |       |   type material   |
17  |   NCIMB 10515 |       |   type material   |
17  |   VKM B-1623  |       |   type material   |
18  |   Pelobacter  |       |   scientific name |
18  |   Pelobacter Schink and Pfennig 1983  |       |   authority   |
19  |   DSM 2380    |       |   type material   |
19  |   NBRC 103641 |       |   type material   |
19  |   Pelobacter carbinolicus |       |   scientific name |
19  |   Pelobacter carbinolicus Schink 1984 |       |   authority   |
19  |   strain Gra Bd 1 |       |   type material   |
20  |   Phenylobacterium    |       |   scientific name |

I have looked through this in some detail and propose that for current ami-phylo (mainly tackling IJSEM) we use ONLY "scientific name"s. These names have the following forms:

Genus
Genus species
Genus species qualifiers...

The qualifiers include:

sp. ddd
subsp. ddd
NTCC 1234

and many more.

The qualifiers are stripped, and the names are then sorted and de-duplicated, leaving only single-word genus names or two-word binomials.
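
The reduction described above could look roughly like the following. This is only a sketch and not the actual ScientificNameList code: the class `NameReducer` and its methods are hypothetical, and it assumes that a genus is a single capitalised word and a species epithet is a single all-lowercase word. Anything after those (e.g. `sp. ddd`, `subsp. ddd`, culture collection numbers) is treated as a qualifier and dropped.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of ami-phylotree.
class NameReducer {

    /**
     * Reduce a scientific name to "Genus" or "Genus species".
     * Returns null when the first word is not a plain capitalised genus
     * (e.g. "[Cellvibrio] gilvus"); that rule is an assumption of this sketch.
     */
    public static String reduce(String name) {
        String[] words = name.trim().split("\\s+");
        if (!words[0].matches("[A-Z][a-z]+")) {
            return null;
        }
        // An all-lowercase second word is taken as the species epithet.
        // "sp." and "subsp." contain a dot, so they do not match and only the genus is kept.
        if (words.length > 1 && words[1].matches("[a-z]+")) {
            return words[0] + " " + words[1];
        }
        return words[0];
    }

    /** Reduce every name; the TreeSet sorts the results and removes duplicates. */
    public static SortedSet<String> reduceAll(List<String> names) {
        SortedSet<String> result = new TreeSet<>();
        for (String name : names) {
            String reduced = reduce(name);
            if (reduced != null) {
                result.add(reduced);
            }
        }
        return result;
    }
}
```

With this rule "Methylophilus sp. CBMB147" collapses to "Methylophilus", so it merges with the bare genus entry once duplicates are removed.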

petermr commented 9 years ago

org.xmlcml.ami2.plugins.phylotree.ScientificNameList now reads and processes raw names.dmp from NCBI. We will gradually add new rules to filter out false names.

rossmounce commented 9 years ago

Would prefer to keep it species level for the time being (proposal A) rather than genus level (proposal B) - but yes good to mention the possibility.

petermr commented 9 years ago

I think the middle field may be useful for removing some higher level categories (families) but I/we don't understand it well enough yet.

petermr commented 9 years ago

The value of the genus is that we can check for garbles. I wasn't suggesting we did trees at genus level yet.

rossmounce commented 9 years ago

oh, ok

petermr commented 9 years ago

taxdump field 2 (0-based) is a collection of annotations collected in taxdump/class.txt. Probably not useful. Fields with leading uppercase seem entry-specific. The lowercase ones are more generic but not systematic:

<aardvarks>
<acorn worm>
<actinomycete>
<actinopterygian>
<agaric fungus>
<agrotis>
<all crocodiles>
<all species>
<allotype>
<alpha subdivision>
<amphibia>
<amphibius>
<amphipod>
<anamorph Cystobasidiales>

petermr commented 9 years ago

taxdump field 4 (0-based) is a set of categories ("roles") of which we currently omit all except scientific name:

acronym
anamorph
authority
blast name
common name
equivalent name
genbank acronym
genbank anamorph
genbank common name
genbank synonym
in-part
includes
misnomer
misspelling
scientific name
synonym
teleomorph
type material

i.e. an entry is selected only if:

"scientific name".equals(field[4])
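
For illustration, a minimal sketch of that selection applied to raw names.dmp lines. The class `NamesDmpFilter` and its methods are hypothetical, not the real ScientificNameList API. The field index depends on how the line is tokenised: this sketch splits on the `|` delimiter and trims each field, which puts the role at index 3 and the name at index 1, so the index differs from the one quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ami-phylotree.
class NamesDmpFilter {

    static final String SCIENTIFIC_NAME = "scientific name";

    /** Split one names.dmp record on '|' and trim the surrounding tabs/spaces from each field. */
    public static String[] splitRecord(String line) {
        // limit -1 keeps trailing empty fields, so short records are not silently truncated
        String[] fields = line.split("\\|", -1);
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    /** Return the name field of every record whose role is "scientific name". */
    public static List<String> scientificNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            String[] field = splitRecord(line);
            // with this tokenisation: 0 = id, 1 = name, 2 = annotation, 3 = role
            if (field.length > 3 && SCIENTIFIC_NAME.equals(field[3])) {
                names.add(field[1]);
            }
        }
        return names;
    }
}
```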

petermr commented 9 years ago

Note: moved summary files back to taxdump folder.

petermr commented 9 years ago

Future editing. We can either filter names.dmp per record, e.g.:

/^1234[[:space:]]*|/ d

which removes the records for id 1234 (the id is the first field on each line), or add generic filters. Since we may wish to revise the filters, we should probably build a series of filters to be read and run every time the files are rebuilt. These will be regexes read by ScientificNameList.
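
A rough sketch of how such a re-runnable filter series might work. Everything here is an assumption for discussion rather than existing code: the class `RecordFilters`, and a filter file format with one Java regex per line where blank lines and lines starting with `#` are ignored. A record is dropped if any filter matches anywhere in the raw line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ami-phylotree.
class RecordFilters {

    private final List<Pattern> patterns = new ArrayList<>();

    /** Compile one regex per line; blank lines and '#' comments are skipped (assumed format). */
    public RecordFilters(List<String> filterLines) {
        for (String raw : filterLines) {
            String regex = raw.trim();
            if (regex.isEmpty() || regex.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(regex));
        }
    }

    /** Load the filters from a text file so they can be revised without touching code. */
    public static RecordFilters fromFile(Path filterFile) throws IOException {
        return new RecordFilters(Files.readAllLines(filterFile));
    }

    /** True if any filter matches somewhere in the record. */
    public boolean rejects(String record) {
        for (Pattern p : patterns) {
            if (p.matcher(record).find()) {
                return true;
            }
        }
        return false;
    }

    /** Return the records that survive every filter, in their original order. */
    public List<String> apply(List<String> records) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (!rejects(record)) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

A per-record filter would then be a line such as `^1234\s*\|` in the filter file, and a generic one something like `\bsp\.`; both are run again on every rebuild.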