ContentMine / ami

ami2-species Name detection improvements #25

Open rossmounce opened 9 years ago

rossmounce commented 9 years ago

Now that we're publishing FACTS to http://facts.contentmine.org/ (which is great!), it would be good to make some simple adjustments to the ami-species plugin to filter out some false positives, e.g. 'Mbp' in the screenshot below:

[Screenshot: 2015-09-19-144453_1258x736_scrot]

May I suggest that for two- or three-letter matches (only), these must be further checked against a whitelist of genera provided by the NCBI taxdump? There really are only a very few two- or three-letter genera so hopefully this whitelist approach for the shortest of names wouldn't be too computationally costly?

Examples of two- or three-letter genera to whitelist include: (There are no one letter genera you'll be glad to know!)

Pan https://en.wikipedia.org/wiki/Pan_(genus)
Ia https://en.wikipedia.org/wiki/Ia_(genus)
Aa https://en.wikipedia.org/wiki/Aa_(plant)

petermr commented 9 years ago

We can create a regex such as

((Pan|Ia|Aa)|[A-Z][a-z]?\.|[A-Z][a-z]{3,})

would that do?
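
To see how such a pattern treats 'Mbp', here is a minimal Python sketch. It assumes the pattern is applied to a single candidate genus token and that the parentheses are balanced; the names GENUS_TOKEN and looks_like_genus are hypothetical and not part of ami.

```python
import re

# Three alternatives, as in the pattern proposed in this thread:
#   1. a whitelisted short genus (only the three examples given so far)
#   2. an abbreviated genus: one capital, an optional lowercase letter, a period
#   3. a capitalised word of four or more letters
GENUS_TOKEN = re.compile(r"(?:Pan|Ia|Aa)|[A-Z][a-z]?\.|[A-Z][a-z]{3,}")


def looks_like_genus(token: str) -> bool:
    """Return True if the whole token matches one of the three alternatives."""
    return GENUS_TOKEN.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["Pan", "Ia", "Homo", "H.", "Mbp", "Ab"]:
        print(token, looks_like_genus(token))
```

Under these assumptions, 'Mbp' and 'Ab' are rejected because they are short and not whitelisted, while 'Pan', 'Ia', 'Homo' and 'H.' are accepted.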

rossmounce commented 9 years ago

Yes, perfect.

I shall look into filtering two / three letter names from NCBI taxdump & see if there are any more out there that wouldn't be in NCBI e.g. fossil genera.

rossmounce commented 9 years ago

Here's an impressively full list, made by knowledgeable taxonomists. Suggest we use this: http://mailman.nhm.ku.edu/pipermail/taxacom/2007-January/061176.html

I deduplicated the lists given by Paul Kirk. There are 172 short 2 or 3 letter genus names! https://gist.github.com/rossmounce/cc8bf88f5e9e07a33bf1

blahah commented 9 years ago

Rather than a single long regex, perhaps it would be better to simply compare the genus to each entry in an array? That's all the regex would do anyway, and the regex is a more convoluted way of doing it.
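
A rough Python sketch of that lookup, assuming the whitelist is loaded into a set (so each comparison is a hash lookup rather than a scan of the array). SHORT_GENERA and is_plausible_genus are hypothetical names, and the set holds only the three examples from this thread rather than the full 172-name list in the gist.

```python
# Placeholder whitelist: only the three genera named earlier in this thread.
# The full deduplicated list would be loaded from the gist instead.
SHORT_GENERA = frozenset({"Pan", "Ia", "Aa"})


def is_plausible_genus(name: str) -> bool:
    """Apply the whitelist only to unabbreviated two- or three-letter names."""
    if len(name) <= 3:
        return name in SHORT_GENERA
    # Longer names are not checked here; they pass through unchanged.
    return True


if __name__ == "__main__":
    for name in ["Pan", "Mbp", "Homo"]:
        print(name, is_plausible_genus(name))
```

Abbreviated forms such as 'H.' would need to be handled before this check, since they are also three characters or fewer.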

petermr commented 8 years ago

We should validate all suspected species against a whitelist of genera. This now exists in the dictionaries and is an O(1) check with a Bloom filter.
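
For readers unfamiliar with the technique, a Bloom filter lookup can be sketched as below. This is an illustrative Python toy, not the implementation used in the ami dictionaries; the class name, the bit-array size and the number of hashes are all assumptions. Each lookup computes a fixed number of hash positions, so its cost does not grow with the number of genera stored. A Bloom filter never gives a false negative but can occasionally give a false positive.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter: a bit array plus k hash positions per item."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    genera = BloomFilter()
    for genus in ["Pan", "Ia", "Aa", "Homo"]:  # placeholder genera only
        genera.add(genus)
    print("Pan" in genera)
```

In practice the bit-array size and hash count would be chosen from the size of the genus dictionary and the false-positive rate that is acceptable.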