Open rossmounce opened 9 years ago
We can create a regex such as
(Pan|Ia|Aa|[A-Z][a-z]?\.|[A-Z][a-z]{3,})
would that do?
Yes, perfect.
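For reference, a minimal sketch of how the regex above behaves, written in Python rather than in the plugin's own code. The name `GENUS_RE` and the sample tokens are only illustrative, and the pattern is applied to a single genus token with `fullmatch`, which is an assumption about how it would be used.

```python
import re

# Whitelisted short genera, OR an abbreviated genus ("H." / "Ho."),
# OR a capitalised word of four or more letters.
GENUS_RE = re.compile(r"(Pan|Ia|Aa|[A-Z][a-z]?\.|[A-Z][a-z]{3,})")

def looks_like_genus(token: str) -> bool:
    """Return True if the whole token matches the proposed genus pattern."""
    return GENUS_RE.fullmatch(token) is not None

for token in ["Pan", "Ia", "H.", "Homo", "Mbp"]:
    print(token, looks_like_genus(token))
```

Three-letter tokens such as 'Mbp' fall through every branch, so they are rejected unless they are listed explicitly.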
I shall look into filtering two / three letter names from NCBI taxdump & see if there are any more out there that wouldn't be in NCBI e.g. fossil genera.
Here's an impressively full list, made by knowledgeable taxonomists. Suggest we use this: http://mailman.nhm.ku.edu/pipermail/taxacom/2007-January/061176.html
I deduplicated the lists given by Paul Kirk. There are 172 short 2 or 3 letter genus names! https://gist.github.com/rossmounce/cc8bf88f5e9e07a33bf1
Rather than a single long regex, perhaps better to simply compare the genus to each entry in an array? That's all the regex would do anyway, and it's a more convoluted way of doing it.
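A rough sketch of the lookup idea, in Python for illustration only. `SHORT_GENERA` here holds just the three names mentioned in this thread, not the 172 from the gist; the function name is hypothetical, and it assumes unabbreviated genus names (an abbreviation such as "H." would need separate handling).

```python
# Illustrative subset only: the deduplicated gist lists 172 short genera.
SHORT_GENERA = frozenset({"Pan", "Ia", "Aa"})

def is_plausible_genus(genus: str) -> bool:
    """Check two- and three-letter names against the whitelist.

    Longer names are passed through unchanged; only the shortest
    candidates are looked up.
    """
    if len(genus) <= 3:
        return genus in SHORT_GENERA
    return True

candidates = ["Pan", "Mbp", "Homo", "Aa"]
print([g for g in candidates if is_plausible_genus(g)])
```

A set lookup avoids maintaining a long alternation in the regex, and the list can be loaded from a file without touching the pattern.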
We should validate all suspected species against a whitelist of genera. This now exists in the dictionaries and is an O(1) check with a Bloom filter.
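To illustrate what a Bloom filter check looks like, here is a toy sketch in Python. It is not the dictionary implementation used by the project; the class, the bit-array size, and the hashing scheme are all assumptions chosen for brevity. A Bloom filter never misses a name that was added, but it can occasionally accept a name that was not, so it suits a fast first-pass check.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: fixed-size bit array with k hash positions per item."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True if every position is set: "possibly present".
        # False if any position is clear: "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

genera = BloomFilter()
for name in ["Pan", "Ia", "Aa", "Homo"]:  # illustrative names only
    genera.add(name)

print("Pan" in genera)
```

The lookup cost depends only on the number of hash functions, not on how many genera are stored, which is where the constant-time claim comes from.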
Now that we're publishing FACTS to http://facts.contentmine.org/ (which is great!), it would be good to make some simple adjustments to the ami-species plugin to filter out some false positives, e.g. 'Mbp'.
May I suggest that two- or three-letter matches (only) be further checked against a whitelist of genera provided by the NCBI taxdump? There are only a very few two- or three-letter genera, so hopefully this whitelist approach for the shortest of names wouldn't be too computationally costly.
Examples of two- or three-letter genera to whitelist include: (There are no one letter genera you'll be glad to know!)
Pan https://en.wikipedia.org/wiki/Pan_(genus)
Ia https://en.wikipedia.org/wiki/Ia_(genus)
Aa https://en.wikipedia.org/wiki/Aa_(plant)