GlobalNamesArchitecture / biodiversity

Scientific Name Parser
MIT License

Name elements that indicate taxon is a virus #7

Closed KatjaSchulz closed 9 years ago

KatjaSchulz commented 9 years ago

It looks like names with these elements are not yet recognized as viruses, so capitalized words are stripped from the canonical form:

- NPV, e.g., Papilio polyxenes NPV: http://eol.org/pages/41592578
- RNA, e.g., Alternaria zinniae dsRNA element: http://eol.org/pages/11611917
- virophage, e.g., Organic Lake virophage: http://eol.org/pages/20868817
- satellites, e.g., Double-stranded RNA satellites: http://eol.org/pages/11603787
- satellite, e.g., Whitefly VEM satellite: http://eol.org/pages/20858522
- betasatellite, e.g., Tomato leaf curl China betasatellite: http://eol.org/pages/11603870
- alphasatellite, e.g., Ageratum yellow vein Singapore alphasatellite: http://eol.org/pages/39738381
- particle, e.g., Mouse Intracisternal A-particle: http://eol.org/pages/11609198
- subgroup, e.g., Subgroup B: http://eol.org/pages/11623168 -- This is probably not limited to viruses, but it's very unlikely that any name that has this string in it will have author information associated with it.
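To make the request concrete, here is a minimal Python sketch of a marker check built only from the examples in this comment. It is an illustration, not the parser's actual rule set: the names `VIRUS_MARKERS` and `has_virus_marker` are hypothetical, and "subgroup" is deliberately left out because it is probably not limited to viruses.

```python
import re

# Hypothetical marker list taken from the examples in this issue;
# this is NOT the rule set used by the biodiversity parser.
VIRUS_MARKERS = re.compile(
    r"""
      \bNPV\b                         # e.g. "Papilio polyxenes NPV"
    | \b(?:ds)?RNA\b                  # "RNA" or "dsRNA" as a standalone word
    | \bvirophage\b                   # e.g. "Organic Lake virophage"
    | \b(?:alpha|beta)?satellites?\b  # satellite(s), alpha-/betasatellite
    | \bparticle\b                    # also matches "A-particle"
    """,
    re.VERBOSE,
)


def has_virus_marker(name: str) -> bool:
    """Return True if the name string contains one of the listed elements."""
    return VIRUS_MARKERS.search(name) is not None


if __name__ == "__main__":
    for example in (
        "Papilio polyxenes NPV",
        "Alternaria zinniae dsRNA element",
        "Mouse Intracisternal A-particle",
    ):
        print(example, "->", has_virus_marker(example))
```

Matching on word boundaries keeps the check from firing on longer tokens that merely contain one of these strings.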

dimus commented 9 years ago

Thanks, Katja. I will check these words against the GN names to see whether they have any unexpected consequences (unlikely), and everything that is safe will go into the next version of the parser.

dimus commented 9 years ago

RNA also occurs in surrogate names, so it is a bit dangerous to treat every name that contains the word RNA as a virus. It is probably OK, though, to simply refuse to parse such a name when there is no other indication that it is a virus:

| Candida albicans RNA_12C-1 |
| Candida albicans RNA_12C-2 |
| Candida albicans RNA_12C-3 |
| Candida albicans RNA_CTR0-1 |
| Candida albicans RNA_CTR0-2 |
| Candida albicans RNA_CTR0-3 |
| Candida albicans RNA_GC75-1 |
| Candida albicans RNA_GC75-2 |
| Candida albicans RNA_GC75-3 |
| Candida albicans RNA_SC5314-1 |
| Candida albicans RNA_SC5314-2 |
| Candida albicans RNA_SC5314-3 |
| Candida tropicalis RNA_Ct1-1 |
| Candida tropicalis RNA_Ct1-2 |
| Candida tropicalis RNA_Ct2-1 |
| Candida tropicalis RNA_Ct2-2 |
| Candida tropicalis RNA_Ct3-1 |
| Candida tropicalis RNA_Ct3-2 |
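The decision described in this comment can be sketched as a three-way classification: call the name a virus when something other than RNA points that way, refuse to parse when RNA is the only hint, and parse normally otherwise. This is a rough Python sketch under stated assumptions; `classify`, `RNA_TOKEN`, `OTHER_VIRUS_HINT`, and the returned labels are hypothetical and do not come from the parser.

```python
import re

# "RNA" or "dsRNA" not embedded in a longer word; also matches "RNA_12C-1".
RNA_TOKEN = re.compile(r"(?<![A-Za-z])(?:ds)?RNA(?![a-z])")

# Assumed stand-in for "other indications it is a virus"; the real parser
# may use a different and larger set of hints.
OTHER_VIRUS_HINT = re.compile(
    r"\b(?:virus|virophage|NPV)\b|satellite", re.IGNORECASE
)


def classify(name: str) -> str:
    """Return 'virus', 'unparsed', or 'parse' for a name string."""
    if OTHER_VIRUS_HINT.search(name):
        return "virus"      # an independent virus hint is present
    if RNA_TOKEN.search(name):
        return "unparsed"   # RNA alone: refuse to parse rather than guess
    return "parse"          # no hint at all: hand over to normal parsing


if __name__ == "__main__":
    for example in (
        "Candida albicans RNA_12C-1",
        "Double-stranded RNA satellites",
        "Candida albicans",
    ):
        print(example, "->", classify(example))
```

Refusing to parse keeps surrogate names such as the Candida examples from being labeled as viruses while still preventing a misleading canonical form.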

dimus commented 9 years ago

I am not sure what to do with subgroup. Here are some examples:

- xLevivirus subgroup I
- plant rhabdoviruses subgroup B
- Zelotes mayanus subgroup
- Teuchophorus notabilis subgroup Meuffels & Grootaert 2004
- Subgroup I Geminivirus
- Sericania mimica subgroup Kobayashi & Fujioka 2008
- Pipistrellus (Hypsugo) imbricatus subgroup

I will leave subgroup as is until I understand how to deal with it.
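One way to see why subgroup is hard to handle is to look at what follows the word in each example. The Python sketch below is only an exploratory aid with hypothetical names (`subgroup_tail`, `SUBGROUP`, `YEAR`); it assumes that a four-digit year after "subgroup" signals authorship, which is a heuristic and not a rule from the parser.

```python
import re

SUBGROUP = re.compile(r"\bsubgroup\b", re.IGNORECASE)
# Assumed heuristic: a plausible publication year marks an authorship string.
YEAR = re.compile(r"\b(?:1[7-9]|20)\d{2}\b")


def subgroup_tail(name: str) -> str:
    """Describe what comes after the first 'subgroup' in a name string."""
    match = SUBGROUP.search(name)
    if match is None:
        return "no subgroup"
    tail = name[match.end():].strip()
    if not tail:
        return "trailing"    # e.g. "Zelotes mayanus subgroup"
    if YEAR.search(tail):
        return "authorship"  # e.g. "... subgroup Meuffels & Grootaert 2004"
    return "label"           # e.g. "... subgroup I", "Subgroup I Geminivirus"


if __name__ == "__main__":
    for example in (
        "Zelotes mayanus subgroup",
        "Teuchophorus notabilis subgroup Meuffels & Grootaert 2004",
        "Subgroup I Geminivirus",
    ):
        print(example, "->", subgroup_tail(example))
```

The examples fall into all three groups, including names with authorship, so a single rule for every name containing subgroup would mishandle some of them.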

dimus commented 9 years ago

Everything except subgroup is covered in v3.1.10.