GlobalNamesArchitecture / gnrd

Global Names Discovery
MIT License

GNRD is grabbing extra words after a taxon name #33

Closed diatomsRcool closed 3 years ago

diatomsRcool commented 3 years ago

GNRD finds the correct taxon name but also grabs the following word, which it shouldn't. Trailing words like "clade", "complex", and "genome" are very common.

data package ID name error type text string notes
497 X. tropicalis genome false positive   should  not have grabbed "genome"
1218 Giardia lamblia genome false positive   should  not have grabbed "genome"
422 Marmota clade false positive   should  not have grabbed the "clade"
743 Cymbella ventricosa false positive   should have grabbed the /turgida too, maybe
956 foraminifera false negative   should have grabbed this and not "benthic foraminifera"
274 Gasteracantha clade false positive Gasteracantha clade as sister should not have captured "clade"
313 T. hudsonicus clade false positive   should not have captured "clade"
252 R. antirrhini complex false positive   should not have captured "complex"
333 C. maculata complex false positive   should not have captured "complex"
333 C. striata complex false positive   should not have captured "complex"
333 C. striata plex false positive   should not have captured "complex", which was on line break
307 Glossina distribution false positive surveys of Glossina distribution to monitor should not have captured "distribution"
225 A. aspera eastern false positive of the A. aspera eastern Caribbean data set should not have captured "eastern"
313 T. hudsonicus genotype false positive   should not have captured "genotype"
317 Ndundulu haplotype false positive   should not have captured "haplotype"
317 Ndundulu kipunji haplotype false positive   should not have captured "haplotype"
252 R. antirrhini lineages false positive associated R. antirrhini lineages, should not have captured "lineages"
313 Tamiasciurus lineages false positive   should not have captured "lineages"
268 Anolis may false positive ecology of Anolis may represent should not have captured "may"
313 Tamiasciurus may false positive Tamiasciurus may lead should not have captured "may"
323 T. truncatus may false positive   should not have captured "may"
329 H. gardneri may false positive in H. gardneri may arise should not have captured "may"
333 C. striata may false positive   should not have captured "may"
252 L. vulgaris origin false positive   should not have captured "origin"
317 Rungwecebus specimen false positive   should not have captured "specimen"
307 G. palpalis subspecies false positive uncharacterized G. palpalis subspecies should not have captured "subspecies"
252 Linaria taxa false positive   should not have captured "taxa"
313 T. hudsonicus used false positive   should not have captured "used"
946 X. clivii after false positive   should not have grabbed "after"
1326 Acanthopagrus schlegeli after false positive   should not have grabbed "after" - related to false negative
422 Alaskan broweri false positive   should not have grabbed "Alaskan"
1359 O. tuberosa alliance false positive   should not have grabbed "alliance"
422 American broweri false positive   should not have grabbed "American"
422 Asiatic camtschatica false positive   should not have grabbed "Asiatic"
422 Both caudata false positive   should not have grabbed "both"
684 N. multifasciatus candidate false positive   should not have grabbed "candidate"
1158 S. glynni change false positive   should not have grabbed "change"
432 Caenorhabditis clade false positive   should not have grabbed "clade"
432 Rhabditis clade false positive   should not have grabbed "clade"
684 L. lemairii clade false positive   should not have grabbed "clade"
684 L. callipterus clade false positive   should not have grabbed "clade"
825 D. simulans clade false positive   should not have grabbed "clade"
1044 C. ohridella clade false positive   should not have grabbed "clade"
1359 O. acetosella clade false positive   should not have grabbed "clade"
1359 O. rosea clade false positive   should not have grabbed "clade"
1523 K. septemlobus clade false positive Within the K. septemlobus clade, the haplotypes should not have grabbed "clade"
457 Trichoptera clades false positive   should not have grabbed "clades"
505 C. ignobilis clades false positive   should not have grabbed "clades"
505 C. melampygus clades false positive   should not have grabbed "clades"
684 L. callipterus clades false positive   should not have grabbed "clades"
684 Eretmodus clades false positive   should not have grabbed "clades"
765 Afrotheria clades false positive   should not have grabbed "clades"
1005 Neotropical clades false positive   should not have grabbed "clades"
1044 Phyllonorycter clades false positive   should not have grabbed "clades"
1359 Oxalis clades false positive   should not have grabbed "clades"
726 Aedes taeniorhynchus collection false positive   should not have grabbed "collection"
1027 Boechera stricta collection false positive   should not have grabbed "collection"
448 P. hylocetes complex false positive   should not have grabbed "complex"
448 A. tigrinum complex false positive   should not have grabbed "complex"
448 Ambystoma tigrinum complex false positive   should not have grabbed "complex"
835 I. batatas complex false positive   should not have grabbed "complex"
966 Drosophila serrata complex false positive   should not have grabbed "complex"
1048 Burkholderia cepacia complex false positive   should not have grabbed "complex"
1048 B. cepacia complex false positive   should not have grabbed "complex"
1310 Paxillus involutus complex false positive   should not have grabbed "complex"
1469 Lycaeides complex false positive   should not have grabbed "complex"
1519 A. tumefaciens confer false positive A. tumefaciens confer the ability to should not have grabbed "confer"
1292 Acropora coral larvae false positive   should not have grabbed "coral larvae"
366 Mytilus data false positive   should not have grabbed "data"
448 P. aztecus data false positive   should not have grabbed "data"
1158 Pocillopora data false positive   should not have grabbed "data"
1293 S. pelliserpentis data false positive   should not have grabbed "data"
1349 D. ferruginea data false positive fit the D. ferruginea data equally well should not have grabbed "data"
684 N. fasciatus distribution false positive   should not have grabbed "distribution"
972 G. clavigera distribution false positive   should not have grabbed "distribution"
1414 Neotyphodium occultans endophyte false positive   should not have grabbed "endophyte"
340 Arthroleptis expanded false positive that Arthroleptis expanded out of should not have grabbed "expanded"
454 S. jarrovii form eight false positive S. jarrovii form eight basal clades should not have grabbed "form eight"
634 R. amarus form host-specific false positive   should not have grabbed "form host-specific"
1414 Neotyphodium fungi false positive   should not have grabbed "fungi"
415 Drosophila genes false positive   should not have grabbed "genes"
432 C. elegans genes false positive   should not have grabbed "genes"
917 Oxphos genes false positive   should not have grabbed "genes"
1441 L. cerasina genes false positive   should not have grabbed "genes"
505 C. ignobilis genetic false positive   should not have grabbed "genetic"
520 Y. pestis genetic false positive   should not have grabbed "genetic"
825 D. mauritiana genetic false positive   should not have grabbed "genetic"
825 D. sechellia genetic false positive   should not have grabbed "genetic"
929 S. cerevisiae genetic false positive   should not have grabbed "genetic"
929 S. paradoxus genetic false positive   should not have grabbed "genetic"
1027 Boechera genetic false positive   should not have grabbed "genetic"
1048 Z. mays genetic false positive   should not have grabbed "genetic"
1093 Sorex genetic false positive   should not have grabbed "genetic"
1310 L. amethystina genetic false positive   should not have grabbed "genetic"
481 D. mojavensis genome false positive   should not have grabbed "genome"
520 R. norvegicus genome false positive   should not have grabbed "genome"
733 S. vulgaris genome false positive   should not have grabbed "genome"
733 A. thaliana genome false positive   should not have grabbed "genome"
825 D. sechellia genome false positive   should not have grabbed "genome"
1234 B. mori genome false positive   should not have grabbed "genome"
1158 S. glynni genotype false positive   should not have grabbed "genotype"
1158 Symbiodinium genotype false positive   should not have grabbed "genotype"
505 Caranx hybrids false positive   should not have grabbed "hybrids"
505 C. melampygus hybrids false positive   should not have grabbed "hybrids"
733 Senecio hybrids false positive   should not have grabbed "hybrids"
733 Spartina hybrids false positive   should not have grabbed "hybrids"
972 Pinus banksiana hybrids false positive   should not have grabbed "hybrids"
972 P. banksiana hybrids false positive   should not have grabbed "hybrids"
1005 C. porosus hybrids false positive   should not have grabbed "hybrids"
1060 Boechera hybrids false positive   should not have grabbed "hybrids"
1414 N. occultans hyphae false positive   should not have grabbed "hyphae"
710 S. cerevisiae isolates false positive   should not have grabbed "isolates"
929 S. cerevisiae isolates false positive   should not have grabbed "isolates"
1234 B. mori larva false positive   should not have grabbed "larva"
457 Annulipalpian larvae false positive   should not have grabbed "larvae"
457 Integripalpian larvae false positive   should not have grabbed "larvae"
457 Spicipalpian larvae false positive   should not have grabbed "larvae"
1207 Drosophila melanogaster larvae false positive   should not have grabbed "larvae"
1292 A. millepora larvae false positive   should not have grabbed "larvae"
1316 Drosophila melanogaster larvae false positive   should not have grabbed "larvae"
1457 H. erythrogramma larvae false positive H. erythrogramma larvae develop should not have grabbed "larvae"
422 Petromarmota lineage false positive   should not have grabbed "lineage"
432 Caenorhabditis lineage false positive   should not have grabbed "lineage"
536 Aphelocoma lineage false positive   should not have grabbed "lineage"
684 L. callipterus lineage false positive   should not have grabbed "lineage"
972 G. clavigera lineage false positive   should not have grabbed "lineage"
972 Grosmannia clavigera lineage false positive   should not have grabbed "lineage"
536 Aphelocoma lineages plus false positive   should not have grabbed "lineages plus"
405 Trabeculus lineages false positive   should not have grabbed "lineages"
536 Aphelocoma lineages false positive   should not have grabbed "lineages"
684 L. callipterus lineages false positive   should not have grabbed "lineages"
884 Pogonomyrmex lineages false positive   should not have grabbed "lineages"
999 P. antipodarum lineages false positive   should not have grabbed "lineages"
1060 Boechera lineages false positive   should not have grabbed "lineages"
1332 Rhinella margaritifera lineages false positive   should not have grabbed "lineages"
1332 Al. femoralis lineages false positive   should not have grabbed "lineages"
1332 Ad. heyeri lineages false positive   should not have grabbed "lineages"
1359 Oxalis lineages false positive   should not have grabbed "lineages"
1519 A. tumefaciens lineages false positive A. tumefaciens lineages compete for a should not have grabbed "lineages"
946 X. largeni loci false positive   should not have grabbed "loci"
366 M. edulis locus false positive for any M edulis locus studied should not have grabbed "locus"
825 D. sechellia mate false positive   should not have grabbed "mate"
340 A. sylvaticus may false positive While A. sylvaticus may comprise should not have grabbed "may"
340 Arthroleptis may false positive within Arthroleptis may aid should not have grabbed "may"
356 N. yunnanensis may false positive that N. yunnanensis may be a specie complex should not have grabbed "may"
405 Procellaria westlandica may false positive   should not have grabbed "may"
405 Pu. huttoni may false positive   should not have grabbed "may"
422 Petromarmota may false positive   should not have grabbed "may"
432 C. elegans may false positive   should not have grabbed "may"
481 D. mojavensis may false positive   should not have grabbed "may"
505 C. melampygus may false positive   should not have grabbed "may"
726 A. taeniorhynchus may false positive   should not have grabbed "may"
923 Neofelis may false positive   should not have grabbed "may"
929 S. cerevisiae may false positive   should not have grabbed "may"
1005 Crocodylus may false positive   should not have grabbed "may"
1005 C. porosus may false positive   should not have grabbed "may"
1070 Aphelocoma californica may false positive   should not have grabbed "may"
1070 A. alexandri may false positive   should not have grabbed "may"
1070 P. edulis may false positive   should not have grabbed "may"
1234 B. mori may false positive   should not have grabbed "may"
1310 L. amethystina may false positive   should not have grabbed "may"
1371 M. arion may false positive   should not have grabbed "may"
1398 S. alterniflora may false positive   should not have grabbed "may"
1441 Laupala may false positive across Laupala may have should not have grabbed "may"
1457 H. erythrogramma may false positive   should not have grabbed "may"
1158 Symbiodinium may false positive   should not have grabbed "may" - related to false negative?
1087 Rhizoglyphus means false positive   should not have grabbed "means"
1191 Varroa mite false positive   should not have grabbed "mite"
1076 M. strigosum origin false positive   should not have grabbed "origin"
422 Perhaps broweri false positive   should not have grabbed "Perhaps"
457 Wormaldia plus false positive   should not have grabbed "plus"
700 M. huetii plus false positive   should not have grabbed "plus"
1109 Salmo trutta pose false positive   should not have grabbed "pose"
383 Macroscelides proboscideus sequence false positive   should not have grabbed "sequence"
422 Marmota sequence false positive   should not have grabbed "sequence"
438 Scaptomyza sequence false positive   should not have grabbed "sequence"
487 Tenebrio molitor sequence false positive   should not have grabbed "sequence"
743 Sellaphora sequence false positive   should not have grabbed "sequence"
1480 Ap. longa sequence false positive an Ap. longa sequence in public should not have grabbed "sequence"
1441 L. eukolea sire false positive   should not have grabbed "sire"
1441 L. cerasina sire false positive   should not have grabbed "sire"
634 A. cygnea site false positive   should not have grabbed "site"
1332 R. dapsilis specimen false positive   should not have grabbed "specimen"
432 C. elegans strain false positive   should not have grabbed "strain"
432 C. briggsae strain false positive   should not have grabbed "strain"
497 X. tropicalis strain false positive   should not have grabbed "strain"
798 H. armigera strain false positive   should not have grabbed "strain"
929 S. cerevisiae strain false positive   should not have grabbed "strain"
929 S. paradoxus strain false positive   should not have grabbed "strain"
1519 A. tumefaciens strain false positive A. tumefaciens strain 15955 should not have grabbed "strain"
1519 Agrobacterium tumefaciens strain false positive Agrobacterium tumefaciens strain 15955 should not have grabbed "strain"
514 Peromyscus studies false positive   should not have grabbed "studies"
1457 Heliocidaris subspecies false positive   should not have grabbed "subspecies"
1457 H. erythrogramma subspecies false positive   should not have grabbed "subspecies"
438 Antopocerus taxa false positive   should not have grabbed "taxa"
536 Aphelocoma taxa false positive   should not have grabbed "taxa"
612 Vitis taxa false positive   should not have grabbed "taxa"
1076 Melampodium taxa false positive   should not have grabbed "taxa"
743 Aulacoseira sequence types false positive   should not have grabbed "types"
464 Drosophila melanogaster used false positive   should not have grabbed "used"
481 D. mojavensis used false positive   should not have grabbed "used"
497 X. tropicalis used false positive   should not have grabbed "used"
684 Cytb genes used false positive   should not have grabbed "used"
1005 C. novaeguineae used false positive   should not have grabbed "used"
1228 R. sylvatica used false positive   should not have grabbed "used"
1044 Cameraria versus false positive   should not have grabbed "versus"
1044 Cameraria ohridella versus false positive   should not have grabbed "versus"
1076 M. nayaritense versus false positive   should not have grabbed "versus"
1463 Lerista bougainvilli via false positive   should not have grabbed "via" - table and line break problem
933 T. grallator volcano false positive   should not have grabbed "volcano"
956 Benthic foraminifera false positive   should not have grabbed benthic
956 For foraminifera false positive   should not have grabbed for
313 T. hudsonicus form distinct false positive and T. hudsonicus form distinct genetic grabbed too much
186 L. vilgalysii sequence false positive   grabbing extra words
186 Lepidostroma vilgalysii specimen false positive   grabbing extra words
252 R. antirhini complex false positive within the R. antirhini complex does interpreted R. antirhini complex as name
192 Haloferax volcanii false negative   picked up extra word
192 Haloferax volcanii strain false positive   picked up extra word
202 Chlamydomonas used false positive   picked up extra word
209 Aedes aegypti strain false positive   picked up extra word
209 Sclerotinia sclerotiorum strain false positive   picked up extra word
209 Escherichia coli strain false positive   picked up extra word
209 C. cinerea strain false positive   picked up extra word
209 C. cinerea young false positive   picked up extra word
209 C. elegans larvae false positive   picked up extra word
209 E. coli strain false positive   picked up extra word
209 A. aegypti larvae false positive   picked up extra word
209 A. gossypii strain false positive   picked up extra word
215 Ostomya lineages false positive   picked up extra word
215 Anticorbula may false positive   picked up extra word
215 Selection??????mya arenaria false positive   picked up extra word
215 Pachydon grade false positive   picked up extra word
215 Pachydon may false positive   picked up extra word
215 A. mencheri present false positive   picked up extra word
182 Mimulus false negative   picked up the "hybrids"
1310 T. scalpturatum complex false positive   related to false negative? should not have grabbed "complex"
110 Aulacoseira clade false positive   used many times in text
110 Navicula clade false positive   used many times in text
137 Oikopleura genome false positive   used many times in text

dimus commented 3 years ago

I will check whether these words can be placed in a rejection dictionary.

https://github.com/gnames/gnfinder/issues/64

dimus commented 3 years ago

We can reject these as species epithets:

after
alliance
candidate
change
clade
clades
collection
complex
confer
coral
distinct
distribution
eight
endophyte
expanded
fungi
genes
genetic
genome
genotype
grade
haplotype
host-specific
hybrids
isolates
larva
larvae
lineage
lineages
map
origin
sequence
site
specific
specimen
strain
studies
subspecies
taxa
type
types
used

We can reject these as uninomials:

american
asiatic
both
perhaps
benthic
for
selection
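
To illustrate how two rejection lists like these might be applied, here is a minimal Python sketch. It is not gnfinder's actual implementation: the function name `trim_candidate` and the set names are hypothetical, and the sets hold only a handful of the words listed above.

```python
# Hypothetical rejection dictionaries holding a few of the words listed above.
REJECTED_EPITHETS = {
    "clade", "clades", "complex", "genome", "strain",
    "lineages", "larvae", "used",
}
REJECTED_UNINOMIALS = {
    "american", "asiatic", "both", "perhaps", "benthic", "for", "selection",
}


def trim_candidate(candidate: str) -> str:
    """Drop rejected words from either end of a candidate name string."""
    words = candidate.split()
    # A rejected word in first position cannot stand as a uninomial or genus.
    while words and words[0].lower() in REJECTED_UNINOMIALS:
        words.pop(0)
    # A rejected word after the first position cannot be a species epithet.
    while len(words) > 1 and words[-1].lower() in REJECTED_EPITHETS:
        words.pop()
    return " ".join(words)


print(trim_candidate("Giardia lamblia genome"))
print(trim_candidate("Benthic foraminifera"))
```

With these assumed sets, "Giardia lamblia genome" is cut back to "Giardia lamblia" and "Benthic foraminifera" to "foraminifera". Words missing from both sets pass through unchanged.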

These are ambiguous as species epithets, so they can only be resolved at the verification stage.

may
data
hyphae
locus
loci
plus
mate
means
mite
pose
sire
via
versus
volcano
volcanii
young
present
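
The three-way split described in this comment (reject outright, defer to verification, accept) could be sketched roughly as follows in Python. This is an illustration under assumptions, not the project's code: `classify_epithet` and both set names are hypothetical, the sets contain only a sample of the listed words, and the verification stage itself is not shown.

```python
# Hypothetical sample of words that can be rejected as species epithets.
REJECTED_EPITHETS = {"clade", "complex", "genome", "strain", "taxa"}
# Hypothetical sample of words that are ambiguous as epithets.
AMBIGUOUS_EPITHETS = {"may", "data", "plus", "mite", "volcanii", "young"}


def classify_epithet(word: str) -> str:
    """Return 'reject', 'verify', or 'accept' for a candidate epithet."""
    w = word.lower()
    if w in REJECTED_EPITHETS:
        return "reject"   # never part of a name; drop during finding
    if w in AMBIGUOUS_EPITHETS:
        return "verify"   # decide later, at the verification stage
    return "accept"       # no dictionary objection to this word


for candidate in ("C. elegans strain", "Haloferax volcanii", "Anolis may"):
    genus_part, epithet = candidate.rsplit(" ", 1)
    print(candidate, "->", classify_epithet(epithet))
```

The point of the "verify" outcome is that a word such as "volcanii" cannot simply be dropped: "Haloferax volcanii" appears in the report above as a name that should be found, so the decision has to wait until the candidate is checked at verification.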