Non latin NPL citations mess up the npl_class

cverluise commented 4 years ago

Due to the limited language abilities of the labelers (including me), the classification model was trained only on examples in English (and a few other Latin-script languages). Hence, citations written in non-Latin scripts are misclassified out of sample.

Proposal

Add a language detection pipeline, e.g. spaCy-langdetect or spaCy-cld, and either exclude non-English citations or put them in a specific subset.

cverluise commented 4 years ago

Implemented in patcit@nightly using pycld2 (based on CLD2, which is itself derived from the Chromium Compact Language Detector project).

Note for devs: I chose CLD2 rather than CLD3 because CLD2 guarantees text preprocessing (URL cleaning, etc.) while CLD3 does not, which can cause strange errors.
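A minimal sketch of what such a detection step could look like with pycld2. This is not the actual patcit@nightly code: the function names `detect_with_cld2` and `is_english_or_unknown` are hypothetical, and the detector is passed in as a parameter so the filter can be tried without pycld2 installed.

```python
from typing import Callable, Tuple


def detect_with_cld2(text: str) -> Tuple[str, bool]:
    """Return (language_name, is_reliable) for `text` using pycld2.

    pycld2 is a third-party package; it is imported lazily so the rest
    of this sketch can run without it.
    """
    import pycld2

    is_reliable, _bytes_found, details = pycld2.detect(text)
    # `details` holds (name, code, percent, score) tuples; the first is the top guess.
    return details[0][0], is_reliable


def is_english_or_unknown(
    text: str,
    detect: Callable[[str], Tuple[str, bool]] = detect_with_cld2,
) -> bool:
    """True if the top detected language is English or Unknown (hypothetical helper)."""
    language, _is_reliable = detect(text)
    return language.upper() in {"ENGLISH", "UNKNOWN"}
```

With pycld2 installed, `is_english_or_unknown(citation)` can be mapped over the citation strings; the language names it compares against are the ones that appear in the `LANGUAGE` column below.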

NPL citations language (top 100)

```sql
SELECT
  COUNT(npl_publn_id) AS nb,
  LANGUAGE
FROM
  `npl-parsing.external.npl_language`
GROUP BY
  LANGUAGE
ORDER BY
  nb DESC
```

| Row | nb | LANGUAGE |
| -- | -- | -- |
| 1 | 34768364 | ENGLISH |
| 2 | 1944290 | Unknown |
| 3 | 1533901 | Chinese |
| 4 | 347973 | GERMAN |
| 5 | 177638 | Japanese |
| 6 | 98810 | DANISH |
| 7 | 72890 | FRENCH |
| 8 | 30161 | LATIN |
| 9 | 27244 | LUXEMBOURGISH |
| 10 | 21236 | Korean |
| 11 | 19275 | RUSSIAN |
| 12 | 9886 | DUTCH |
| 13 | 5631 | NORWEGIAN |
| 14 | 5598 | POLISH |
| 15 | 5141 | ChineseT |
| 16 | 4544 | PORTUGUESE |
| 17 | 4454 | SPANISH |
| 18 | 4301 | ITALIAN |
| 19 | 3505 | INTERLINGUE |
| 20 | 3109 | NORWEGIAN_N |
| 21 | 1786 | INDONESIAN |
| 22 | 1501 | SCOTS |
| 23 | 1307 | CZECH |
| 24 | 1297 | INTERLINGUA |
| 25 | 1287 | FRISIAN |
| 26 | 1276 | SWEDISH |
| 27 | 1211 | KHASI |
| 28 | 1210 | RHAETO_ROMANCE |
| 29 | 1105 | JAVANESE |
| 30 | 1079 | AFAR |
| 31 | 1078 | MALAGASY |
| 32 | 1022 | HAUSA |
| 33 | 1012 | CATALAN |
| 34 | 922 | CORSICAN |
| 35 | 783 | GALICIAN |
| 36 | 779 | VOLAPUK |
| 37 | 775 | SANSKRIT |
| 38 | 774 | SCOTS_GAELIC |
| 39 | 743 | AFRIKAANS |
| 40 | 676 | GREEK |
| 41 | 650 | FINNISH |
| 42 | 640 | ROMANIAN |
| 43 | 640 | SLOVAK |
| 44 | 590 | WARAY_PHILIPPINES |
| 45 | 540 | MANX |
| 46 | 522 | HUNGARIAN |
| 47 | 488 | X_PIG_LATIN |
| 48 | 480 | SERBIAN |
| 49 | 468 | LITHUANIAN |
| 50 | 467 | TATAR |
| 51 | 442 | NAURU |
| 52 | 440 | CEBUANO |
| 53 | 435 | MALAY |
| 54 | 431 | BASQUE |
| 55 | 427 | HAITIAN_CREOLE |
| 56 | 426 | OCCITAN |
| 57 | 412 | ESTONIAN |
| 58 | 411 | BRETON |
| 59 | 408 | GUARANI |
| 60 | 408 | TAGALOG |
| 61 | 390 | UZBEK |
| 62 | 367 | SESELWA |
| 63 | 354 | VIETNAMESE |
| 64 | 336 | WOLOF |
| 65 | 323 | KINYARWANDA |
| 66 | 311 | X_KLINGON |
| 67 | 301 | MAURITIAN_CREOLE |
| 68 | 298 | SLOVENIAN |
| 69 | 289 | ESPERANTO |
| 70 | 284 | WELSH |
| 71 | 271 | LINGALA |
| 72 | 270 | XHOSA |
| 73 | 254 | CROATIAN |
| 74 | 243 | TURKISH |
| 75 | 238 | BISLAMA |
| 76 | 219 | SHONA |
| 77 | 214 | RUNDI |
| 78 | 205 | TSWANA |
| 79 | 188 | SAMOAN |
| 80 | 179 | FAROESE |
| 81 | 174 | ALBANIAN |
| 82 | 164 | NYANJA |
| 83 | 162 | SWAHILI |
| 84 | 158 | LATVIAN |
| 85 | 157 | SUNDANESE |
| 86 | 156 | IRISH |
| 87 | 156 | HAWAIIAN |
| 88 | 153 | SESOTHO |
| 89 | 145 | SOMALI |
| 90 | 138 | ZHUANG |
| 91 | 135 | TURKMEN |
| 92 | 132 | GANDA |
| 93 | 130 | MALTESE |
| 94 | 121 | FIJIAN |
| 95 | 108 | TONGA |
| 96 | 108 | TSONGA |
| 97 | 105 | OROMO |
| 98 | 86 | ICELANDIC |
| 99 | 77 | AKAN |
| 100 | 75 | GREENLANDIC |

`Unknown` seems to consist mainly of very short NPL citations, in particular bibliographical references with many abbreviations, so they should be kept.

Sample of `Unknown`

```sql
WITH
  tmp AS (
  SELECT
    npl_publn_id
  FROM
    `npl-parsing.external.npl_language`
  WHERE
    LANGUAGE="Unknown")
SELECT
  npl_biblio
FROM
  `usptobias.patstat.tls214` AS npl,
  tmp
WHERE
  tmp.npl_publn_id = npl.npl_publn_id
  AND rand()<100/1900000
```

| Row | npl_biblio |
| -- | -- |
| 1 | MicroVit Vitrectomy System , Copyright 1983. |
| 2 | JP 2001-095573 |
| 3 | JPN6013032839; Ｊ．　Ｄａｉｒｙ　Ｓｃｉ．，　２００１，　Ｖｏｌ．８４，　Ｎｏ．２，　ｐｐ．３１９-３３１ |
| 4 | Mac Tool Catalog (1997), p. 17. |
| 5 | Oppolzer, Tetrahedron Lett. No. 12, pp. 1001 1004 (1974). |
| 6 | Derwent-Ref. 84-056356/10 |
| 7 | Kishio et al., Jpn. J. Appl. Phys. (1987) 26:L1228. |
| 8 | Georges et al., Macromolecules 1994, 27, 7228. |
| 9 | Kalmbach et al (2007 JMB 371:639-48). |
| 10 | Laser Pegs 2012 Catalog. |
| 11 | Ueda et al., CA, 106, 1987, 79659k. |
| 12 | JPN7015002669; Ｆｅｎｇ，　Ｇｕｏ-Ｌｉａｎｇ；　Ｊｉ，　Ｓｈｕｎ-Ｊｕｎ；　Ｌａｉ，　Ｗｅｎ-Ｙｏｎｇ；　Ｈｕａｎｇ，　Ｗｅｉ: 'Ｓｙｎｔｈｅｓｉｓ　ａｎｄ　ｏｐｔｉｃａｌ　ｐｒｏｐｅｒｔｉｅｓ　ｏｆ　ｓｔａｒｂｕｒｓｔ　ｃａｒｂａｚｏｌｅｓ　ｂａｓｅｄ　ｏｎ　９-ｐｈｅｎｙｌｃａｒｂａｚｏｌｅ　ｃｏｒｅ' Ｓｙｎｌｅｔｔ（１７）, 2006, ２８４１-２８４５ |
| 13 | WO 88/00617 |
| 14 | Hannun et al., J. Biol. Chem. 262: 13620, 1987. |
| 15 | AKIRI ET AL., ONCOGENE, vol. 28, 2009, pages 2163 - 2172 |
| 16 | SUGAWARE M. ET AL.: 'pH Kanjusei Maku Yugo Liposome Lipoplex Fukugotai ni yoru Idenshi Delivery: Ca Ion Doji Donyu ni yoru Idenshi Donyu Koka no Zokyo', DRUG DELIVERY SYSTEM, vol. 17, no. 3, 2002, pages 272, II-O-13, XP003016762 |
| 17 | Morrison & Boyd, Chapter 22, Organic Chemistry, 3rd Ed. (1973). |
| 18 | Cheng, et al., Tetrahedron Lett., 32(49), 7333 7336 (1991). |
| 19 | JPN6012043393; Ｔｉｍ　Ｏｌｓｏｎ，　Ｂｏｂ　Ｏ'Ｈａｒａ，　Ｅｍｉｌｙ　Ｈ．　Ｑｉ，　Ｎｅｃａｔｉ　Ｃａｎｐｏｌａｔ，　Ｓｉｍｏｎ　Ｂｌａｃｋ，　Ｊａｒｉ　Ｊｏｋｅｌａ: 'Ｎｏｒｍａｔｉｖｅ　Ｔｅｘｔ　Ｐｒｏｐｏｓａｌ　ｆｏｒ　Ｄｉａｇｎｏｓｔｉｃｓ　ａｎｄ　Ｔｒｏｕｂｌｅｓｈｏｏｔｉｎｇ' ＩＥＥＥ　８０２．１１-０５／１０７０ｒ２ , 20060111, ｐａｒａｇｒａｐｈ　７．３，２１．１３, ＩＥＥＥ　ｍｅｎｔｏｒ |
| 20 | Yayon et al. 1991. Cell 64:841. |
| 21 | JPN6013054674; Ｚｉｎｎｅｒ　Ｈ　ｅｔ　ａｌ: Ｊｏｕｒｎａｌ　ｆｕｅｒ　Ｐｒａｋｔｉｓｃｈｅ　ＣｈｅｍｉｅＶｏｌ．３１７, 1975, ｐ．３７９-８６ |
| 22 | DE-Z: 'ntz' Heft 13, 1984, S. 175-176 |
| 23 | Poulos, et al, GenBank No. AAT67231.1 2006. |
| 24 | Presnov, M.A., et al., 'Antitumor properties of cis-dichlorodiamminedihydroxyplatinum(IV)', Izvestiya Akademii Nauk SSSR, Seriya Biologicheskaya (1986), (3), pp. 417-428, 1986. |
| 25 | Hartlage-Rubsamen et al., Glia 41(2) 169-179 (Dec. 28, 2002). |
| 26 | KOSHKIN ET AL., TETRAHEDRON, vol. 54, 1998, pages 3607 - 3630 |
| 27 | Dubreuil et al., Endocrinology (1989) 125(3):1378 1384. |
| 28 | J. Kresta, R. Chang, S. Kathiriya and K. Frisch, Makromol Chemie , 180, p. 1081 (1979). |
| 29 | Schilmiller et al, 2009, PNAS, 106:10865-10870, see pp. 10866-10867. |
| 30 | BIOORGANIC & MEDICINAL CHEMISTRY LETTERS, vol. 15, no. 1, 2005, pages 231 - 234 |
| 31 | VAN DIJK; VAN DE WINKEL, CURR. OPIN. PHARMACOL., vol. 5, 2001, pages 368 - 74 |
| 32 | PEYRAUD J. L.; ROUILLÉ B.; HURTAUD C.; BRUNSCHWIG P.: 'Les acides gras du lait de vache - Collection Synthèse', 2011, article 'La modulation du profil en acides gras des laits par l'alimentation', pages: 13 - 28 |
| 33 | 肖刚等: '《大能源分布式能源》', 30 September 2015 |
| 34 | Lettau, Chemie der Heterocyclen, p. 17-27, 1st edition, VEB, Weinheim (1979). |
| 35 | Diamond 2001 |
| 36 | 康文甲: '《管道工》', 31 December 1989, article '冷凝器', pages: 604 |
| 37 | Gillessen, S. et al., Mouse interleukin 12 (IL 12) p40 homodimer: a potent IL 12 antagonist Eur. J. Immunol. 25:200 206 (1995). |
| 38 | 梁金钟等: '微生物发酵法合成高分子聚合物γ-PGA的研究', 《北京工商大学学报(自然科学版)》 |
| 39 | Neurosci. Ltrs 188(1995)41-44,Daidson et al. |
| 40 | Murphy et al., J. Biol. Chem. 269, 6632-6636 (1994). |
| 41 | REICH ET AL., MOL. VISION., vol. 9, 2003, pages 210 - 216 |
| 42 | McClean et al, 1993, Eur J Cancer, 29A: 2243-2248.* |
| 43 | Database Uniprot, 'Interleukin-17 receptor B precursor (IL-17 receptor B) (IL-17RB) (Interleukin-17B receptor) (IL-17B receptor) (IL-17 receptor homolog 1) (IL-17Rh1) (IL17Rh1) (Cytokine receptor CRL4)', Accession No. Q9NRM6, May 27, 2002. |
| 44 | Kretzschmar, E. et al., 'Synthese von 2,6-disubstituierten 4-Hydroxy-5,6,7,8-tetrahydropyrido[4,3-d]pyrimidinen', Pharmazie, 43(7), 475-476 (1988). |
| 45 | DE-Firmenprospekt, Flying Kajakat, 1987 |
| 46 | JP Office Action dtd Sep. 2, 2008, JP Appln. 2007-021773. |
| 47 | Crainich, L. ‘Forming a 90 deg Bend’ Metal Forming Magazine (1991) vol. 25, No. 8 pp. 59-60. |
| 48 | JPN6015011443; Ｊｏｕｒｎａｌ　ｏｆ　Ｅｘｐｅｒｉｍｅｎｔａｌ　ＭｅｄｉｃｉｎｅＶｏｌ．２０５，Ｎｏ．２, 2008, ｐ２８７-２９４ |
| 49 | Cordoba, J. and B. Minguez (2008) “Hepatic Encephalopathy” Semin Liver Dis, 28(1):70-80. |
| 50 | LU, X .; YU, M .; WANG, G .; ZHAI T .; XIE, S .; LING , Y .; TONG, Y .; LI, Y., ADV. MATER., vol. 25, 2013, pages 267 - 272 |
| 51 | Kluting, Flierl, Grudno and Luttermann; MTZ Magazine, Aug. 1999, 'Drosselfreie Laststeuerung miy vollvariablen Ventiltrieben'. |
| 52 | DE-Z.: Korrespondenz Abwasser 38(1991), S. 228-34 |
| 53 | U.S. Appl. No. 13/608,744. |
| 54 | JPN6013021469; ＭＡＡＬＥＪ　Ｎ　ｅｔ　ａｌ: 'Ａｎｔｉｔｈｒｏｍｂｏｔｉｃ　Ｅｆｆｅｃｔ　ｏｆ　Ｆｌａｖｏｎｏｉｄｓ　ｉｎ　Ｒｅｄ　Ｗｉｎｅ' ＡＣＳ　Ｓｙｍｐ　ＳｅｒＮｏ．６６１, 1997, Ｐａｇｅ．２４７-２６０ |
| 55 | Dixon et al., Ann. Rev. Pharmacol. Toxicol., 1980, p. 441-462, 20. |
| 56 | Albery et al., Amperometric enzyme electrodes , Phil. Trans. R. Soc. Long., vol. B 316, pp. 107 119 (1987). |
| 57 | Kniskern, P. J. et al., Gene 46, 135 (1986) (Kniskern I). |
| 58 | Prospekt, VVS-Isolering der Fa. Gullfiber, 1979 |
| 59 | Crosslinking Polymer CA 81(24):153514t Kajiyama et al. Feb. 1970. |
| 60 | M. J. GROGAN; M. R. PRATT; L. A. MARCAURELLE; C. R. BERTOZZI, ANNU. REV. BIOCHEM., vol. 71, 2002, pages 593 - 634 |
| 61 | U.S. Appl. No. 11/090,432. |
| 62 | SAMBROOK, J.; RUSSELL, D. W.: 'Molecular Cloning: a Laboratory Manual', 2001, COLD SPRING HARBOR LABORATORY |
| 63 | BiliBed® Phototherapy System, Medela AG, http://www.medela.com/ISBD/neonatology/bilibed/index.php, 6 pages, 2008. |
| 64 | JPN7011004201; Ｊ．　Ｎａｔｌ．　Ｃａｎｃｅｒ　Ｉｎｓｔ．　（１９９７）　ｖｏｌ．８９，　ｎｏ．４，　ｐ．２９３-３００ |
| 65 | GUSTAFSSON ET AL., N ENGL. J. MED., vol. 334, 1996, pages 349 - 355 |
| 66 | Sommer-Knudsen, J. et al., Hydroxyproline-Rich Plant Glycoproteins, Phytochemistry, 1998, 47(4): 483-497. |
| 67 | Kaiser, Amino Acids 2012, 42, 679-684 |
| 68 | CA113(8): 68388q, 1989. |
| 69 | Honée, G., Convents, D., Van Rie, J., Jansens, S., Peferoen, M., Visser, B. The C-terminal domain of the toxic fragment of a Bacillus thuringiensis crystal protein determines receptor binding. (1991) Mol. Microbiol. 5:2799-2806. |
| 70 | Zhang et al., Acta Pharmacol. Sinica 27(2): 179-183 (2006). |
| 71 | Franz et al., (1980) Pflugeos arch., p. R2. |
| 72 | 王建新: '《化妆品植物原料大全》', 30 June 2012 |
| 73 | Hall et al., Carcinogenesis 2000; 21: 53-60. |
| 74 | Lahourcade, Lise , et al., 'Molecular beam epitaxy of semipolar AlN(1122) and GaN(1122) on m-sapphire', J Mater Sci: Mater Electron, No. 19, (2008), pp. 805-809. |
| 75 | JPN6012063322; ＪＥＴＩＶｏｌ．５５，　Ｎｏ．１３, 2007, ｐ．３５-３７ |
| 76 | Norm DIN EN 14604 |
| 77 | M. Aldissi et al., Polymer, vol. 23, pp. 243 245, (1982). |
| 78 | XP002900204 |
| 79 | JPN6012065635; Ｕｓｈａ　Ｒ　Ｄｅｓｈｐａｎｄｅ　ｅｔ　ａｌ: Ｉｎｄｉａｎ　Ｊｏｕｒｎａｌ　ｏｆ　Ｅｘｐｅｒｉｍｅｎｔａｌ　Ｂｉｏｌｏｇｙ３６（６）, 1998, ｐ．５７３-５７７ |
| 80 | Bowie et al. (1990) Science 247 : 1306-1310. |
| 81 | ALTSCHUL ET AL., J. MOL. BIOL., vol. 215, 1990, pages 403 - 410 |
| 82 | SP 103 bulletin. |
| 83 | Liu, et.al., '99mTc-Labeling of a Hydrazinonicotinamide-Conjugated Vitronectin Receptor Antagonist Useful For Imaging Tumors' Bioconjugate Chem. 2001, 12, 623-629. |
| 84 | Pereira et al. Polymorphism of Human Cytomegalovirus Glycoproteins Characterized by Monoclonal Antibodies Virology (1984) 139:73 86. |
| 85 | Thompson, J.Virol. 61: 229 232 (1987). |
| 86 | B. Kumar and J. Kumar, J. Electrochem. Soc., 2010, 157, A611. |
| 87 | Rauvala et al., Biochim. Biophys. Acta 531: 266 274, 1978. |
| 88 | Carvajal et al., J. Vet. Diagn. Invest., 7:60-64, (1995). |
| 89 | Okabe, et al. J. Org. Chem. 56:4392 (1991). |
| 90 | ORGANIC LETTERS, 2000, pages 1749 - 51 |
| 91 | Einde et al., JFS, 2003, Vol. 68, No. 8, p. 2396-2404. |
| 92 | DIN 3223 |

cverluise commented 3 years ago

Addressed in v03 🎉. The npl_cat classifier was trained on examples in English (and unknown) only. An npl_cat_flag bool was added to v03. npl_cat_flag:

if lang in ['en', 'un'], false
else, true

Ideally, one should restrict to npl_cat_flag=True. Closing this issue, feel free to reopen.
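The flag rule in the comment above can be sketched as a one-line function. This follows the rule exactly as it is written there (false for 'en' and 'un', true otherwise); the polarity is taken from the comment and has not been checked against the released v03 table, and the function name simply mirrors the column name.

```python
def npl_cat_flag(lang: str) -> bool:
    """Flag rule as written in the comment: False for 'en'/'un', True otherwise.

    `lang` is assumed to be a two-letter language code such as the ones
    CLD2 reports ('en', 'un', 'zh', ...).
    """
    return lang not in {"en", "un"}
```

Before filtering on the flag, it may be worth confirming which value marks the English/unknown citations in the table actually being queried.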

cverluise / PatCit

Non latin NPL citations mess up the npl_class #33