cverluise / PatCit

Making Patent Citations Uncool Again
MIT License

Non-Latin NPL citations mess up the npl_class #33

Closed cverluise closed 3 years ago

cverluise commented 4 years ago

Because the labelers (myself included) can only read a limited set of languages, the classification model was trained only on examples in English (and a few other Latin-script languages). As a result, citations written in non-Latin scripts are misclassified out of sample.


Add a language detection pipeline, e.g. spaCy-langdetect or spaCy-cld, and either exclude non-English citations or create a specific subset for them.
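A minimal sketch of the "exclude or create a specific subset" idea, not the PatCit implementation. The detector is passed in as a callable so any language detection library could be plugged in; `split_by_language`, `toy_detect` and the `"en"` / `"other"` labels are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, Iterable, List


def split_by_language(
    citations: Iterable[str],
    detect: Callable[[str], str],
    keep: FrozenSet[str] = frozenset({"en"}),
) -> Dict[str, List[str]]:
    """Partition citations into a 'keep' subset and an 'other' subset."""
    subsets: Dict[str, List[str]] = defaultdict(list)
    for text in citations:
        # A citation is kept when its detected language is in the allowed set.
        bucket = "keep" if detect(text) in keep else "other"
        subsets[bucket].append(text)
    return dict(subsets)


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector: 'en' if every character is ASCII."""
    return "en" if text.isascii() else "other"


# Two citations taken from the sample in this thread.
subsets = split_by_language(
    ["Hall et al., Carcinogenesis 2000; 21: 53-60.", "康文甲: '《管道工》'"],
    toy_detect,
)
```

The ASCII check is only a placeholder so the sketch runs without dependencies; a real detector would replace `toy_detect`.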

cverluise commented 4 years ago

Implemented in patcit@nightly using pycld2 (based on CLD2, which is itself derived from the Chromium Compact Language Detector project).

Note for devs: I chose CLD2 rather than CLD3 because CLD2 guarantees text preprocessing (URL cleaning, etc.) while CLD3 does not, which can cause strange errors.
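For illustration, a hedged sketch of how a citation could be labelled with pycld2. It assumes `pycld2.detect(text)` returns a triple `(is_reliable, bytes_found, details)` where `details` is a sequence of `(language_name, language_code, percent, score)` tuples; the helper names `top_language` and `detect_language` are hypothetical and not part of PatCit.

```python
from typing import Sequence, Tuple

# Assumed shape of one entry in the `details` returned by pycld2.detect.
Detail = Tuple[str, str, int, float]


def top_language(details: Sequence[Detail]) -> str:
    """Return the name of the first (most likely) language, or 'Unknown'."""
    return details[0][0] if details else "Unknown"


def detect_language(text: str) -> str:
    """Label `text` with its most likely language name using pycld2."""
    # Third-party dependency, imported lazily so top_language stays stdlib-only.
    import pycld2

    try:
        _is_reliable, _bytes_found, details = pycld2.detect(text)
    except pycld2.error:
        # Raised on input pycld2 cannot process (e.g. invalid UTF-8).
        return "Unknown"
    return top_language(details)
```

The language names returned this way (e.g. `ENGLISH`, `Unknown`, `Chinese`) have the same form as the `LANGUAGE` values in the counts reported in this thread.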

NPL citations language (top 100)

```sql
SELECT
  COUNT(npl_publn_id) AS nb,
  LANGUAGE
FROM
  `npl-parsing.external.npl_language`
GROUP BY
  LANGUAGE
ORDER BY
  nb DESC
```

Row | nb | LANGUAGE
-- | -- | --
1 | 34768364 | ENGLISH
2 | 1944290 | Unknown
3 | 1533901 | Chinese
4 | 347973 | GERMAN
5 | 177638 | Japanese
6 | 98810 | DANISH
7 | 72890 | FRENCH
8 | 30161 | LATIN
9 | 27244 | LUXEMBOURGISH
10 | 21236 | Korean
11 | 19275 | RUSSIAN
12 | 9886 | DUTCH
13 | 5631 | NORWEGIAN
14 | 5598 | POLISH
15 | 5141 | ChineseT
16 | 4544 | PORTUGUESE
17 | 4454 | SPANISH
18 | 4301 | ITALIAN
19 | 3505 | INTERLINGUE
20 | 3109 | NORWEGIAN_N
21 | 1786 | INDONESIAN
22 | 1501 | SCOTS
23 | 1307 | CZECH
24 | 1297 | INTERLINGUA
25 | 1287 | FRISIAN
26 | 1276 | SWEDISH
27 | 1211 | KHASI
28 | 1210 | RHAETO_ROMANCE
29 | 1105 | JAVANESE
30 | 1079 | AFAR
31 | 1078 | MALAGASY
32 | 1022 | HAUSA
33 | 1012 | CATALAN
34 | 922 | CORSICAN
35 | 783 | GALICIAN
36 | 779 | VOLAPUK
37 | 775 | SANSKRIT
38 | 774 | SCOTS_GAELIC
39 | 743 | AFRIKAANS
40 | 676 | GREEK
41 | 650 | FINNISH
42 | 640 | ROMANIAN
43 | 640 | SLOVAK
44 | 590 | WARAY_PHILIPPINES
45 | 540 | MANX
46 | 522 | HUNGARIAN
47 | 488 | X_PIG_LATIN
48 | 480 | SERBIAN
49 | 468 | LITHUANIAN
50 | 467 | TATAR
51 | 442 | NAURU
52 | 440 | CEBUANO
53 | 435 | MALAY
54 | 431 | BASQUE
55 | 427 | HAITIAN_CREOLE
56 | 426 | OCCITAN
57 | 412 | ESTONIAN
58 | 411 | BRETON
59 | 408 | GUARANI
60 | 408 | TAGALOG
61 | 390 | UZBEK
62 | 367 | SESELWA
63 | 354 | VIETNAMESE
64 | 336 | WOLOF
65 | 323 | KINYARWANDA
66 | 311 | X_KLINGON
67 | 301 | MAURITIAN_CREOLE
68 | 298 | SLOVENIAN
69 | 289 | ESPERANTO
70 | 284 | WELSH
71 | 271 | LINGALA
72 | 270 | XHOSA
73 | 254 | CROATIAN
74 | 243 | TURKISH
75 | 238 | BISLAMA
76 | 219 | SHONA
77 | 214 | RUNDI
78 | 205 | TSWANA
79 | 188 | SAMOAN
80 | 179 | FAROESE
81 | 174 | ALBANIAN
82 | 164 | NYANJA
83 | 162 | SWAHILI
84 | 158 | LATVIAN
85 | 157 | SUNDANESE
86 | 156 | IRISH
87 | 156 | HAWAIIAN
88 | 153 | SESOTHO
89 | 145 | SOMALI
90 | 138 | ZHUANG
91 | 135 | TURKMEN
92 | 132 | GANDA
93 | 130 | MALTESE
94 | 121 | FIJIAN
95 | 108 | TONGA
96 | 108 | TSONGA
97 | 105 | OROMO
98 | 86 | ICELANDIC
99 | 77 | AKAN
100 | 75 | GREENLANDIC

`Unknown` seems to consist mainly of very short NPL citations, in particular bibliographical references with many abbreviations, so these citations should be kept.
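The sample of `Unknown` citations in this comment is drawn with the filter `rand()<100/1900000`, which keeps each row independently with a fixed probability, so the sample size is random rather than exactly 100. A small sketch of the arithmetic, assuming the 1,944,290 `Unknown` rows reported in the language counts; `bernoulli_sample_stats` is a hypothetical helper name.

```python
import math
from typing import Tuple


def bernoulli_sample_stats(n_rows: int, p: float) -> Tuple[float, float]:
    """Mean and standard deviation of the size of a per-row random sample."""
    mean = n_rows * p
    std = math.sqrt(n_rows * p * (1 - p))
    return mean, std


# Number of `Unknown` rows and the keep-probability used in the query.
mean, std = bernoulli_sample_stats(1_944_290, 100 / 1_900_000)
print(f"expected sample size: {mean:.1f} (std {std:.1f})")
```

Under these assumptions the expected sample size is a little over 100 rows with a standard deviation of about 10, so a sample of roughly 90 to 110 rows is unsurprising.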

Sample of `Unknown`

```sql
WITH
  tmp AS (
  SELECT
    npl_publn_id
  FROM
    `npl-parsing.external.npl_language`
  WHERE
    LANGUAGE="Unknown")
SELECT
  npl_biblio
FROM
  `usptobias.patstat.tls214` AS npl,
  tmp
WHERE
  tmp.npl_publn_id = npl.npl_publn_id
  AND rand()<100/1900000
```

Row | npl_biblio
-- | --
1 | MicroVit Vitrectomy System , Copyright 1983.
2 | JP 2001-095573
3 | JPN6013032839; J. Dairy Sci., 2001, Vol.84, No.2, pp.319-331
4 | Mac Tool Catalog (1997), p. 17.
5 | Oppolzer, Tetrahedron Lett. No. 12, pp. 1001 1004 (1974).
6 | Derwent-Ref. 84-056356/10
7 | Kishio et al., Jpn. J. Appl. Phys. (1987) 26:L1228.
8 | Georges et al., Macromolecules 1994, 27, 7228.
9 | Kalmbach et al (2007 JMB 371:639-48).
10 | Laser Pegs 2012 Catalog.
11 | Ueda et al., CA, 106, 1987, 79659k.
12 | JPN7015002669; Feng, Guo-Liang; Ji, Shun-Jun; Lai, Wen-Yong; Huang, Wei: 'Synthesis and optical properties of starburst carbazoles based on 9-phenylcarbazole core' Synlett (17), 2006, 2841-2845
13 | WO 88/00617
14 | Hannun et al., J. Biol. Chem. 262: 13620, 1987.
15 | AKIRI ET AL., ONCOGENE, vol. 28, 2009, pages 2163 - 2172
16 | SUGAWARE M. ET AL.: 'pH Kanjusei Maku Yugo Liposome Lipoplex Fukugotai ni yoru Idenshi Delivery: Ca Ion Doji Donyu ni yoru Idenshi Donyu Koka no Zokyo', DRUG DELIVERY SYSTEM, vol. 17, no. 3, 2002, pages 272, II-O-13, XP003016762
17 | Morrison & Boyd, Chapter 22, Organic Chemistry, 3rd Ed. (1973).
18 | Cheng, et al., Tetrahedron Lett., 32(49), 7333 7336 (1991).
19 | JPN6012043393; Tim Olson, Bob O'Hara, Emily H. Qi, Necati Canpolat, Simon Black, Jari Jokela: 'Normative Text Proposal for Diagnostics and Troubleshooting' IEEE 802.11-05/1070r2 , 20060111, paragraph 7.3,21.13, IEEE mentor
20 | Yayon et al. 1991. Cell 64:841.
21 | JPN6013054674; Zinner H et al: Journal fuer Praktische Chemie Vol.317, 1975, p.379-86
22 | DE-Z: 'ntz' Heft 13, 1984, S. 175-176
23 | Poulos, et al, GenBank No. AAT67231.1 2006.
24 | Presnov, M.A., et al., 'Antitumor properties of cis-dichlorodiamminedihydroxyplatinum(IV)', Izvestiya Akademii Nauk SSSR, Seriya Biologicheskaya (1986), (3), pp. 417-428, 1986.
25 | Hartlage-Rubsamen et al., Glia 41(2) 169-179 (Dec. 28, 2002).
26 | KOSHKIN ET AL., TETRAHEDRON, vol. 54, 1998, pages 3607 - 3630
27 | Dubreuil et al., Endocrinology (1989) 125(3):1378 1384.
28 | J. Kresta, R. Chang, S. Kathiriya and K. Frisch, Makromol Chemie , 180, p. 1081 (1979).
29 | Schilmiller et al, 2009, PNAS, 106:10865-10870, see pp. 10866-10867.
30 | BIOORGANIC & MEDICINAL CHEMISTRY LETTERS, vol. 15, no. 1, 2005, pages 231 - 234
31 | VAN DIJK; VAN DE WINKEL, CURR. OPIN. PHARMACOL., vol. 5, 2001, pages 368 - 74
32 | PEYRAUD J. L.; ROUILLÉ B.; HURTAUD C.; BRUNSCHWIG P.: 'Les acides gras du lait de vache - Collection Synthèse', 2011, article 'La modulation du profil en acides gras des laits par l'alimentation', pages: 13 - 28
33 | 肖刚等: '《大能源 分布式能源》', 30 September 2015
34 | Lettau, Chemie der Heterocyclen, p. 17-27, 1st edition, VEB, Weinheim (1979).
35 | Diamond 2001
36 | 康文甲: '《管道工》', 31 December 1989, article '冷凝器', pages: 604
37 | Gillessen, S. et al., Mouse interleukin 12 (IL 12) p40 homodimer: a potent IL 12 antagonist Eur. J. Immunol. 25:200 206 (1995).
38 | 梁金钟等: '微生物发酵法合成高分子聚合物γ-PGA的研究', 《北京工商大学学报(自然科学版)》
39 | Neurosci. Ltrs 188(1995)41-44,Daidson et al.
40 | Murphy et al., J. Biol. Chem. 269, 6632-6636 (1994).
41 | REICH ET AL., MOL. VISION., vol. 9, 2003, pages 210 - 216
42 | McClean et al, 1993, Eur J Cancer, 29A: 2243-2248.*
43 | Database Uniprot, 'Interleukin-17 receptor B precursor (IL-17 receptor B) (IL-17RB) (Interleukin-17B receptor) (IL-17B receptor) (IL-17 receptor homolog 1) (IL-17Rh1) (IL17Rh1) (Cytokine receptor CRL4)', Accession No. Q9NRM6, May 27, 2002.
44 | Kretzschmar, E. et al., 'Synthese von 2,6-disubstituierten 4-Hydroxy-5,6,7,8-tetrahydropyrido[4,3-d]pyrimidinen', Pharmazie, 43(7), 475-476 (1988).
45 | DE-Firmenprospekt, Flying Kajakat, 1987
46 | JP Office Action dtd Sep. 2, 2008, JP Appln. 2007-021773.
47 | Crainich, L. ‘Forming a 90 deg Bend’ Metal Forming Magazine (1991) vol. 25, No. 8 pp. 59-60.
48 | JPN6015011443; Journal of Experimental Medicine Vol.205,No.2, 2008, p287-294
49 | Cordoba, J. and B. Minguez (2008) “Hepatic Encephalopathy” Semin Liver Dis, 28(1):70-80.
50 | LU, X .; YU, M .; WANG, G .; ZHAI T .; XIE, S .; LING , Y .; TONG, Y .; LI, Y., ADV. MATER., vol. 25, 2013, pages 267 - 272
51 | Kluting, Flierl, Grudno and Luttermann; MTZ Magazine, Aug. 1999, 'Drosselfreie Laststeuerung miy vollvariablen Ventiltrieben'.
52 | DE-Z.: Korrespondenz Abwasser 38(1991), S. 228-34
53 | U.S. Appl. No. 13/608,744.
54 | JPN6013021469; MAALEJ N et al: 'Antithrombotic Effect of Flavonoids in Red Wine' ACS Symp Ser No.661, 1997, Page.247-260
55 | Dixon et al., Ann. Rev. Pharmacol. Toxicol., 1980, p. 441-462, 20.
56 | Albery et al., Amperometric enzyme electrodes , Phil. Trans. R. Soc. Long., vol. B 316, pp. 107 119 (1987).
57 | Kniskern, P. J. et al., Gene 46, 135 (1986) (Kniskern I).
58 | Prospekt, VVS-Isolering der Fa. Gullfiber, 1979
59 | Crosslinking Polymer CA 81(24):153514t Kajiyama et al. Feb. 1970.
60 | M. J. GROGAN; M. R. PRATT; L. A. MARCAURELLE; C. R. BERTOZZI, ANNU. REV. BIOCHEM., vol. 71, 2002, pages 593 - 634
61 | U.S. Appl. No. 11/090,432.
62 | SAMBROOK, J.; RUSSELL, D. W.: 'Molecular Cloning: a Laboratory Manual', 2001, COLD SPRING HARBOR LABORATORY
63 | BiliBed® Phototherapy System, Medela AG,, 6 pages, 2008.
64 | JPN7011004201; J. Natl. Cancer Inst. (1997) vol.89, no.4, p.293-300
65 | GUSTAFSSON ET AL., N ENGL. J. MED., vol. 334, 1996, pages 349 - 355
66 | Sommer-Knudsen, J. et al., Hydroxyproline-Rich Plant Glycoproteins, Phytochemistry, 1998, 47(4): 483-497.
67 | Kaiser, Amino Acids 2012, 42, 679-684
68 | CA113(8): 68388q, 1989.
69 | Honée, G., Convents, D., Van Rie, J., Jansens, S., Peferoen, M., Visser, B. The C-terminal domain of the toxic fragment of a Bacillus thuringiensis crystal protein determines receptor binding. (1991) Mol. Microbiol. 5:2799-2806.
70 | Zhang et al., Acta Pharmacol. Sinica 27(2): 179-183 (2006).
71 | Franz et al., (1980) Pflugeos arch., p. R2.
72 | 王建新: '《化妆品植物原料大全》', 30 June 2012
73 | Hall et al., Carcinogenesis 2000; 21: 53-60.
74 | Lahourcade, Lise , et al., 'Molecular beam epitaxy of semipolar AlN(1122) and GaN(1122) on m-sapphire', J Mater Sci: Mater Electron, No. 19, (2008), pp. 805-809.
75 | JPN6012063322; JETI Vol.55, No.13, 2007, p.35-37
76 | Norm DIN EN 14604
77 | M. Aldissi et al., Polymer, vol. 23, pp. 243 245, (1982).
78 | XP002900204
79 | JPN6012065635; Usha R Deshpande et al: Indian Journal of Experimental Biology 36(6), 1998, p.573-577
80 | Bowie et al. (1990) Science 247 : 1306-1310.
81 | ALTSCHUL ET AL., J. MOL. BIOL., vol. 215, 1990, pages 403 - 410
82 | SP 103 bulletin.
83 | Liu,, '99mTc-Labeling of a Hydrazinonicotinamide-Conjugated Vitronectin Receptor Antagonist Useful For Imaging Tumors' Bioconjugate Chem. 2001, 12, 623-629.
84 | Pereira et al. Polymorphism of Human Cytomegalovirus Glycoproteins Characterized by Monoclonal Antibodies Virology (1984) 139:73 86.
85 | Thompson, J.Virol. 61: 229 232 (1987).
86 | B. Kumar and J. Kumar, J. Electrochem. Soc., 2010, 157, A611.
87 | Rauvala et al., Biochim. Biophys. Acta 531: 266 274, 1978.
88 | Carvajal et al., J. Vet. Diagn. Invest., 7:60-64, (1995).
89 | Okabe, et al. J. Org. Chem. 56:4392 (1991).
90 | ORGANIC LETTERS, 2000, pages 1749 - 51
91 | Einde et al., JFS, 2003, Vol. 68, No. 8, p. 2396-2404.
92 | DIN 3223

cverluise commented 3 years ago

Addressed in v03 🎉. The npl_cat classifier was trained only on examples in English (and `Unknown`). An npl_cat_flag bool was added to v03. npl_cat_flag: