buriy / spacy-ru

Russian language models for spaCy
MIT License

experiments on POS, DEP with vectors added #20

Closed buriy closed 4 years ago

buriy commented 4 years ago

vec = None:

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS    
---  ---------  --------  ---------  ------  ------  
  1  223640.276    88.463  730880.328  83.596  76.853
  2  148058.327    90.114  584817.044  85.881  80.250
  3  130717.553    90.877  527316.186  87.052  81.986
  4  119762.511    91.401  493267.792  87.695  82.952
  5  111856.057    91.757  466701.151  88.174  83.654
  6  105576.776    92.050  445781.350  88.633  84.272
  7  100923.145    92.217  430490.035  88.926  84.673
  8   97009.482    92.397  417598.703  89.172  84.987
  9   93537.305    92.519  406327.851  89.273  85.147
 10   91249.011    92.654  397778.879  89.548  85.465
 11   88429.176    92.771  389028.793  89.692  85.661
 12   85904.785    92.846  380605.604  89.827  85.890
 13   84379.842    92.945  374382.171  89.918  85.980
 14   82414.800    92.938  369255.081  89.951  86.076
 15   80689.611    93.013  362918.682  90.034  86.171
 16   79344.376    93.090  358687.210  90.089  86.275
 17   77798.324    93.168  352626.479  90.212  86.402
 18   76770.146    93.173  347953.520  90.219  86.430
 19   75340.301    93.213  343867.337  90.201  86.446
 20   74087.540    93.253  340865.751  90.302  86.569

vec = paradigms_from_navec, width=96:

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS    
---  ---------  --------  ---------  ------  ------  
  1  220025.384    88.646  728001.125  83.780  77.188
  2  142958.333    90.305  581105.955  86.018  80.505
  3  126015.430    91.099  524746.151  87.134  82.155
  4  114848.158    91.614  491072.387  87.801  83.135
  5  107458.344    91.988  466497.460  88.293  83.815
  6  101008.529    92.211  443295.962  88.528  84.287
  7   96426.611    92.439  428372.915  88.882  84.740 
  8   92170.875    92.579  414975.195  89.066  85.038 
  9   88581.034    92.696  403277.928  89.242  85.302 
 10   86098.166    92.868  393953.664  89.376  85.525 
 11   83284.911    92.935  384577.104  89.584  85.776 
 12   80973.447    93.056  377567.428  89.719  85.966 
 13   78867.902    93.147  369650.711  89.820  86.118 
 14   77335.574    93.197  363767.986  90.017  86.345 
 15   75022.879    93.239  356907.395  90.049  86.399 
 16   73710.124    93.278  352162.579  90.068  86.468 
 17   72468.511    93.312  347531.785  90.204  86.621 
 18   71213.013    93.342  342620.525  90.269  86.714 
 19   69827.644    93.367  339066.409  90.341  86.760 
 20   68615.096    93.403  333796.950  90.340  86.800 

sskorol commented 4 years ago

Vocab: https://gist.github.com/sskorol/3f654f257c57f04fec4a6244402d2ae8 width = 150

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS    NER Loss   NER P   NER R   NER F   Token %  CPU WPS  GPU WPS
---  ---------  --------  ---------  ------  ------  ---------  ------  ------  ------  -------  -------  -------
  1  195670.409    89.856  672265.634  85.334  79.454      0.000   0.000   0.000   0.000  100.000     5470    10082
  2  123066.648    91.355  519681.819  87.293  82.405      0.000   0.000   0.000   0.000  100.000     5470    10164
  3  106329.392    92.066  466157.118  88.241  83.827      0.000   0.000   0.000   0.000  100.000     5470    10265
  4   95541.371    92.501  431755.374  88.849  84.679      0.000   0.000   0.000   0.000  100.000     5470    10120
  5   87373.762    92.830  405544.697  89.302  85.328      0.000   0.000   0.000   0.000  100.000     5470    10165
  6   81018.798    93.035  386272.139  89.593  85.777      0.000   0.000   0.000   0.000  100.000     5470    10161
  7   76154.969    93.221  369641.625  89.955  86.197      0.000   0.000   0.000   0.000  100.000     5470    10254
  8   71413.536    93.303  356122.455  90.107  86.369      0.000   0.000   0.000   0.000  100.000     5470    10277
  9   68078.113    93.377  342631.283  90.176  86.483      0.000   0.000   0.000   0.000  100.000     5470    10199
 10   64366.264    93.459  333533.511  90.230  86.599      0.000   0.000   0.000   0.000  100.000     5470    10203
 11   61935.280    93.583  323958.284  90.339  86.788      0.000   0.000   0.000   0.000  100.000     5470    10204
 12   59330.765    93.604  315470.553  90.440  86.907      0.000   0.000   0.000   0.000  100.000     5470    10233
 13   57306.374    93.643  306508.847  90.552  87.035      0.000   0.000   0.000   0.000  100.000     5470    10159
 14   55086.122    93.636  300465.398  90.598  87.090      0.000   0.000   0.000   0.000  100.000     5470    10084
 15   53226.834    93.667  293474.921  90.596  87.139      0.000   0.000   0.000   0.000  100.000     5470     9950
 16   51378.742    93.736  286699.144  90.597  87.164      0.000   0.000   0.000   0.000  100.000     5470     9894
 17   49850.338    93.753  281217.184  90.558  87.154      0.000   0.000   0.000   0.000  100.000     5470    10332
 18   48744.009    93.771  275373.145  90.627  87.284      0.000   0.000   0.000   0.000  100.000     5470    10339
 19   47157.016    93.761  270591.588  90.720  87.332      0.000   0.000   0.000   0.000  100.000     5470    10324
 20   45887.111    93.772  267271.456  90.718  87.338      0.000   0.000   0.000   0.000  100.000     5470    10285

buriy commented 4 years ago

vec = norms_from_navec, width=96, https://gist.github.com/buriy/f81330ccd5f35e503f957e96844540e3/revisions#diff-27299c9d2f12875b2e5f4fce3ce484f3:

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS
---  ---------  --------  ---------  ------  ------
  1  217942.606    88.781  716078.814  83.983  77.543     
  2  141262.220    90.420  572827.089  86.235  80.805     
  3  124264.333    91.266  518904.142  87.323  82.419     
  4  113914.387    91.712  482734.570  87.997  83.472     
  5  106446.137    92.102  458973.689  88.457  84.146     
  6  100019.644    92.343  439558.357  88.811  84.642      
  7   95041.872    92.522  422688.017  89.092  85.050      
  8   91603.579    92.685  409935.470  89.272  85.307      
  9   87856.239    92.822  399263.286  89.391  85.522      
 10   85022.764    92.951  387969.582  89.481  85.629      
 11   82410.319    93.015  379647.275  89.584  85.783      
 12   80401.838    93.084  372475.272  89.784  86.044      
 13   78053.943    93.168  366079.551  89.851  86.145      
 14   76207.141    93.235  360122.291  89.884  86.201
 15   74474.127    93.297  354038.745  89.948  86.304      
 16   72706.680    93.358  347340.824  90.061  86.456      
 17   71683.732    93.386  341853.629  90.116  86.537      
 18   70011.172    93.439  340336.212  90.126  86.556      
 19   69002.519    93.447  334133.481  90.228  86.665      
 20   68094.355    93.488  330321.146  90.332  86.811      
 21   66957.630    93.512  327471.405  90.365  86.871      
 22   65769.578    93.515  321939.197  90.379  86.890      
 23   64771.484    93.563  320606.388  90.357  86.905      
 24   64398.250    93.585  317066.967  90.370  86.915      
 25   63262.954    93.593  314013.264  90.433  86.987      
 26   62345.851    93.628  310481.751  90.451  87.005      
 27   61370.863    93.637  308041.837  90.468  87.049      
 28   61462.744    93.623  307849.631  90.498  87.070      
 29   60199.385    93.649  301794.233  90.494  87.061      
 30   59873.190    93.661  301905.787  90.528  87.101      

buriy commented 4 years ago

No vectors, width=150

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS    
---  ---------  --------  ---------  ------  ------  
  1  200550.240    89.650  683498.798  85.063  79.134
  2  127058.552    91.217  528910.494  87.142  82.170
  3  110075.959    91.884  473528.988  88.208  83.618
  4  99022.254    92.303  437539.379  88.860  84.571 
  5  91371.813    92.559  409869.130  89.174  85.010 
  6  84689.915    92.734  390382.618  89.518  85.538 
  7  80111.442    92.938  376035.951  89.641  85.795 
  8  75410.372    93.085  362495.646  89.886  86.105 
  9  72101.657    93.129  350447.011  90.066  86.322 
 10  68761.490    93.236  337636.480  90.223  86.539 
 11  66325.080    93.309  328696.397  90.276  86.642 
 12  64174.671    93.360  320164.963  90.243  86.676 
 13  61822.824    93.413  312061.878  90.382  86.857 
 14  59674.548    93.442  304749.814  90.498  87.021 
 15  57705.098    93.528  300285.224  90.531  87.044 
 16  55977.893    93.561  293359.634  90.627  87.202 
 17  54286.165    93.547  287182.622  90.592  87.181 
 18  53102.329    93.563  281234.365  90.609  87.233 
 19  51901.250    93.573  278506.581  90.643  87.273 
 20  51107.783    93.586  273447.448  90.670  87.316 
 21  49394.585    93.599  268765.360  90.770  87.409 
 22  48609.565    93.614  265592.125  90.736  87.372 
 23  47323.302    93.622  259167.853  90.815  87.435 
 24  46299.187    93.663  257274.539  90.789  87.389 
 25  45238.024    93.688  253004.762  90.793  87.427 
 26  44462.939    93.678  249416.752  90.738  87.384 
 27  43852.407    93.675  246350.867  90.763  87.432 
 28  43061.078    93.705  244398.172  90.728  87.393 
 29  42539.750    93.682  241708.512  90.769  87.476 
 30  41564.851    93.683  239089.832  90.814  87.563 

vec = norms_from_navec, width=150, https://gist.github.com/buriy/f81330ccd5f35e503f957e96844540e3/revisions#diff-27299c9d2f12875b2e5f4fce3ce484f3:

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS
---  ---------  --------  ---------  ------  ------
  1  196810.935    89.901  674082.843  85.317  79.467
  2  123477.646    91.380  523418.743  87.304  82.415
  3  106914.275    92.034  469262.359  88.214  83.830
  4  95881.340    92.474  435113.845  88.763  84.616 
  5  87536.242    92.732  407065.647  89.154  85.151 
  6  81377.265    92.992  388305.329  89.533  85.629 
  7  76279.151    93.135  372260.317  89.724  85.935 
  8  71530.780    93.297  356597.583  90.060  86.362 
  9  68102.305    93.367  343777.439  90.145  86.477 
 10  64555.731    93.440  335099.408  90.244  86.667 
 11  61823.762    93.514  323844.126  90.359  86.827 
 12  59382.968    93.578  315458.675  90.385  86.875 
 13  57174.704    93.619  306645.989  90.409  86.928 
 14  55390.045    93.703  298922.144  90.473  86.986 
 15  53294.908    93.741  294415.630  90.562  87.131 
 16  51505.326    93.728  286795.211  90.550  87.144 
 17  49953.103    93.787  281430.921  90.614  87.233 
 18  48327.442    93.778  277675.906  90.652  87.295 
 19  47433.790    93.814  271525.058  90.681  87.320 
 20  46165.790    93.848  267326.015  90.710  87.362 
 21  45132.669    93.847  262522.712  90.757  87.406 
 22  43535.840    93.851  258846.360  90.841  87.510 
 23  42617.882    93.875  253999.910  90.820  87.519 
 24  42065.062    93.890  250185.802  90.879  87.581 
 25  40989.635    93.909  246938.856  90.910  87.618 
 26  39756.463    93.891  243510.403  90.894  87.637 
 27  39459.779    93.907  240215.944  90.882  87.605 
 28  38455.344    93.902  237592.473  90.860  87.559 
 29  37653.153    93.924  234728.028  90.919  87.633 
 30  37382.722    93.913  230929.532  90.936  87.666 

sskorol commented 4 years ago

Same args, different preparation script: https://gist.github.com/sskorol/2dcc110a58e932810ec55671e849a979

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS
---  ---------  --------  ---------  ------  ------
  1  200112.873    89.756  675646.251  85.087  79.200                 
  2  124427.645    91.356  522325.208  87.304  82.381                                            
  3  107209.685    92.043  467316.867  88.067  83.612                                            
  4  96644.060    92.461  431185.378  88.596  84.386                                              
  5  88648.057    92.815  404570.997  89.096  85.078                                              
  6  81967.809    92.944  384839.287  89.466  85.587                                              
  7  77161.391    93.122  366746.242  89.776  85.998                                              
  8  72891.981    93.242  353431.963  89.968  86.227                                              
  9  69274.212    93.326  342155.651  90.045  86.356                                              
 10  65959.241    93.384  330291.187  90.184  86.503                                              
 11  63091.543    93.458  322220.869  90.170  86.585                                              
 12  61104.035    93.528  311960.291  90.257  86.682                                              
 13  58465.460    93.581  304510.071  90.311  86.768                                              
 14  56823.556    93.638  297195.149  90.422  86.921                                              
 15  55011.108    93.693  290235.511  90.503  87.053                                              
 16  52995.132    93.709  284636.749  90.481  87.028                                              
 17  51415.107    93.739  279206.300  90.620  87.188                                              
 18  49913.032    93.730  273005.355  90.620  87.253                                              
 19  48875.219    93.706  267365.928  90.675  87.331                                              
 20  47712.145    93.744  263814.256  90.670  87.311                                              
 21  46486.756    93.743  261534.572  90.689  87.352                                              
 22  45320.962    93.776  255665.435  90.777  87.459                                              
 23  44482.011    93.801  252113.816  90.768  87.451                                              
 24  43456.462    93.786  248063.270  90.744  87.444                                              
 25  42495.427    93.787  245076.256  90.744  87.485                                              
 26  41839.088    93.810  241856.637  90.721  87.464                                              
 27  40520.905    93.824  238334.513  90.724  87.483                                              
 28  40194.237    93.805  234280.121  90.797  87.564                                              
 29  39270.977    93.829  231577.732  90.776  87.547                                              
 30  38981.384    93.827  229523.775  90.861  87.624

sskorol commented 4 years ago

Fixed script: https://gist.github.com/sskorol/2dcc110a58e932810ec55671e849a979

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS
---  ---------  --------  ---------  ------  ------
  1  161145.583    92.742  632702.367  86.757  81.768
  2  93855.727    93.684  479985.515  88.517  84.389
  3  80398.103    94.214  429999.728  89.312  85.553
  4  72063.494    94.456  396952.023  89.880  86.432
  5  66128.879    94.642  373615.299  90.050  86.731
  6  61062.543    94.739  353606.956  90.396  87.187
  7  57032.251    94.825  338820.866  90.516  87.351
  8  53676.331    94.913  324472.700  90.627  87.539
  9  50491.066    94.948  313799.272  90.769  87.725
 10  47884.488    94.972  301719.675  90.881  87.877
 11  45713.239    95.014  294273.747  90.966  87.990
 12  43999.539    95.054  284167.030  90.994  88.088
 13  42280.597    95.092  278617.027  91.045  88.168
 14  40497.084    95.091  271569.468  91.064  88.239
 15  38970.900    95.058  265302.393  91.088  88.282
 16  37319.622    95.052  259257.846  91.173  88.372
 17  36727.700    95.080  254876.380  91.216  88.412
 18  35479.840    95.076  249545.049  91.250  88.460
 19  34442.474    95.083  244911.117  91.299  88.524
 20  33085.346    95.082  239358.418  91.304  88.526
 21  32468.793    95.114  235457.510  91.376  88.576
 22  31371.001    95.129  232571.953  91.386  88.591
 23  30669.177    95.124  228429.651  91.382  88.597
 24  30256.596    95.123  226056.217  91.336  88.542
 25  29340.200    95.140  221773.170  91.348  88.589
 26  28469.800    95.152  218234.168  91.385  88.600
 27  28012.716    95.166  215142.502  91.369  88.601
 28  27136.307    95.194  212505.200  91.413  88.660
 29  26781.810    95.206  210085.084  91.404  88.638
 30  26448.669    95.202  207351.591  91.388  88.635

sskorol commented 4 years ago

Added tags to vectors: https://gist.github.com/sskorol/2dcc110a58e932810ec55671e849a979

Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS
---  ---------  --------  ---------  ------  ------
  1  148579.293    93.289  602149.213  87.441  82.868
  2  86696.120    94.096  461888.448  89.006  85.180
  3  74762.131    94.472  415305.713  89.697  86.195
  4  66960.488    94.667  384642.778  90.292  86.975
  5  61450.382    94.830  360781.163  90.503  87.291
  6  56722.733    94.955  343727.112  90.727  87.587
  7  53009.846    95.001  328229.325  90.834  87.801
  8  50004.985    95.054  314413.512  90.970  87.975
  9  47342.091    95.083  303663.900  91.089  88.138
 10  44426.744    95.161  294157.662  91.210  88.281
 11  42814.691    95.206  284806.755  91.214  88.331
 12  40708.372    95.219  277395.015  91.212  88.381
 13  39001.517    95.294  271132.501  91.200  88.379
 14  37216.418    95.285  263681.832  91.277  88.496
 15  36106.172    95.287  256903.847  91.345  88.589
 16  34565.183    95.275  252155.354  91.396  88.636
 17  33421.656    95.259  246689.826  91.434  88.681
 18  32859.640    95.306  241977.253  91.496  88.754
 19  31423.111    95.321  236369.869  91.491  88.730
 20  30424.943    95.317  232551.682  91.479  88.721
 21  29464.565    95.340  227281.395  91.502  88.758
 22  28680.611    95.350  224486.824  91.480  88.760
 23  28129.433    95.344  220624.970  91.505  88.788
 24  27250.755    95.371  218851.535  91.568  88.834
 25  26648.756    95.376  214631.149  91.613  88.866
 26  25724.630    95.381  211835.240  91.641  88.906
 27  25347.578    95.392  208937.050  91.579  88.869
 28  24685.303    95.375  203972.069  91.610  88.912
 29  24342.255    95.418  202738.656  91.621  88.927
 30  24096.019    95.385  200364.226  91.641  88.959
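
To compare runs like the ones above without reading the tables by eye, here is a minimal sketch that parses a pasted training log and picks the best iteration by a chosen metric. It is not part of spacy-ru or spaCy; `parse_log`, `best_by`, and the dictionary keys are hypothetical names, and the sketch assumes the first six columns are always Itn, Tag Loss, Tag %, Dep Loss, UAS, LAS, as in the tables in this thread. The sample rows are the last three rows of the table above.

```python
def parse_log(text):
    """Parse a pasted training log into a list of per-iteration dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with an integer iteration number;
        # header lines ("Itn ...") and rule lines ("--- ...") are skipped.
        if len(parts) < 6 or not parts[0].isdigit():
            continue
        tag_loss, tag_acc, dep_loss, uas, las = map(float, parts[1:6])
        rows.append({
            "itn": int(parts[0]),
            "tag_loss": tag_loss,
            "tag_acc": tag_acc,
            "dep_loss": dep_loss,
            "uas": uas,
            "las": las,
        })
    return rows


def best_by(rows, key="las"):
    """Return the row with the highest value of `key`."""
    return max(rows, key=lambda r: r[key])


sample = """
Itn  Tag Loss    Tag %    Dep Loss    UAS     LAS
---  ---------  --------  ---------  ------  ------
 28  24685.303    95.375  203972.069  91.610  88.912
 29  24342.255    95.418  202738.656  91.621  88.927
 30  24096.019    95.385  200364.226  91.641  88.959
"""

rows = parse_log(sample)
best = best_by(rows, "las")
print(best["itn"], best["las"])  # 30 88.959
```

Extra columns such as the NER and WPS ones in the wider table are ignored by this sketch, since only the first six fields of each row are read.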

sbushmanov commented 4 years ago

Training and testing on the nerus (?) dataset is a special case that covers only the news media domain. As some people have pointed out, the NER, for example, won't recognize lowercase entities. It would be interesting to see whether adding some noise (lowercasing, misspellings) would lead to recognizing more entities, say, in casual chat.
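
To make the idea concrete, here is a minimal sketch of that kind of noise. It is not taken from the spacy-ru training scripts; `add_noise` and its parameters `p_lower` and `p_swap` are hypothetical names, and the probabilities are arbitrary. Both perturbations keep the text length unchanged, so character-offset entity annotations would still line up with the noisy copy.

```python
import random


def add_noise(text, rng, p_lower=0.5, p_swap=0.05):
    """Return a noisy copy of `text` with the same length.

    Hypothetical helper: lowercases the whole text with probability
    `p_lower` and swaps adjacent letters with probability `p_swap`.
    """
    # Lowercasing step. Characters whose lowercase form is longer than one
    # character are left alone so the length never changes.
    if rng.random() < p_lower:
        text = "".join(c.lower() if len(c.lower()) == 1 else c for c in text)

    # Misspelling step: swap adjacent letters, only inside runs of letters,
    # so spaces, hyphens and punctuation stay where they were.
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < p_swap:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most one position
        else:
            i += 1
    return "".join(chars)


rng = random.Random(0)
print(add_noise("Москва и Санкт-Петербург", rng, p_lower=1.0, p_swap=0.0))
# -> москва и санкт-петербург
```

Applying this to only a fraction of the training sentences would keep clean, properly cased examples in the data as well.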

buriy commented 4 years ago

@sbushmanov Experiments show that while this improves quality on lowercase entries, it decreases quality on well-written texts. So if your texts have mixed casing, you need a mixed-case model for them, but otherwise it is better to use a good model for properly spelled and cased texts.