clics / pyclics

python package implementing the CLICS processing workflow
Apache License 2.0

Fixed bug in unmapped concept purging; better report. #20

Closed: tresoldi closed this 5 years ago

tresoldi commented 5 years ago

Implements the fix discussed in issue #19.

Problematic concepts are now purged:

(env) tresoldi@shh.mpg.de@dlt5802808l:~/src/clics3$ clics load concepticon-data/ glottolog/
INFO    using IClicsForm implementation pyclics.plugin:clics_form
INFO    loading datasets into clics.sqlite
INFO    loading logos
INFO    loading allenbai
INFO    purging 2 problematic concepts from db.
INFO    loading bantubvd
INFO    purging 10 problematic concepts from db.
INFO    loading beidasinitic
INFO    purging 192 problematic concepts from db.
(...)
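
For illustration only, a purge step of this kind could be sketched as below. The table and column names (parameters, forms, concepticon_id, parameter_id) and both function names are hypothetical and not taken from the pyclics schema, and "problematic" is assumed here to mean a concept without a Concepticon mapping.

```python
import sqlite3


def purge_unmapped_concepts(conn, dataset_id):
    """Delete concepts lacking a Concepticon ID, together with their forms.

    Hypothetical schema (an assumption, not the pyclics one):
      parameters(id, dataset_id, concepticon_id)
      forms(id, dataset_id, parameter_id)
    Returns the number of purged concepts.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT id FROM parameters "
        "WHERE dataset_id = ? AND concepticon_id IS NULL",
        (dataset_id,),
    )
    unmapped = [row[0] for row in cur.fetchall()]
    for pid in unmapped:
        # Remove dependent forms first, then the concept itself.
        cur.execute(
            "DELETE FROM forms WHERE dataset_id = ? AND parameter_id = ?",
            (dataset_id, pid),
        )
        cur.execute(
            "DELETE FROM parameters WHERE dataset_id = ? AND id = ?",
            (dataset_id, pid),
        )
    conn.commit()
    return len(unmapped)


def make_demo_db():
    """Build a tiny in-memory database with one mapped and one unmapped concept."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE parameters (id TEXT, dataset_id TEXT, concepticon_id TEXT);
        CREATE TABLE forms (id TEXT, dataset_id TEXT, parameter_id TEXT);
        INSERT INTO parameters VALUES ('p1', 'demo', '1313'), ('p2', 'demo', NULL);
        INSERT INTO forms VALUES ('f1', 'demo', 'p1'), ('f2', 'demo', 'p2');
        """
    )
    return conn
```

Calling purge_unmapped_concepts(make_demo_db(), "demo") removes the one unmapped concept and its form; the returned count is the kind of number the log lines above report.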

The internal dataset is preserved, but the column header was renamed from Glosses to Parameters:

#    Dataset              Parameters    Concepticon    Varieties    Glottocodes    Families
---  -----------------  ------------  -------------  -----------  -------------  ----------
1    abrahammonpa                304            304           26             15           2
2    allenbai                    497            496            9              9           1
3    bantubvd                    420            415           10             10           1
(...)

And the results are confirmed to be the same:

  ID A  Concept A                   ID B  Concept B                     Families    Languages    Words
------  ------------------------  ------  --------------------------  ----------  -----------  -------
  1370  MONTH                       1313  MOON                                57          320      328
  1803  WOOD                         906  TREE                                57          298      405
    72  CLAW                        1258  FINGERNAIL                          55          217      225
  3210  KNIFE (FOR EATING)          1352  KNIFE                               51          268      285
  2266  SON-IN-LAW (OF WOMAN)       2267  SON-IN-LAW (OF MAN)                 49          261      284
  1599  WORD                        1307  LANGUAGE                            49          113      118
  1297  LEG                         1301  FOOT                                48          272      289
   763  SKIN                        1204  BARK                                48          195      213
  1608  LISTEN                      1408  HEAR                                48          114      117
  2265  DAUGHTER-IN-LAW (OF MAN)    2264  DAUGHTER-IN-LAW (OF WOMAN)          47          234      261

xrotwang commented 5 years ago

Ah, sorry for breaking the tests - we'd need to pin pyglottolog, i.e. require

pyglottolog~=1.0

here https://github.com/clics/pyclics/blob/8915b2ab855f0fe996933f5a6efbfbaf5fa934aa/setup.py#L31
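
As a rough illustration of what the ~=1.0 specifier accepts, here is a simplified sketch of the PEP 440 "compatible release" rule. It handles plain dotted integer versions only and is a stand-in for illustration, not the logic pip actually runs; the function name is hypothetical.

```python
def is_compatible_release(version, spec):
    """Return True if `version` satisfies `~=spec` (simplified PEP 440).

    `~=X.Y` means: at least X.Y, and the same leading components as X.Y
    with the last one dropped. Only dotted integers are supported here;
    pre-releases, post-releases and epochs are ignored (an assumption
    made to keep the sketch short).
    """
    v = tuple(int(part) for part in version.split("."))
    s = tuple(int(part) for part in spec.split("."))
    if len(s) < 2:
        raise ValueError("~= needs at least two version components")
    # Pad the candidate with zeros so that "1" compares like "1.0".
    v = v + (0,) * (len(s) - len(v))
    prefix = s[:-1]
    return v >= s and v[: len(prefix)] == prefix
```

Under this rule, 1.0 and 1.5.2 satisfy ~=1.0, while 0.9 and 2.0 do not, which is why the pin keeps pyglottolog below 2.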

tresoldi commented 5 years ago

Thank you, @xrotwang. Passing tests now. :smiley:

xrotwang commented 5 years ago

Thinking about it, the better way to fix the tests would be to provide an acceptable directory for instantiating a pyglottolog.Glottolog object here: https://github.com/clics/pyclics/blob/8915b2ab855f0fe996933f5a6efbfbaf5fa934aa/tests/test_commands.py#L21. We'd need to add the directories references and languoids/tree.
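
A minimal sketch of such a fixture, using only the standard library, might look like this. The helper name is hypothetical, and whether these two empty directories are enough for pyglottolog to accept the path is an assumption based on the comment above, not something verified here.

```python
import tempfile
from pathlib import Path


def make_glottolog_skeleton(root):
    """Create the two subdirectories named in the comment above under `root`.

    Hypothetical helper: it only lays out an empty directory skeleton and
    does not import or call pyglottolog itself.
    """
    root = Path(root)
    for sub in ("references", "languoids/tree"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repos = make_glottolog_skeleton(tmp)
        # In a test, this path would then be handed to pyglottolog.Glottolog.
        print(sorted(p.relative_to(repos).as_posix() for p in repos.rglob("*")))
```

In a pytest suite the same layout could be built on top of the tmp_path fixture, so each test gets a fresh directory.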

Limiting to pyglottolog < 2 is an artificial restriction, because pyglottolog 2.0 is compatible "enough" for our purposes.

tresoldi commented 5 years ago

This would be similar to the tests in pylexibank, correct? Like here https://github.com/lexibank/pylexibank/blob/9e549ed0b740510c190bf26573695299045f97ec/tests/test_commands.py

We should probably open a new issue for that, however.