cldf-clts / pyclts

Apache License 2.0

modifying output of dump command #10

Closed LinguList closed 4 years ago

LinguList commented 4 years ago

This creates a new file in clts/data, in the form of a JSON dump of all graphemes found in the data with their BIPA equivalents. The file size is 1.3 MB.
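To illustrate the idea (this is a sketch, not the actual pyclts code; the file name, the helper `dump_graphemes`, and the sample data are all hypothetical), such a dump is just a grapheme-to-BIPA mapping serialized as JSON:

```python
import json
from pathlib import Path

# Illustrative data: grapheme -> BIPA equivalent.
graphemes = {
    "a": "a",
    "á": "a",  # base symbol; tone information would live in the feature set
}

def dump_graphemes(data, destination):
    """Write the grapheme mapping as a JSON dump into `destination`."""
    path = Path(destination) / "graphemes.json"
    path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf8")
    return path

out = dump_graphemes(graphemes, ".")
print(out.read_text(encoding="utf8"))
```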

xrotwang commented 4 years ago

If we want to minimize file size, then we should

LinguList commented 4 years ago

I tried `indent=None`; it only reduces the size from 1.3 MB to 1.2 MB, so less than I had hoped for. The largest aspect is the long feature names here, so zipping should have an extreme reduction effect. But is the zipfile handling in Python something that works without third-party packages, and thus amenable for linse? Then I'd really follow the zipping strategy...
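To the question of third-party dependencies: `zipfile` and `json` are both in the Python standard library, so a zipped JSON dump can be written and read back without any extra requirements. A minimal sketch (file names and data are illustrative, not what pyclts actually writes):

```python
import json
import zipfile

# Illustrative grapheme data with the long feature descriptions that
# make the uncompressed dump large.
data = {"á": "a with high tone", "ã": "nasalized a"}

# Write the JSON payload into a zip archive; ZIP_DEFLATED uses zlib,
# which ships with virtually every Python build.
with zipfile.ZipFile("dump.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("graphemes.json", json.dumps(data, ensure_ascii=False))

# The consumer side (e.g. linse) can read it back with the stdlib alone.
with zipfile.ZipFile("dump.zip") as zf:
    loaded = json.loads(zf.read("graphemes.json").decode("utf8"))

assert loaded == data
```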

LinguList commented 4 years ago

Note that I deliberately include the whole CLTS names now, since people often look at the symbol and do not see what the grapheme is in CLTS, e.g., á is an a with a high tone, etc. So I assume that at least in some situations it will be useful to include the information of the full feature set.

LinguList commented 4 years ago

Okay, I figured that to make it easier to add the dumped data to another repository, it was useful to add a destination argument. This way, one can dump it directly into linse, to update it from there.
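The shape of such a destination argument might look like this (a hypothetical sketch with `argparse`; the actual pyclts CLI uses its own command framework, and the option name here is illustrative):

```python
import argparse
from pathlib import Path

# Sketch of a "dump" command taking a destination directory, so the
# output can land directly in another repository's checkout.
parser = argparse.ArgumentParser(prog="dump")
parser.add_argument(
    "--destination", type=Path, default=Path("."),
    help="directory the dump file is written to")

args = parser.parse_args(["--destination", "/tmp"])
print(args.destination / "graphemes.json")
```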

LinguList commented 4 years ago

As to zip compression, I was careful, as I don't know if the more powerful compression methods are available everywhere. If you have preferences that can be realized without extra requirements, @xrotwang, I can look into this as well.
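On availability: the stronger zip methods depend on optional stdlib modules (`ZIP_DEFLATED` on zlib, `ZIP_BZIP2` on bz2, `ZIP_LZMA` on lzma), which can be missing from minimal Python builds. One could probe for the strongest available method at runtime; a sketch (the helper name is made up):

```python
import zipfile

def best_compression():
    """Return the strongest zip compression method the local build supports."""
    for module, method in [("lzma", zipfile.ZIP_LZMA),
                           ("bz2", zipfile.ZIP_BZIP2),
                           ("zlib", zipfile.ZIP_DEFLATED)]:
        try:
            __import__(module)
            return method
        except ImportError:
            continue
    return zipfile.ZIP_STORED  # fall back to no compression

print(best_compression())
```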

LinguList commented 4 years ago

BTW: if you agree with this "destination" and the general idea of a "dump" command, I would then do the same for pyconcepticon. The idea would be: whenever we have larger collections in reference catalogs that we want to re-use in smaller datasets, we allow "dumping" them to some destination, where they can then be reused. But first, I'll check how well I can use the zipped dump from within linse to write the profiles.

xrotwang commented 4 years ago

ok, I will have a look at this PR now.

LinguList commented 4 years ago

stupid spelling error, sorry...

codecov-io commented 4 years ago

Codecov Report

Merging #10 into master will increase coverage by 1.21%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #10      +/-   ##
==========================================
+ Coverage   94.67%   95.89%   +1.21%     
==========================================
  Files          26       26              
  Lines        1071     1072       +1     
==========================================
+ Hits         1014     1028      +14     
+ Misses         57       44      -13     
Impacted Files Coverage Δ
src/pyclts/__main__.py 100.00% <ø> (+22.85%) :arrow_up:
src/pyclts/commands/dump.py 100.00% <100.00%> (+6.94%) :arrow_up:
tests/test_cli.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4842f1f...f4566c0.

xrotwang commented 4 years ago

Looks good, I think. I'd still like to see full coverage of src/pyclts/transcriptionsystem.py in tests - but that's unrelated to the issue here.

LinguList commented 4 years ago

Yes, I was just checking the code, and saw that this is still a bit messy.

An idea occurred to me: the normalization commands for approximate IPA with typical mistakes, like ":" being used as the length marker, maybe we should add them to linse rather than to clts? We could hard-code these aliases in a dictionary, and we'd assume (in most cases) that these are the same rules for all transcription systems.

That would help reduce the number of different files in clts a bit, and it would make the functions available from within linse, where they may come in quite handy.
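A hard-coded alias dictionary of that kind could be as simple as the following sketch (the entries are illustrative examples of typical approximate-IPA mistakes, not the actual contents of normalize.tsv):

```python
# Illustrative alias table: ASCII lookalikes mapped to the intended
# IPA characters, assumed to hold across transcription systems.
ALIASES = {
    ":": "ː",   # ASCII colon used as the length marker
    "'": "ʼ",   # apostrophe used as the ejective marker
    "!": "ǃ",   # exclamation mark used for the (post)alveolar click
}

def normalize(sequence):
    """Replace known alias characters one by one."""
    return "".join(ALIASES.get(ch, ch) for ch in sequence)

print(normalize("a:"))  # "aː"
```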

LinguList commented 4 years ago

The normalize.tsv currently has about 50 entries in clts.bipa.

xrotwang commented 4 years ago

Hm. I'm not sure I'd like moving this part of the complexity. I'd find it more transparent if linse was purely about data structures. I guess there's no way around some complexity here, even complexity of the "cross-cutting" type, some of which may simply be inherited as a subset of Unicode.

The best we can do is probably to package stuff like normalization into functions that are useful for end users, and at some point require users of pyclts to do normalization on their end before calling other pyclts functionality.

xrotwang commented 4 years ago

Basically, I'd prefer to concentrate the messiness of transcriptions in pyclts and make other packages work on "pure" BIPA, blissfully ignoring the mess.

LinguList commented 4 years ago

Yep, and normalization CAN also be dataset-specific. We still have the normalization bits in linse now, as we need them to get at least good results for mainstream IPA, but this is a list of 50 elements that does not even need much curation. So agreed: we leave things as they are and have a workable solution now for linse, etc. to use clts without the pyclts/clts complexity.