If we want to minimize file size, then we should use `tuple`s rather than `dict`s, and not pretty-print the JSON (`indent=None`).

I tried `indent=None`; it only reduces the file size from 1.3 MB to 1.2 MB, so less than I had hoped for. The largest part are the long feature names here, so zipping should have an extreme reduction effect. But is zipfile handling in Python something that works without third parties, and thus amenable for linse? If so, I'd really follow the zipping strategy...
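For what it's worth, Python's `zipfile` module is part of the standard library, and deflate compression only relies on the built-in `zlib` module, so no third-party packages are needed. A minimal sketch (file and function names here are made up for illustration, not taken from the PR):

```python
import json
import zipfile

def dump_graphemes(graphemes, path="graphemes.json.zip"):
    # indent=None plus compact separators minimizes the raw JSON size
    text = json.dumps(graphemes, indent=None, separators=(",", ":"))
    # zipfile and ZIP_DEFLATED are standard library; the deflate codec
    # is backed by the built-in zlib module
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("graphemes.json", text)
```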
Note that I deliberately include the full CLTS names now, since people often look at the symbol and do not see what the grapheme is in CLTS, e.g., that á is an a with a high tone, etc. So I assume that at least in some situations it will be useful to include the information of the full feature set.
Okay, I figured that, to make it easier to add the dumped data to another repository, it was useful to add a destination argument. This way, one can dump the data directly into linse, to update it from there.
As to zip compression, I was careful, as I don't know if the more powerful compression methods are available everywhere. If you have preferences that can be realized without extra requirements, @xrotwang, I can look into this as well.
BTW: if you agree with this "destination" argument and the general idea of a "dump" command, I would then do the same for pyconcepticon. The idea would be: whenever we have larger collections in reference catalogs which we want to re-use in smaller datasets, we allow them to be "dumped" to some destination, where they can then be reused. But first, I'll check how well I can use the zipped dump from within linse to write the profiles; a sketch of reading it back follows below.
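Reading the zipped dump back is equally dependency-free; a sketch under the same assumed file layout as above:

```python
import json
import zipfile

def load_graphemes(path="graphemes.json.zip"):
    # pull the compact JSON straight out of the archive, stdlib only
    with zipfile.ZipFile(path) as zf:
        return json.loads(zf.read("graphemes.json"))
```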
ok, I will have a look at this PR now.
stupid spelling error, sorry...
Merging #10 into master will increase coverage by 1.21%. The diff coverage is 100.00%.

```
@@            Coverage Diff             @@
##           master      #10      +/-   ##
==========================================
+ Coverage   94.67%   95.89%    +1.21%
==========================================
  Files          26       26
  Lines        1071     1072        +1
==========================================
+ Hits         1014     1028       +14
+ Misses         57       44       -13
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/pyclts/__main__.py | 100.00% <ø> (+22.85%) | :arrow_up: |
| src/pyclts/commands/dump.py | 100.00% <100.00%> (+6.94%) | :arrow_up: |
| tests/test_cli.py | 100.00% <100.00%> (ø) | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Looks good, I think. I'd still like to see full coverage of `src/pyclts/transcriptionsystem.py` in tests - but that's unrelated to the issue here.
Yes, I was just checking the code, and saw that this is still a bit messy.
An idea occurred to me: maybe we should add the normalization commands for approximate IPA (handling typical mistakes, like ":" being used as the length marker) to linse rather than to clts? We could hard-code these aliases in a dictionary, assuming that in most cases these are the same rules for all transcription systems.
That would help to reduce the number of different files in clts a bit, and it would make these functions available from within linse, where they may come in quite handy.
The `normalize.tsv` for `clts.bipa` currently has about 50 entries.
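As a rough sketch of what such a hard-coded alias dictionary could look like (the two entries below are only illustrative, not the actual rules from `normalize.tsv`):

```python
# Illustrative alias entries; the real normalize.tsv has about 50 rules.
NORMALIZE = {
    ":": "ː",  # plain colon misused as the IPA length marker
    "g": "ɡ",  # Latin "g" instead of IPA script g (U+0261)
}

def normalize(sequence):
    # replace aliased characters, leave everything else untouched
    return "".join(NORMALIZE.get(char, char) for char in sequence)
```

E.g., `normalize("ga:")` would yield `"ɡaː"`.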
Hm. I'm not sure I'd like moving this part of the complexity. I'd find it more transparent if `linse` were purely about the data structure. I guess there's no way around some complexity - even complexity of the "cross-cutting" type - here, some of which may simply be inherited as a subset of Unicode.
The best we can do is probably to package stuff like "normalization" into functions that are useful for end users, and at some point require users of `pyclts` to do normalization on their end, before calling other `pyclts` functionality.
Basically, I'd prefer to concentrate the messiness of transcriptions in `pyclts` and make other packages work on "pure" BIPA, blissfully ignoring the mess.
Yep, and normalization CAN also be dataset-specific. We still have the normalization bits in linse now, as we need them to get at least good results for mainstream IPA, but this is a list of 50 elements that does not even need much curation. So agreed: we leave things as they are, and we now have a workable solution for linse etc. to use clts without the pyclts/clts complexity.
This creates a new file in `clts/data`, in the form of a JSON dump of all graphemes found in the data with their BIPA equivalents. File size is 1.3 MB.