cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Change orthography profile format default to "csv" #15

Closed LinguList closed 7 years ago

LinguList commented 7 years ago

The "prf" ending creates inconveniencies for unexperienced users, as they usually do not know how to quickly open them in spreadsheet software. Furthermore, on github, they are displayed as pure text, rather than the table-form which would make it easier to discuss, comment, and inspect online. Using "tsv", as "\t" is the default separator, seems like the best solution with github as a sharing platform in mind, and usually, "tsv" will be opened automatically in spreadsheet software on the major platforms. All that needs to be done then is to add some tutorial that illustrates to non-programmers how to save the files in text form and utf8.

bambooforest commented 7 years ago

Sounds good. I prefer transparent file naming, so I'm OK with TSV or CSV, since that's the file format for profiles. And thanks for this suggestion!

I'm having a hard time keeping up with all the work being done on lexibank -- are you using the segments package at all over there? If so, I can give this the highest priority. I'm working on a couple of case studies (Jupyter notebooks to go with the book) at the moment, so I could crank out the changes ASAP if that's helpful.

xrotwang commented 7 years ago

Yes, the tokenizer of segments is used by default when an orthography profile is defined for a dataset. As far as I can tell, the segments package is completely agnostic with regard to the filename extension of profiles; see:

$ grep -r "prf" segments | grep -v pyc
segments/tests/test_tokenizer.py:        self.t = Tokenizer(_test_path('test.prf'))
segments/tests/test_tokenizer.py:        t = Tokenizer(_test_path('test.prf'), errors_replace=lambda c: '<{0}>'.format(c))
segments/tests/test_tokenizer.py:        t = Tokenizer(_test_path('test.prf'), errors_replace=lambda c: '?')
segments/tests/test_cli.py:            Mock(args=[os.path.join(os.path.dirname(__file__), 'test.prf')],

So I guess the only question is whether the extension is part of the profile specification in the book. In any case, I think this issue can be closed.
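
For illustration, a minimal sketch (file names are hypothetical; the constructor just takes a path and never inspects the extension):

from segments import Tokenizer

# Both calls behave identically if the two files have the same contents;
# only the path is used, the extension is irrelevant to segments.
t1 = Tokenizer('orthography.prf')
t2 = Tokenizer('orthography.tsv')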

bambooforest commented 7 years ago

The book simply says:

A1. A profile is a Unicode UTF-8 encoded text file, ideally using NFC, no BOM, and LF (see Section 2.11), that includes the information pertinent to the orthography.

A2. A profile is a tab-separated CSV file with an obligatory header line

So, yeah, it doesn't matter what the file is named (but I can go ahead and update the .prf files and the calls to them in segments to be a bit more transparent, and so that they will display nicely on GitHub).

xrotwang commented 7 years ago

Ok. Thanks for the info! So allowing comment lines in profiles is a feature (or a bug? :) ) of segments. Maybe this should be documented somewhere - or this should be added to the book?

LinguList commented 7 years ago

As of now, I think the consistent use of csv and the use of columns to add comments, etc., would even allow us to get rid of comment lines in the profile, making it also more convenient to edit profiles in spreadsheet software. One may think of restricting the conversion procedure in segments (as we currently convert to IPA and other things) so that people don't convert to their comment columns (but on the other hand: who is going to do that? And I enjoy the flexibility).

bambooforest commented 7 years ago

Are you using comments in profiles in lexibank? We wrote here:

A4. Separate lines with comments are not allowed. Comments that belong to specific lines will have to be put in a separate column of the CSV file, e.g. add a column called comments.

and that we put metadata in a separate file:

A3. Metadata are added in a separate UTF-8 text file using a basic tag: value format. Metadata about the orthographic description given in the orthography profile includes, minimally, (i) author, (ii) date, (iii) title of the profile, (iv) a stable language identifier encoded in BCP 47/ISO 639-3 of the target language of the profile, and (v) bibliographic data for resource(s) that illustrate the orthography described in the profile. Further, (vi) the tokenization method and (vii) the unicode normalisation used should be documented here (see below).

So, yes, my own test profiles are indeed bad because they include comments! Nice catch.

In the book the formal specification is on pages 82--86. Those might need to be updated, but right now yes, no comments.

(And you can tell Michael wrote this bit because he uses British spelling. Will have to fix that.)

xrotwang commented 7 years ago

Good! I think this is a clear case: @LinguList sorry, no flexibility! :) So, yes, let's kick out the comment handling in segments! Does this go for the rules file as well?

xrotwang commented 7 years ago

But maybe we should add a comment column when creating an initial profile by running segments profile?

xrotwang commented 7 years ago

@bambooforest and maybe the book should add glottocodes as valid language identifiers :)

LinguList commented 7 years ago

But one thing that NEEDS to be supported is to allow me to add a user-defined column for conversion (I'm fine with having to specify it in metadata), so I want to have different layers, like "IPA", "CLPA", whatever, you see? Fixing the "comment" to be always COMMENT is fine, and I agree there's no need for inline comments.

xrotwang commented 7 years ago

@LinguList But one can have any number of columns in orthography profiles already.

bambooforest commented 7 years ago

Yes, we can kick out comment handling.

A5. A minimal profile consists of a single column with a header called Grapheme, listing each of the different graphemes in a separate line. The name of this column is crucial for automatic processing.

That's the bare minimum. Michael's version in R makes use of Left and Right context columns:

A6. Optional columns can be used to specify the left and right context of the grapheme, to be designated with the headers Left and Right, respectively. The same grapheme can occur multiple times with different contextual specifications, for example to distinguish different pronunciations depending on the context.

For him this was particularly important for processing Dutch orthography. He went as far as:

A7. The columns Grapheme, Left and Right can use regular expression metacharacters. If regular expressions are used, then all literal usage of the special symbols, like full stops <.> or dollar signs <$> (so-called metacharacters), have to be explicitly escaped by adding a backslash before them (i.e. use <\.> or <\$>). Note that any specification of context automatically expects regular expressions, so it is probably better to always escape all regular expression metacharacters when used literally in the orthography. The following symbols will need to be preceded by a backslash: [](){}|+*.-!?^$ and the backslash \ itself.

It's optional because we didn't agree on this approach and I never implemented it in segments. Something else optional -- and I think this fits in with @LinguList 's IPA, CLPA, etc:

A8. An optional column can be used to specify classes of graphemes, to be identified by the header Class. For example, this column can be used to define a class of vowels. Users can simply add ad-hoc identifiers in this column to indicate a group of graphemes, which can then be used in the description of the graphemes or the context.

Again, you can add whatever you want:

A10. Any other columns can be added freely, but will mostly be ignored by any software application using the profiles.
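
To make A5-A8 and A10 concrete, here is a sketch that writes a small, entirely made-up profile with the obligatory Grapheme column, the optional Left/Right/Class columns, and a user-defined IPA column (all values hypothetical):

# Tab-separated per A2, UTF-8 per A1; metacharacters would need escaping per A7.
rows = [
    ('Grapheme', 'Left', 'Right', 'Class', 'IPA'),
    ('a',        '',     '',      'V',     'a'),
    ('sch',      '',     '',      '',      'ʃ'),
    ('n',        'V',    '',      '',      'n'),  # context-sensitive entry (A6)
]
with open('sketch.tsv', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write('\t'.join(row) + '\n')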

bambooforest commented 7 years ago

We do state that:

B1. Each line of a profile will be interpreted as a regular expression. Software applications using profiles can also offer to interpret a profile in the literal sense to avoid the necessity for the user to escape regular expression metacharacters in the profile. However, this only is possible when no contexts or classes are described, so this seems only useful in the most basic orthographies.

And this part then also gets hung up on the Left, Right context processing needed for the trickier cases of orthographic parsing (as in Dutch).

B2. The class column will be used to produce explicit or-chains of regular expressions, which will then be inserted in the Grapheme, Left and Right columns at the position indicated by the class-identifiers. For example, a class called V as a context specification might be replaced by a regular expression like: (a|e|i|o|u|ei|au). Only the graphemes themselves are included here, not any contexts specified for the elements of the class. Note that in some cases the ordering inside this regular expression might be crucial.
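
A rough sketch of the expansion B2 describes (names and data hypothetical; the naive string replacement is for illustration only):

classes = {'V': ['a', 'e', 'i', 'o', 'u', 'ei', 'au']}

def expand(spec, classes):
    # Replace each class identifier with an or-chain of its graphemes,
    # longest first, since the ordering inside the alternation can be crucial.
    for name, graphemes in classes.items():
        chain = '({0})'.format('|'.join(sorted(graphemes, key=len, reverse=True)))
        spec = spec.replace(name, chain)
    return spec

print(expand('V', classes))  # -> (ei|au|a|e|i|o|u)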

LinguList commented 7 years ago

@xrotwang, yes, but I wanted to make sure this stays like that ;-)

xrotwang commented 7 years ago

So that's another bug in segments: Currently we identify the Grapheme column by checking

if not self.column_labels and tokens[0].lower().startswith("graphemes"):
    ...

when we should check == 'Grapheme' for any of the columns, right?
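
A possible fix as a sketch (using a naive tab split just for illustration; see the CSV-parsing caveat later in this thread):

# 'line' stands for the header line of the profile (hypothetical context).
header = line.rstrip('\n').split('\t')
if 'Grapheme' not in header:
    raise ValueError('profile is missing the obligatory Grapheme column')
# The column may appear at any position, not just position 0.
grapheme_index = header.index('Grapheme')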

bambooforest commented 7 years ago

@xrotwang -- indeed this is a bug. We aren't explicit about it being first (I think that was Mattis's suggestion at one point and I just internalized it).

xrotwang commented 7 years ago

Ok. I'll put together a PR getting the code in line with the spec. We can't give any URL for the spec yet, can we?

bambooforest commented 7 years ago

Note also some issues that came up in the long thread on lexibank, including errors/leftover characters and separators. We write (a sketch of how segments handles this follows the C-items below):

B9. Leftover characters, i.e. characters that are not matched by the profile, should be reported to the user as errors. Typically, the unmatched characters are replaced in the tokenization by a user-specified symbol-string.

Any software application offering to use orthography profiles:

  1. should offer user-options to specify:

C1. the name of the column to be used for transliteration (if any).

C2. the symbol-string to be inserted between graphemes. Optionally, a warning might be given if the chosen string includes characters from the orthography itself.

C3. the symbol-string to be inserted for unmatched strings in the tokenized and transliterated output.

C4. the tokenization method, i.e. whether the tokenization should proceed global or linear (see B6 above).

C5. unicode normalization, i.e. whether the text-string and profile should use NFC or NFD.
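
For B9/C3, segments already exposes a hook, as the tests quoted earlier show; a minimal sketch (the profile path is hypothetical):

from segments import Tokenizer

# Unmatched characters are passed to errors_replace, which returns the
# user-specified symbol-string to insert into the tokenized output (C3).
t = Tokenizer('orthography.tsv', errors_replace=lambda c: '<{0}>'.format(c))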

LinguList commented 7 years ago

@xrotwang -- indeed this is a bug. We aren't explicit about it being first (I think that was Mattis's suggestion at one point and I just internalized it).

Yep, but it's useful to have this structure, as it is the source graphemes you are working with, right? Like the ID, the thing you check for uniqueness, so keeping it in the first column is useful.

bambooforest commented 7 years ago

I was planning on putting the book, use cases, and specs here:

https://github.com/unicode-cookbook

xrotwang commented 7 years ago

@LinguList I guess keeping it in the first column is useful, but I wouldn't force this. Since arbitrary other columns are allowed, it could easily be the case that one of these is even more important for a particular use case. E.g. a Category column, maybe with values like typo, which would signal to the linguist editing the profile: "don't spend time on this weird grapheme" ...

LinguList commented 7 years ago

I see and agree.

xrotwang commented 7 years ago

@bambooforest Considering that arbitrary additional columns are allowed and that the format is specified as CSV also means we cannot parse the profile naively using line.split('\t'), because it may contain quoted cells with tabs, etc. Since Python's csv module differs notably between 2.7 and 3.4, I think we need to use a portable csv reader library. My go-to library is clldutils :) This would add some unnecessary dependencies, but I think overall it would be worth it (we'd also get object-oriented path handling and could throw out the docopt dependency). Ok with this?
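
A sketch of what that parsing could look like with clldutils (assuming its dsv.reader helper; the file name is hypothetical):

from clldutils.dsv import reader

# A proper csv reader handles quoted cells that contain tabs, which a
# naive line.split('\t') would silently break apart.
for row in reader('orthography.tsv', delimiter='\t', dicts=True):
    print(row['Grapheme'])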

bambooforest commented 7 years ago

sgtm. we're the only ones using the library and cli, so let requests come if anyone else starts using it. :)

xrotwang commented 7 years ago

I think this can be closed now.