DM2E / dm2e-mappings


{W131} String not in Unicode Normal Form C: "XV. Tsuṅ-hau." #75

Closed kba closed 10 years ago

kba commented 10 years ago

Validating DTA yields lots of these warnings. I'm no expert at Unicode but I remember that this has to do with combining characters. I'm not even sure if that is a problem for our triplestore but maybe @knepper can say something about that, as he had the same issue earlier.

ksdm2e commented 10 years ago

could you give an example?

ksdm2e commented 10 years ago

I have an example: the above mentioned line is from berg-ostasien04, e.g.

http://www.deutschestextarchiv.de/book/view/berg_ostasien04_1873/?hl=Tsun%CC%87-hau&p=55

Here is an explanation:

https://groups.google.com/forum/#!topic/topbraid-users/lxTn8NwiSaA

knepper commented 10 years ago

Short explanation: in Unicode a common letter with a diacritic can be encoded in two ways. 1. A composed form, which is a single code point for the whole thing. 2. A decomposed form, which is the plain letter's code point followed by one code point per diacritic. The second form is more flexible; the first is limited to a (huge) set of letter+diacritic combinations. Both are valid UTF-8. In NFC the 1st form is used where it exists and the 2nd otherwise. In NFD there is only the 2nd form. Note: only NFD is "pure", NFC is a mixture. Browsers can reliably handle only NFC, with exceptions in some versions (why?). For this reason NFC is the more common choice. Normalization (either NFC or NFD) is important for string comparison (searching), because a decomposed string never matches its composed form.

In this online example I find a 6e ("n") followed by cc 87 (COMBINING DOT ABOVE) in the word "Tsuṅ-luen". The composed form is e1 b9 85 (LATIN SMALL LETTER N WITH DOT ABOVE). The warning is correct: 6e cc 87 is NFD and e1 b9 85 is NFC. (And my browser displays the dot beside the n instead of above it...)

For proper browser display the string should be NFC-normalized, and care should be taken with the search function.
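For reference, the byte sequences above can be reproduced with Python's standard `unicodedata` module (a standalone sketch, not part of the mapping scripts):

```python
import unicodedata

# The word from the example page, built in decomposed (NFD) form:
# "n" (0x6e) followed by COMBINING DOT ABOVE (U+0307, UTF-8: cc 87).
nfd = "Tsun\u0307-hau"
nfc = unicodedata.normalize("NFC", nfd)

print(nfd.encode("utf-8").hex(" "))  # contains ... 6e cc 87 ...
print(nfc.encode("utf-8").hex(" "))  # contains ... e1 b9 85 ... instead

# NFC composes the pair into U+1E45 LATIN SMALL LETTER N WITH DOT ABOVE,
# so a plain string comparison between the two forms fails.
assert "\u1e45" in nfc
assert nfd != nfc
```

This is exactly why un-normalized data breaks searching: both strings render as "Tsuṅ-hau", but byte-wise they are different.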

kba commented 10 years ago

Interesting, thank you for the explanation.

@ksdm2e Can you normalize the literals in your script? FunctX has a function for just this: http://www.xsltfunctions.com/xsl/fn_normalize-unicode.html

ksdm2e commented 10 years ago

Thanks @knepper. The cause is clear now.

The question is whether it should be handled or ignored. What did you do?

I would ignore it in this first run, because it's only a warning about an old, well-known problem (combining characters vs. a single precomposed character). Modern search and indexing engines should handle this properly by now. And the data provider decided on the combining-character solution and also uses it in its presentation platform.

kba commented 10 years ago

I guess our search engine's indexing component can handle both forms, and probably even prefers NFD, because it's easier to index diacritics if the components are known.

But I'd prioritize that the output can be displayed in common browsers, so +1 for handling this.

ksdm2e commented 10 years ago

Finally: from a second editor's point of view, it's a good question whether one is allowed to change the first editor's decisions.

@kba Thanks. It's also specified in http://www.w3.org/TR/xpath-functions/#func-normalize-unicode

I'll think about it ...

ksdm2e commented 10 years ago

If everybody agrees to normalize such strings, it might be a candidate for a recommendation.

knepper commented 10 years ago

I always do an NFC normalization prior to further processing (it might also be useful to remove cd 8f (U+034F COMBINING GRAPHEME JOINER) where it separates a cc 88 (diaeresis) from its base letter, if you don't want special diaeresis treatment):

  1. to avoid trouble with searching, e.g. Solr doesn't match if the forms are different and browser input is NFC;
  2. to avoid trouble with display in browsers. In this case it could also fix the display issue I mentioned on the provider's presentation site.

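A minimal sketch of that preprocessing step in Python (the function name `to_nfc` and the CGJ-stripping option are illustrative, not from any existing script; cd 8f is the UTF-8 encoding of U+034F):

```python
import unicodedata

def to_nfc(text: str, strip_cgj: bool = True) -> str:
    """NFC-normalize text; optionally drop U+034F (COMBINING GRAPHEME
    JOINER, UTF-8 cd 8f), which otherwise keeps a following combining
    diaeresis from composing with its base letter."""
    if strip_cgj:
        text = text.replace("\u034f", "")
    return unicodedata.normalize("NFC", text)

# "a" + CGJ + COMBINING DIAERESIS stays decomposed under plain NFC,
# because the grapheme joiner blocks composition ...
raw = "a\u034f\u0308"
assert unicodedata.normalize("NFC", raw) == raw
# ... but composes to "ä" (U+00E4) once the joiner is removed.
assert to_nfc(raw) == "\u00e4"
```

Whether to strip the joiner depends on whether the data provider uses it deliberately for special diaeresis treatment, so it should stay an explicit option.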
kba commented 10 years ago

We could probably do this as part of the ingestion but that would intrude into the data provider's editing decisions and possibly raise hard-to-debug issues if there are problems.

If it were part of the specs that we only accept NFC-normalized Unicode, a check could be integrated into the validator to at least notify data mappers that there might be a problem.
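Such a validator check is a one-liner in Python 3.8+; a sketch (the function name and warning wording are hypothetical, loosely modeled on the {W131} message):

```python
import unicodedata

def check_nfc(literal: str):
    """Return a warning message if the literal is not in NFC, else None.
    Notifies the data mapper without modifying the provider's data."""
    if unicodedata.is_normalized("NFC", literal):  # Python 3.8+
        return None
    return 'String not in Unicode Normal Form C: "%s"' % literal

# Composed form passes; decomposed form triggers the warning.
assert check_nfc("Tsu\u1e45-hau") is None
assert check_nfc("Tsun\u0307-hau") is not None
```

This keeps the decision with the data provider: the literal is only flagged, never rewritten during ingestion.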

ksdm2e commented 10 years ago

Thanks @knepper. You gave me the right pointer :D

I think we shouldn't "fix the display issue", because it is the data provider's choice. We only ship bytes to the database. I wouldn't make it a recommendation, but a warning is welcome in this case (because it might cause unexpected results).

d0rg0ld commented 10 years ago

Hi there, I will include the NFC normalization in the recommendations!

kba commented 10 years ago

The validator will warn about non-NFC Unicode strings; the specs recommend NFC.