diacritics / database

A reliable diacritics database with their associated ASCII characters
MIT License

Concept & Specification #1

Closed julkue closed 7 years ago

julkue commented 8 years ago

The meaning behind this repository is to collect diacritics with their associated ASCII characters in a structured form. It should be the central place for various projects when it comes to diacritics mapping.

As there is no single, trustworthy, and complete source, all information needs to be collected manually by users.

Example mapping:

Schön => Schoen
Schoen => Schön

User Requirements

Someone using diacritics mapping information.

It should be possible to:

  1. Output diacritics mapping information in a CLI and web interface
  2. Output diacritics mapping information for various languages, e.g. a JavaScript array/object
  3. Fetch diacritics mapping information in builds
  4. Filter diacritics mapping information by:
    • Diacritic
    • Mapping value
    • Language
    • Continent
    • Alphabet (e.g. Latin)

Contributor Requirements

Someone providing diacritics mapping information.

Assuming every contributor has a GitHub account and is familiar with Git.

Providing information should be:

  1. Easy to collect
  2. Possible without manual invitations
  3. Possible without registration (an exception is: "Register with GitHub")
  4. Done at one place
  5. Easy to check the information structure for correctness
  6. Checkable before acceptance by another contributor familiar with the language
  7. Possible without a Git clone

System Specification

There are two ways of realization:

  1. Create a JSON database in this GitHub repository, as this fits user and contributor requirements.
  2. Create a database in a third-party service that fits the user and contributor requirements.

    Tested:

    • Transifex: Doesn't fit requirements. It would allow providing mapping information, but not metadata.
    • Contentful: Doesn't fit requirements. It would require a manual invitation and registration.

Because we're not aware of further third-party services that could fit the user and contributor requirements, we'll proceed with the first option.

System Requirements

See the documentation and pull request.

Build & Distribution

Build

According to the contributor requirements it should be possible to compile the source files without requiring a Git clone. This means we can't require users to run e.g. $ grunt dist at the end, since that would require cloning, installing dependencies and running things locally. What we'll do instead is implement a build bot that runs our build on Travis CI and commits the changes directly to a dist branch in this repository. Once something is merged or committed, the dist branch will be updated automatically. Some people are already doing this to update their gh-pages branch when something changes in the master branch (e.g. this script).
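
A minimal sketch of what the bot's commit step could look like – assuming a Travis job with a GH_TOKEN environment variable that has push access; the script name, committer identity and exact commands are illustrative, not a final setup:

// deploy-dist.js – hypothetical helper run on Travis CI after a successful build
const { execSync } = require('child_process');
const run = (cmd) => execSync(cmd, { stdio: 'inherit' });

// Only deploy builds of the master branch, never pull request builds
if (process.env.TRAVIS_PULL_REQUEST !== 'false' || process.env.TRAVIS_BRANCH !== 'master') {
  console.log('Skipping dist deployment');
  process.exit(0);
}

run('git config user.name "diacritics-build-bot"');
run('git config user.email "build-bot@example.com"');
run('git checkout -B dist');
run('git add -f dist/diacritics.json');
// "[skip ci]" prevents the bot commit from triggering another build
run('git commit -m "Update dist [skip ci]" || true');
run(`git push -f https://${process.env.GH_TOKEN}@github.com/diacritics/database.git dist`);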

Since we'll use a server-side component to filter and serve actual mapping information we just need to generate one diacritics.json file containing all data.

To make parsing easier and to encode diacritics as Unicode escape sequences in production, we're going to need a build step that minifies the files and encodes the diacritics. This should be done using Grunt.
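
A rough sketch of such a task – assuming plain JSON sources under src/<language>/ and a single dist/diacritics.json output; paths and the task name are assumptions:

// Gruntfile.js – sketch of the dist task: merge the per-language sources into one
// file, minify it and escape every non-ASCII character as a \uXXXX sequence
module.exports = function (grunt) {
  grunt.registerTask('dist', function () {
    const merged = {};
    grunt.file.expand('src/*/*.json').forEach((file) => {
      const lang = file.split('/')[1];          // e.g. "de" from src/de/de.json
      merged[lang] = grunt.file.readJSON(file); // throws if the JSON is invalid
    });
    // JSON.stringify without an indent argument already minifies the output
    const escaped = JSON.stringify(merged).replace(/[\u0080-\uffff]/g, (ch) =>
      '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0'));
    grunt.file.write('dist/diacritics.json', escaped);
  });
};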

Integrity

In order to ensure integrity and consistency we need the following in our build process:

Distribution

To provide diacritics mapping according to the User Requirements it's necessary to run a custom server-side component that makes it possible to sort, limit and filter the information and output it in different ways (e.g. as a JS object or array). This component should be realized using Node.js, as it's well suited to handling JS/JSON files, whereas PHP would cause a lot more serializing/deserializing.
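
A minimal sketch of such a component, using only Node's built-in http module; the query parameters and response shape are assumptions, not a final API:

// server.js – hypothetical filtering service on top of the generated diacritics.json
const http = require('http');
const database = require('./dist/diacritics.json');

http.createServer((req, res) => {
  const { searchParams } = new URL(req.url, 'http://localhost');
  const language = searchParams.get('language');
  // Filter by language if requested; more filters (diacritic, alphabet, continent)
  // would be chained here in the same way
  const result = language ? { [language]: database[language] } : database;
  res.setHeader('Content-Type', 'application/json');
  res.end(JSON.stringify(result));
}).listen(8080);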

Next Steps


This comment is updated continuously during the discussion

julkue commented 8 years ago

@Mottie Do you know any other third-party services worth mentioning? Do you agree with the specified requirements or do you have any other ideas or concerns to share?

Mottie commented 8 years ago

I don't know of any other third-party services, but I'm sure there are more. I'll keep an eye out.

I like what you have so far. I do have a few points I wanted to add:

Also, I would love to hear what ideas @mathiasbynens might have on this subject.

julkue commented 8 years ago

Thanks for sharing your ideas, @Mottie.

The file names should also include the territory, just as the CLDR database is set up. For example, the German language should include all of these files [...]

Why do you think this is necessary? In the case of German there aren't any differences between e.g. the Austrian and Swiss dialects.

Normalization of the code is still important as a diacritic can be represented in more than one way

What would be your solution approach here?

Collation rules might help with diacritic mapping for each language

How would you integrate these rules into the creation process of diacritics mapping?

Btw: As a collaborator you're allowed to update the specifications too.

Mottie commented 8 years ago

Why do you think this is necessary? In the case of German there aren't any differences between e.g. the Austrian and Swiss dialects.

It's more of a "just-in-case there are differences" matter. I'm not saying duplicate the file for all territories.

What would be your solution approach here?

Well I don't think we'd need to do the normalization ourselves, but we would need to cover all the bases... maybe? If I use the example from this page for the latin capital letter A with a ring above, the data section would need to look something like this:

"data":{
  // Latin capital letter A with ring above (U+00C5)
  "Å":{
    "mapping":"A",
    "equivalents" : [
      "Å", // Angstrom sign (U+212B)
      "Å", // A (U+0041) + Combining Ring above (U+030A)

      // maybe include the key that wraps this object as well?
      "Å" // Latin capital letter A with ring above (U+00C5)
    ]
  }
}

Btw: As a collaborator you're allowed to update the specifications too.

I know I am, but we're still discussing the layout :wink:

julkue commented 8 years ago

It's more of a "just-in-case there are differences" matter. I'm not saying duplicate the file for all territories.

That makes sense. Adding additional language variants would be optional. I've added this to the SysReq.

Well I don't think we'd need to do the normalization ourselves, but we would need to cover all the bases

Good catch! I didn't know what you meant here at first glance, because I'm not familiar with any language that has such equivalents. Added this to the SysReq too.

Mottie commented 8 years ago

I've updated the spec example. A combining diaeresis can be used in combination with just about any letter. It's not based on the language, it's a unicode method to create characters that visibly appear as the same character.

There is another useful site I found, but it's not working for me at the moment - http://shapecatcher.com/

andrewplummer commented 8 years ago

Hello,

The Sugar diacritics were based purely on my own research with a healthy dose of consulting with my European friends/followers. The goal was simply to provide an "80% good" sort order for most languages compared to the default Unicode code point sort in raw Javascript. It's not intended to be anything close to a complete collation algorithm, let alone an authority on diacritic mappings.

I definitely agree that there is a need for this. For my part, I would probably not add it as a dependency to Sugar but instead make sure that a project that adds it could quickly set up the sorting to work with it.

Thanks for working on this!

julkue commented 8 years ago

@Mottie

A combining diaeresis can be used in combination with just about any letter. It's not based on the language, it's a unicode method to create characters that visibly appear as the same character.

Thanks, just learned something new. I think adding those equivalents will be up to the authors most of the time, as users probably don't know much about them.

I've invited the author of shapecatcher to participate in this discussion.

@andrewplummer Thank you for participating! 👍

I would probably not add it as a dependency to Sugar but instead make sure that a project that adds it could quickly set up the sorting to work with it.

I see two ways of distribution:

Would the latter be something you'd be interested in using? I'm asking because it's important to know whether that would be a way library authors could imagine integrating this. If not, what would be your preferred way? @Mottie What do you think about this distribution?

Mottie commented 8 years ago

Thanks @andrewplummer! you :rocket:!

@julmot

Mottie commented 8 years ago

Next question. To allow adding comments to the source files, would you prefer to make them:

julkue commented 8 years ago

@Mottie

Yes, distribution by bower and npm are pretty much given. As long as we provide optimized data (e.g. based on the diacritic, language, etc.) I think we'll be fine. I'm sure the users will let us know if we need to add more.

When I was talking about distribution using Bower I didn't mean distributing the actual data. I meant a build helper that then fetches the data from this repository. This way we can have a specific version of our build helper, but our users will always get the latest diacritics mapping information. I see a few ways here:

While I personally don't like the idea of creating a server-side component, I also see that there would otherwise be many file variants. If we opt for the former, we'd need to specify a good dist structure to make things easy to find.

What do you think?

I'm not sure that "continent" is needed in the data, or what should be done if the language isn't associated with one, e.g. Esperanto. Would "null" be appropriate?

No, it's not needed but would be nice to have. Imagine a customer distributing an application to a continent, e.g. Europe. Then it wouldn't be necessary to select every European country manually just to include all the mapping information. If a country is associated with multiple continents, like Russia, we'd need to specify them inside an array. I don't know of any accepted language that isn't associated with a country. Esperanto seems like an idea of hippies, I'd vote for just ignoring it as there'll probably be no significant demand. But if we include it, I'd just add every continent inside an array, as it's globally available.

I think adding "native" (or equivalent) to the metadata would also be beneficial

Great idea. It would then be possible to select country-specific diacritic mapping information by the native language spelling. But it would be another variant to consider in the distribution (see above).

Mapping should be provided with the character with the accent removed and decomposed. If you look at this section on dealing with umlauts, you'll see that there are three ways to deal with them.

Regarding this article, I agree with you and I'd vote for using it like you did: having a base and decompose property when available, otherwise a simple string.

While attempting to construct a few files, I found that it was difficult to determine if an equivalent was a single unicode character or a combination. I know you like to see the actual character, but maybe for the equivalents it would be best to use the unicode value. I'm just not sure how to make and edit equivalents easier.

I agree with you. It's also hard to review when there is no visual difference. Would you mind updating the system requirements with this information?

Another open question about equivalents for me is who will collect them? We can't expect users to do this, so how do we integrate it into the workflow? When a user submits a pull request containing a new language, we'd need to merge it and then add the equivalents in the master branch.

Next question. To allow adding comments to the source files, would you prefer to make them

I'd prefer using strict JSON inside .js files to allow code formatting (it won't work with atom-beautify otherwise) and to avoid text editors flagging errors when adding comments. We'd need to integrate a JSON validator in the build. We'd also need to integrate a component that makes sure all database files are correctly formatted (according to a code style). And finally, we'd need to create a few code style files beforehand (e.g. .jsbeautify).
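
A sketch of what that validation step could look like – stripping the comments from each .js source file and letting JSON.parse report errors; the file paths and helper names are assumptions:

// validate.js – hypothetical build check: the .js database files must be strict
// JSON once their comments are removed
const fs = require('fs');
const path = require('path');

// Naive comment stripping – fine for a sketch, but it would also strip "//"
// sequences inside string values (e.g. URLs)
const stripComments = (src) => src.replace(/\/\*[\s\S]*?\*\/|\/\/.*$/gm, '');

const walk = (dir) =>
  fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
    entry.isDirectory() ? walk(path.join(dir, entry.name)) : [path.join(dir, entry.name)]);

let failed = false;
walk('src').filter((file) => file.endsWith('.js')).forEach((file) => {
  try {
    JSON.parse(stripComments(fs.readFileSync(file, 'utf8')));
  } catch (err) {
    failed = true;
    console.error(`${file}: ${err.message}`);
  }
});
process.exit(failed ? 1 : 0);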

Mottie commented 8 years ago

I meant a build helper that then fetches the data from this repository.

That sounds like a good idea, there likely will be many variants. But sadly, my knowledge of pretty much all things server-side is severely lacking, so I won't be of much help there.

Esperanto seems like an idea of hippies, I'd vote for just ignoring it as there'll probably be no significant demand.

LOL, that's fine

Another open question about equivalents for me is who will collect them?

That's when shapecatcher and FileFormat.info become useful! I can help work on the initial data collection. And since visual equivalents won't change for a given character, a separate folder that contains the equivalents data would be much easier to maintain. We can then reference these files in the build process.

src/
├── de/
│   ├── de.js
├── equivalents/
│   ├── ü.js
│   ├── ö.js
│   ├── ä.js

I'm not sure if using the actual character would fare well when accessing the file, so maybe using the unicode value would be better (e.g. u-00fc.js instead of ü.js)?

Inside of the file:

/**
 * Visual equivalents of ü
 */
[
    "u\u0308", // u + Combining diaeresis (U+0308)
    "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
]

If, for some unique reason, a character has a different equivalent, we could define it in the language file and then concatenate the equivalents values? Or not concatenate at all, depending on what we discover. Actually, now that I think about it, I remember reading somewhere that some fonts provide custom characters in the Unicode private use areas, but let's not worry about that unless it comes up.

We'd need to integrate a JSON validator in the build.

The Grunt file uses grunt.file.readJSON and within the process of building the files we'll end up using JSON.parse and JSON.stringify which will throw errors if the JSON isn't valid. I think it would be difficult to validate the JSON before the comments are stripped out.

As for beautifying the JSON, JSON.stringify would do that:

JSON.stringify({a:"b"}, null, 4);
/* result:
{
    "a": "b"
}
*/
julkue commented 8 years ago

That sounds like a good idea, there likely will be many variants. But sadly, my knowledge of pretty much all things server-side is severely lacking, so I won't be of much help there.

I'm quite sure you can help here, you just don't know it yet :laughing: If we decide to implement a server-side component, we'll set it up using Node.js, as we're handling only JS/JSON files and using it makes things a lot easier than e.g. PHP. While you might not be familiar with it in detail, if I set up the basics you'll probably understand it quickly. Anyway, to come to a conclusion at this point: I think we need to realize a server-side component. Otherwise many variants will be necessary and it might be confusing to have that many files in a dist folder.

And since visual equivalents won't change for a given character, a separate folder that contains the equivalents data would be much easier to maintain.

Sorry, I don't understand the benefit of this if we're going to collect them using the Unicode number anyway. Could you help me understand the benefit by explaining it a little more?

JSON.parse and JSON.stringify which will throw errors if the JSON isn't valid.

That would be enough.

As for beautifying the JSON, JSON.stringify would do that

I didn't mean beautifying them in the build; I meant implementing a build integration that checks whether they are correctly formatted inside the src folder. Beautifying won't be necessary for the output.

What do you think of my question in your PR?

Wouldn't it make sense to provide the sources in the metadata object instead of comments? If they're entered manually by users without sources, we could fill in "provided by users" or something similar.

Mottie commented 8 years ago

Could you help me understand the benefit by explaining it a little more?

Well, when it comes to normalization, there are a limited number of visual equivalents for each given character. When we list the equivalents for a ü, we'll be repeating the same values in multiple languages. I was proposing centralizing these values in one place, then adding them to the language file during the build, but only if the "equivalents" value is undefined in the language file and there is an existing equivalents file for the character.

Example language file:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        }
        // no equivalents added here
    },
    ...

equivalents file

/**
 * Visual equivalents of ü
 */
[
    "u\u0308", // u + Combining diaeresis (U+0308)
    "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
]

Resulting file:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        }
        "equivalents": [
            "u\u0308",
            "\u00fc"
        ]
    },
    ...
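
A sketch of the merge step itself – adding the central equivalents only when the language file doesn't define its own; the file locations and helper names are assumptions:

// merge-equivalents.js – hypothetical build step
const fs = require('fs');

// Look up the central equivalents file for a character, e.g. src/equivalents/u-00fc.json
const loadEquivalents = (char) => {
  const file = `src/equivalents/u-${char.codePointAt(0).toString(16).padStart(4, '0')}.json`;
  return fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : undefined;
};

const mergeEquivalents = (language) => {
  Object.entries(language.data).forEach(([char, entry]) => {
    // An explicit "equivalents" value in the language file (even an empty array)
    // wins over the central file
    if (entry.equivalents === undefined) {
      const central = loadEquivalents(char);
      if (central) {
        entry.equivalents = central;
      }
    }
  });
  return language;
};

module.exports = mergeEquivalents;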

I hope I've explained my idea better.

provide the sources in the metadata object instead of comments?

Yes, that is a better idea. I guess I missed that question in the PR. I'll update the spec.

julkue commented 8 years ago

@Mottie I understand this. But what I still don't understand is the benefit of saving them in a separate file.

I was proposing centralizing these values in one place

Saving them in the "equivalents" property would be one central place too?

Mottie commented 8 years ago

Saving them in the "equivalents" property would be one central place too?

Yes, that would work too. Where would be the placement of that value within the file?

julkue commented 8 years ago

Like you've specified, in the equivalents property.

        "ü": {
            "mapping": {
                "base": "u",
                "decompose": "ue"
            },
            "equivalents": [
                "u\u0308", // u + Combining diaeresis (U+0308)
                "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
            ]
        }

I'm quite sure we're misunderstanding each other at some point, but I'm not sure where.

Mottie commented 8 years ago

What I'm saying is, for example, if you look at this list of alphabets, under "Latin", you'll see there are a lot of languages that use the á diacritic. Instead of maintaining a list of visual equivalents for that one diacritic within each language file, we centralize it in one place, but add it to each file during the build process.

Mottie commented 8 years ago

Vietnamese is going to be fun... there are a lot of diacritics with multiple combining marks which may be added in different orders.

ẫ = a + ̃  + ̂  OR a + ̂  + ̃ 

Which means the equivalents array would need to include each order combination.

"ẫ" : [
    "a\u0303\u0302", // a + ̃  + ̂ 
    "a\u0302\u0303", // a + ̂  + ̃ 
    "\u1eab"         // ẫ
]
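
A sketch of how those order combinations could be generated rather than typed by hand – decomposing the precomposed character with String.prototype.normalize and permuting its combining marks (the helper name is an assumption):

// Generate the visual equivalents of a precomposed character: every ordering of
// its combining marks plus the precomposed (NFC) form itself
const equivalentsOf = (char) => {
  const [base, ...marks] = Array.from(char.normalize('NFD'));
  const permute = (arr) =>
    arr.length <= 1
      ? [arr]
      : arr.flatMap((m, i) =>
          permute([...arr.slice(0, i), ...arr.slice(i + 1)]).map((rest) => [m, ...rest]));
  const orders = permute(marks).map((p) => base + p.join(''));
  return [...new Set([...orders, char.normalize('NFC')])];
};

console.log(equivalentsOf('\u1eab'));
// → the two combining-mark orders ("a\u0302\u0303", "a\u0303\u0302") plus "\u1eab"
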
Mottie commented 8 years ago

Did that clarify things? And what are we going to do about uppercase characters?

julkue commented 8 years ago

@Mottie

Did that clarify things?

Yes, thanks! I think I understand you now. You meant to extract them into separate files to avoid redundant information in the mapping files.

Seems like a good idea. Let's talk about the filenames. Naming them after the diacritic itself may cause issues on some operating systems, but naming them after the Unicode number will make it hard to find them quickly. Maybe we could map them by giving them a unique ID? Or do you see any alternatives? Example:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        },
        "equivalents": 1 // 1 would be a filename: ../equivalents/1.js
    },
    ...

Not the most beautiful variant though.

And what are we going to do about uppercase characters?

I've replied to this question here

There may be diacritics that only exist as upper case characters. To play it safe I'd also include upper case diacritics. Don't you think so?

Mottie commented 8 years ago

Maybe we could map them by giving them a unique ID? Or do you see any alternatives?

Now that I've counted how many diacritics are listed just in the "Latin" table (246), I think it might be a better idea to group them a little LOL. I thought about grouping by the "base" letter (with a fallback to the "decompose" value) so there could be around 26 files (not counting characters that need to be encoded), but we haven't even considered languages like Arabic, Chinese and Japanese, which I have no clue how to even begin with. Should we even worry about non-Latin-based languages at this stage?

If the "base" value was a character that needed encoding (e.g. ß, then I think the unicode value would be the best ID for the file. Something like u-00df.js?.

upper case characters

Including both upper and lower case would be the best idea then.

julkue commented 8 years ago

I'll come back to this tomorrow with a clear head. GN8 :crescent_moon:

julkue commented 8 years ago

thought about grouping by the "base" letter (with a fallback to the "decompose" value) so there could be around 26 files (not counting characters that need to be encoded)

Could you update the spec with this?

but we haven't even considered languages like Arabic, Chinese and Japanese

Absolutely right. Before we start implementing the database, we should have a layout that works in all languages. I've tried to find out if there are any cases that wouldn't work with our current schema, but wasn't successful. We'll need someone familiar with these languages...

I'd like to ask @gromo if you can help us out. We'd like to know whether the Arabic alphabet contains diacritics like e.g. Latin does, and whether they can be mapped to a "base" character (e.g. "u" when the diacritic is "ü"). Hopefully you're familiar with this alphabet as someone living in Uzbekistan. I'd appreciate your answer!

Mottie commented 8 years ago

Could you update the spec with this?

Done. I've updated the spec (and PR). Let me know what you think.

Also, I think ligatures (æ decomposes to ae) need to be mentioned in the spec since they aren't "officially" named diacritics.

gromo commented 8 years ago

@julmot The Uzbek language uses the Cyrillic/Latin alphabet, so I cannot help you with this

Mottie commented 8 years ago

@gromo we could still use some feedback :grin:

@julmot I forgot to ask, does ß only map to SS (uppercase)? If I use the following javascript, it gives interesting results:

console.log('ß'.toLowerCase(), 'ß'.toUpperCase());
// result: ß SS
julkue commented 8 years ago

@Mottie

Done. I've updated the spec (and PR). Let me know what you think.

Thank you, well done!

I have a few question about the equivalents spec now:

  1. If we're going to add an equivalents file for every base character, then it would also include equivalents from different languages with the same base character. Am I right? If so, this wouldn't allow output for a specific language only. With that in mind, does this still make sense?
  2. If yes, assume the following situation: a base character's equivalents file exists, but you don't want to include it and also don't want to overwrite it with manual equivalents – what would be necessary to exclude the equivalents file?
  3. Do you expect that overwriting an equivalents file manually happens often? If so, it might become confusing after a while, as we'd no longer have a centralized place for them – which was the whole idea.
  4. Is there a common use case for including HTML codes? Browsers will render them to the actual character and non-JS languages don't need them.
  5. Might be important to note how to determine a filename like u1eb9u0301.js.

does ß only map to SS (uppercase)?

ß is a lower case character. That means there is no ß in upper case text. Even though an upper case ß character exists in Unicode, it isn't used in the German language (it's not officially approved). For the mapping that means:

ß in lower case => ss

ß doesn't map to SS.

@gromo Thanks for your quick reply. I'm sorry to hear that; I assumed otherwise because this page lists "Uzbek" under "Arabic". Anyway, could you answer the same questions for the Cyrillic alphabet?

gromo commented 8 years ago

@julmot It's not clear to me what you're working on, but I think the letters you're looking for are (letters => base letters):

Russian: Ёё => Ее, Йй => Ии

Uzbek: Gʻgʻ => Gg, Oʻoʻ => Oo

Mottie commented 8 years ago

also include equivalents from different languages with the same base character

Yes. I'm not sure it would matter though, because we're providing a list of visual equivalents. The user can choose to use the list or not. Am I mistaken here? We're just providing data, we aren't manipulating anything.

what would be necessary to exclude the equivalents file?

I was envisioning that if an equivalents value was defined (even an empty array or string) in the language file, then the central equivalents would not be added. If we're providing a list to an end user, then I think they could choose whether or not to include the data.

Do you expect that overwriting an equivalents file manually happens often?

I doubt it. I was thinking that it should be an option though.

Is there a common use case for including HTML codes?

I was actually thinking that a user may be parsing an HTML file for other purposes. If it does sound like a good idea, what about URI encoding? encodeURI('ä') => "%C3%A4". Too much? I do tend to go overboard on ideas LOL.

Might be important to note how to determine a filename like u1eb9u0301.js

I was actually playing around with an idea of building the equivalents JSON - see this demo - it's just a preliminary idea. A third cross-reference of actual equivalents would need to be used and included in the build process (e.g. "ö" = "\u04e7" // Cyrillic small letter o with diaeresis (U+04E7)). This way, we wouldn't need to manually edit the JSON. This might even change my idea of having an equivalents folder – we could just create a temporary JSON file for cross-referencing into the main language file during the build. What do you think?
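
As a rough illustration of that idea – generating the normalization-based equivalents automatically and merging in a small, manually maintained cross-reference of lookalikes from other Unicode blocks (the table below is an illustrative excerpt, not real project data):

// buildEquivalents – sketch of the automatic generation plus a manual cross-reference
const crossReference = {
  'ö': ['\u04e7'] // Cyrillic small letter o with diaeresis (U+04E7)
};

const buildEquivalents = (char) => {
  // The decomposed (NFD) and precomposed (NFC) forms can be derived automatically
  const generated = [char.normalize('NFD'), char.normalize('NFC')];
  return [...new Set([...generated, ...(crossReference[char] || [])])];
};

console.log(buildEquivalents('ö'));
// → [ "o\u0308", "\u00f6", "\u04e7" ]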

andrewplummer commented 8 years ago

@julmot To answer your above question, the src files I use can be parsed directly, which is something I wouldn't like to lose, so I probably wouldn't use the <% diacritics %> idea. Instead, I would probably prefer to modify my code to "speak" the format that you guys are deciding upon here. Support for this format could be added in a minor version and my format would be deprecated in a major version. The user would then link these at runtime.

If I might add to the discussion, I read through the thread and had the same idea as @Mottie that the filenames should include the territory, i.e. de_AT.js. If there are no dialects or the dialects are all equivalent, it could just be de.js.

Lastly, I can tell you that Chinese and Korean should not have diacritic marks. Japanese has two: the voiced and semi-voiced sound marks. Unicode reserves a combining form for both of these but I've never seen them used as the pre-combined form all have their own codepoints (number of combinations is rather limited).

julkue commented 8 years ago

@gromo Thanks for your feedback. Just two more questions:

  1. Are those all the diacritic characters in Russian and Uzbek?
  2. Are those mappings the real meaning behind the diacritics or just the ASCII equivalents? To make my question clearer: "ö" in German would map to "o" in ASCII, but the real meaning behind it is "oe".

@andrewplummer Thank you for your answer!

Support for this format could be added in a minor version and my format would be deprecated in a major version. The user would then link these at runtime.

Sorry, but I don't fully understand you here. What would a user link? A file that overwrites a method? If so, then this file would also need the diacritics mapping information, which would mean that at least in this file a <% diacritics %> placeholder (or something similar) would be necessary.

Lastly, I can tell you that Chinese and Korean should not have diacritic marks. Japanese has two

Thank you very much for this information. @Mottie I guess this makes it simpler here and we don't need to spend much time on this.

@Mottie

I was actually thinking that a user may be parsing an HTML file for other purposes. If it does sound like a good idea, what about URI encoding? encodeURI('ä') => "%C3%A4". Too much? I do tend to go overboard on ideas LOL.

😆 If things are getting generated automatically, I don't see a disadvantage. Otherwise, no, I wouldn't include this.

I was actually playing around with an idea of building the equivalents JSON

and

Yes. I'm not sure it would matter though, because we're providing a list of visual equivalents. The user can choose to use the list or not. Am I mistaken here? We're just providing data, we aren't manipulating anything.

I have to spend more time on this to understand the equivalents topic and the automatic generation. Currently I don't have much time, and starting next week I'll be on vacation, but I'll let you know when I've made progress. In the meantime I'd like to let you know that I'll convert the diacritics project to an organization. This has a few benefits:

  1. It makes clear that this doesn't rest solely on my shoulders. Even if it was my idea, it makes clear that many people are necessary to build this. And since you're spending as much time on it as I am, I think it's fair.
  2. We will need multiple repositories – the database repository (this one), one for the server-side component, at least one Node.js build helper and additionally a Grunt task – and an organization conveys a clear sense of belonging together.
  3. We can create teams for reviewers, maybe even per language. Of course we need to find volunteers first :bowtie:

I've already bought the domain diacritics.io, which temporarily redirects to this repository as long as we don't have a website.

andrewplummer commented 8 years ago

Sorry, but I don't fully understand you here.

So a specific example would be something like:

Sugar.Array.setOption('collateEquivalents', obj);

Where obj is a JavaScript object following your above format, essentially mapping every entry in data to its equivalent in decompose for the purpose of collation. Sugar doesn't handle this option per locale as it is not that advanced yet, but that support could theoretically be added.
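
For illustration, the obj above could be derived from the database's data format roughly like this (the flattening helper is hypothetical; the database shape follows the examples earlier in this thread):

// Flatten a language's "data" object into a character → decomposed-value map
const toCollationMap = (data) => {
  const map = {};
  for (const [char, entry] of Object.entries(data)) {
    map[char] = typeof entry.mapping === 'string'
      ? entry.mapping
      : entry.mapping.decompose || entry.mapping.base;
  }
  return map;
};

const obj = toCollationMap({ 'ü': { mapping: { base: 'u', decompose: 'ue' } } });
// obj → { 'ü': 'ue' }
// Sugar.Array.setOption('collateEquivalents', obj);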

FWIW, I checked with my Unicode-obsessed friend and the combined diacritic forms may possibly be used in Japanese in some very rare cases.

andrewplummer commented 8 years ago

@Mottie Not sure about the above notification? I just added a comment... didn't mean to unassign you??

gromo commented 8 years ago

@julmot

  1. Are those all the diacritic characters in Russian and Uzbek?

In Russian, I believe, yes. Also, according to https://en.wikipedia.org/wiki/Diacritic:

Belarusian and Russian have the letter ё. In Russian, this letter is usually replaced by е, although it has a different pronunciation. The use of е instead of ё does not affect the pronunciation. Ё is always used in children's books and in dictionaries. A minimal pair is все (vs'e, "everybody" pl.) and всё (vs'o, "everything" n. sg.). In Belarusian the replacement by е is a mistake, in Russian, it is permissible to use either е or ё for ё but the former is more common in everyday writing (as opposed to instructional or juvenile writing).

There is only one diacritic letter - Ёё.

According to https://en.wikipedia.org/wiki/Uzbek_alphabet there are the diacritic letters Oʻ oʻ & Gʻ gʻ in the modern Latin-based alphabet. And in Uzbek Cyrillic: Ҳ ҳ => Х х, Қ қ => К к, Ў ў => У у, Ғ ғ => Г г.

When the Uzbek language is written using the Latin script, the letters Oʻ (Cyrillic Ў) and Gʻ (Cyrillic Ғ) are properly rendered using the character U+02BB ʻ MODIFIER LETTER TURNED COMMA.[5] However, since this character is absent from most keyboard layouts and many fonts, most Uzbek websites – including some operated by the Uzbek government[2] – use either U+2018 ‘ LEFT SINGLE QUOTATION MARK or straight (typewriter) single quotes to represent these letters.

julkue commented 8 years ago

@andrewplummer

Not sure about the above notification? I just added a comment... didn't mean to unassign you??

Whoops, I think we've just encountered a GitHub bug. I've removed Rob from the contributors, as I've invited him to join the new diacritics organization. Until he accepts that invitation, he's no longer available for assignment. GitHub possibly detected that, and since you were the first user to take an action in this issue after I removed him, it attributed that action to you. Strange...

So a specific example would be something like

I understand now. So basically you wouldn't include the diacritics in your files, but would refactor the structure internally to allow users to overwrite them. This is a new use case, as it means that the actual data (mapping information) needs to be available locally, not just fetched in builds. It also means that every library author would need to implement such a method to allow overwriting. I think that might be a good approach for your specific library, but not in general; it depends on the setup. For example I'd like to implement the mapping information in mark.js, where I would just set a placeholder in a build template to include the mapping object. Accessing my source files in production isn't allowed, so that wouldn't be a problem. @Mottie What do you think of this?

andrewplummer commented 8 years ago

Wow ok nice bug :)

Yeah, to be honest I think that my use case may not accurately represent the most common use case you will likely encounter. It's a bit of an outlier. I would definitely get some opinions from libraries that would make better use of the data.

julkue commented 8 years ago

@andrewplummer

I would definitely get some opinions from libraries that would make better use of the data.

Thanks for the feedback! Do you have something in mind?

andrewplummer commented 8 years ago

Paging @Kimtaro! My above mentioned friend is not only a Unicode nut but also a linguistics nut. He's also Swedish, and they love their diacritics.

Mottie commented 8 years ago

Chinese and Korean should not have diacritic marks. Japanese has two

❤️ Thanks for letting us know @andrewplummer!

If things are getting generated automatically, I don't see a disadvantage.

So far, I've collected data from several sites and this demo contains the current result. Here is a snippet showing just the á entry:

    "á": [
        // U+00E1 - LATIN SMALL LETTER A WITH ACUTE
        "\u00e1",
        "&#225;", // HTML decimal code
        "&#x00e1;" // HTML hex code
        "&aacute;", // HTML common entity code
        `a${ACUTE}`
    ]

The code is very messy, but I'll get it cleaned up.

I'll convert the diacritics project to an organization.

Awesome!

I just added a comment... didn't mean to unassign you??

OMG @andrewplummer, why?!

LOL, I left GitHub a little message to let them know.

According to {wikipedia}

@gromo As much as I love Wikipedia, would you consider those entries accurate? Are there any other sources you've found that support the information? Either way, thank you for the update!

gromo commented 8 years ago

@Mottie In this case Wikipedia contains more detailed information than I have living in Uzbekistan. If you look at my previous message, I mentioned the same letters based on my knowledge. The only difference is the letter Йй in the Cyrillic alphabet – I'm not sure if it can be counted as a diacritic of the letter Ии, but I've seen in some systems that Й was replaced by И and a Unicode character after it.

Mottie commented 8 years ago

I found this valuable resource! http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt (See the "precomposed" section half way down). The main issue is that it was never approved and was deprecated (ref). So, even though it is directly related to our database, would it still be a good idea to use it?

Secondly, I saw this post in the Elasticsearch blog about ASCII (diacritic) folding... they use the ASCII Folding Filter from the Lucene Java library. I'm not sure where to find that Java code, but I suspect they are doing the same thing as in the DiacriticFolding.txt file... I will search for the resource later (after some sleep).

Update: https://github.com/GVdP-Travix/ASCIIFoldingFilter/blob/master/ASCIIFoldingFilter.js

julkue commented 8 years ago

@Mottie Thanks for this idea.

First off, the deprecated DiacriticFolding won't be something we can use, as we need to guarantee correctness.

I've had a look at the Elasticsearch site you're referring to, but wasn't able to find the original "ASCIIFolding" project (mapping database). So I've only had a look at the JS mapping you've provided.

From my point of view this would only be a solution for the base property, as the decomposed value isn't covered. For example I've searched for "ü" and only found a mapping to "u". On the other hand, "ß" is mapped to "ss" which is contradictory.

Therefore I have the following questions:

  1. Is this a trustworthy source?
  2. Is the data covering all necessary base mappings? (they specify covered Unicode blocks)
  3. Is the data covering the correct base mappings? (for example, we've defined ß as the mapping for ß, while they define ss)
Mottie commented 8 years ago

I don't know the specifics, but the DiacriticFolding was created by a member of the Unicode group. So that may not guarantee correctness, but it might be about as close as we can get right now.

And yeah, I agree that the "ASCIIFolding" should only be used for the base mapping.

Is this a trustworthy source?

I think so. Elasticsearch is an open source RESTful search engine with almost 20k stars. Users are relying on the ASCII folding to perform searches.

Is the data covering all necessary base mappings?

It looks like they are mapping only by Unicode blocks and not by language. But in their case the ASCII folding doesn't differ; it looks like they are basically stripping away any diacritic, which is essentially what the DiacriticFolding file appears to be doing.

Is the data covering the correct base mappings?

I'm not sure how to answer this question. ß isn't really a diacritic, so stripping away any diacritics from the character doesn't apply; I think that's why we chose to leave it unchanged. I guess the question should be: how should we define the base mapping of a character? Should it be the character without any diacritic(s), or as in the case of the ASCIIFolding, should it convert the character to accommodate searches from U.S. keyboards?

julkue commented 8 years ago

So that may not guarantee correctness, but it might be about as close as we can get right now.

As "ASCIIFolding" seems to contain the same information, I think we should focus on that?

I think so. Elasticsearch is an open source RESTful search engine with almost 20k stars. Users are relying on the ASCII folding to perform searches.

I know Elasticsearch, but as I couldn't find the database they're using, I assume they're getting it from a third party too? In that case, we don't need to find out whether Elasticsearch is trustworthy, but whether the third party ("ASCIIFoldingFilter.js") is. We also need to make sure that we can use their database from a legal point of view.

Should it be the character without any diacritic(s), or as in the case of the ASCIIFolding, should it convert the character to accommodate searches from U.S. keyboards?

We can make this decision easy: If we're going to use their database, we need to use what they're providing.

Mottie commented 7 years ago

It looks like they use an Apache license (http://www.apache.org/licenses/LICENSE-2.0)

Source: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_solr_4_5_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java

What I'm actually saying is that I think this code is essentially implementing the "DiacriticFolding" data from the Unicode organization; but I can't say for sure until I actually compare the two.


Interestingly, this port to Go (https://github.com/hkulekci/ascii-folding) has an MIT license.

julkue commented 7 years ago

It looks like they use an Apache license

I'm not a lawyer, but according to the license it allows usage with a copyright notice. However, in users' end products there can't be a notice, since it'll just be e.g. a regex (no logic). I'm not sure whether providing their copyright notice in our database is enough. To guarantee we're allowed to use it we need to contact them.

Thanks for providing the Java file, that helped.

What I'm actually saying is that I think this code is essentially implementing the "DiacriticFolding" data from the Unicode organization; but I can't say for sure until I actually compare the two.

Interesting. We should investigate this and find out whether we can use the database. If so, I'd agree to use it to automatically generate the base property. But we especially need to document what happens with characters like ß that aren't diacritics. The special thing about ß is that when writing something in upper case it's replaced by SS, otherwise by ss.

Interestingly, this port to Go (https://github.com/hkulekci/ascii-folding) has an MIT license.

@hkulekci Can we assume that this is a mistake?

hkulekci commented 7 years ago

@julmot Yeah, you can. I am not good at licensing. I was only trying to give an example of something in golang. :) I guess, in this case, I must choose the Apache license. If you know, please let me know which license I should choose.

julkue commented 7 years ago

@hkulekci No, sorry, I don't know either. But since this project is released under the MIT license and we'd like to use this database, this is of interest to us too.

@Mottie If you have time, could you please find an owner of the provided Java library and contact him regarding the usage (and cc me, please)? There's another question he can probably answer. I just asked myself: is the mapping e.g. ü => u common in all languages except German, where it could also be mapped to ue? I mean, if German is the only language that needs the decompose property, and all other languages only have a base, then, Houston, we have a problem. The entire database would be pointless, as everything is already covered in the ASCIIFolding project.

Mottie commented 7 years ago

Sorry, I've had a busy day; I'm just now checking GitHub.

I do know there is at least one additional language that needs diacritics decomposed... I found this issue (https://github.com/OpenRefine/OpenRefine/issues/650#issuecomment-12886256) about the Norwegian language: