diacritics / database

A reliable diacritics database with their associated ASCII characters
MIT License

Concept & Specification #1

Closed julkue closed 7 years ago

julkue commented 8 years ago

The purpose of this repository is to collect diacritics with their associated ASCII characters in a structured form. It should be the central place for various projects when it comes to diacritics mapping.

As there is no single, trustworthy and complete source, all information needs to be collected by users manually.

Example mapping:

Schön => Schoen
Schoen => Schön

User Requirements

Someone using diacritics mapping information.

It should be possible to:

  1. Output diacritics mapping information in a CLI and web interface
  2. Output diacritics mapping information for various languages, e.g. a JavaScript array/object
  3. Fetch diacritics mapping information in builds
  4. Filter diacritics mapping information by (example queries follow this list):
    • Diacritic
    • Mapping value
    • Language
    • Continent
    • Alphabet (e.g. Latin)
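
For illustration, such filter queries could look like the following (a hypothetical API; parameter names and values are only illustrative, derived from the filters above):

    /?language=DE
    /?language=DE&alphabet=Latin
    /?diacritic=%C3%BC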

Contributor Requirements

Someone providing diacritics mapping information.

Assuming every contributor has a GitHub account and is familiar with Git.

Providing information should be:

  1. Easy to collect
  2. Possible without manual invitations
  3. Possible without registration (an exception is: "Register with GitHub")
  4. Done at one place
  5. Easy to check correctness of information structure
  6. Checkable before acceptance by another contributor familiar with the language
  7. Possible without a Git clone

System Specification

There are two ways of realization:

  1. Create a JSON database in this GitHub repository, as this fits user and contributor requirements.
  2. Create a database in a third-party service that fits the user and contributor requirements.

    Tested:

    • Transifex: Doesn't fit requirements. It would allow providing mapping information, but not metadata.
    • Contentful: Doesn't fit requirements. It would require a manual invitation and registration.

Because we're not aware of any further third-party services that could fit the user and contributor requirements, we'll proceed with the first option.

System Requirements

See the documentation and pull request.

Build & Distribution

Build

According to the contributor requirements it should be possible to compile the source files without making a Git clone necessary. This means that we can't require users to run e.g. $ grunt dist at the end, since this would require cloning the repository, installing dependencies and running the build. What we'll do instead is implement a build bot that runs our build on Travis CI and commits the changes directly to a dist branch in this repository. Therefore, once something is merged or committed, the dist branch will be updated automatically. Some people are already doing this to update their gh-pages branch when something changes in the master branch (e.g. this script).
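
A rough sketch of what such a Travis-triggered deploy step could look like (the script name, the GH_TOKEN variable and the dist/diacritics.json path are assumptions, not the actual bot):

    // deploy-dist.js – sketch only
    const { execSync } = require('child_process');
    const run = cmd => execSync(cmd, { stdio: 'inherit' });

    run('git checkout -B dist');                      // (re)create the dist branch locally
    run('git add --force dist/diacritics.json');      // add the generated build output
    run('git commit --allow-empty -m "Update dist [skip ci]"');
    run(`git push --force https://${process.env.GH_TOKEN}@github.com/diacritics/database.git dist`);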

Since we'll use a server-side component to filter and serve the actual mapping information, we just need to generate one diacritics.json file containing all data.

To make parsing easier and to encode diacritics as Unicode escape sequences in production, we're going to need a build that minifies the files and encodes the diacritics. This should be done using Grunt.
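
A minimal sketch of such a Grunt task (the src layout, file names and task name are assumptions; the real build may differ):

    // Gruntfile.js – sketch only
    module.exports = grunt => {
        grunt.registerTask('dist', () => {
            const out = {};
            grunt.file.expand('src/**/*.json').forEach(file => {
                const lang = file.split('/')[1]; // e.g. src/de/de.json -> "de" (layout assumed)
                out[lang] = grunt.file.readJSON(file);
            });
            // Minify (no whitespace) and escape non-ASCII characters as \uXXXX:
            const json = JSON.stringify(out).replace(/[\u0080-\uffff]/g,
                ch => '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0'));
            grunt.file.write('dist/diacritics.json', json);
        });
    };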

Integrity

In order to ensure integrity and consistency we need the following in our build process:

Distribution

To provide diacritics mapping according to the User Requirements, it's necessary to run a custom server-side component that makes it possible to sort, limit and filter the information and output it in different ways (e.g. as a JS object or array). This component should be realized using Node.js, as it's well suited for handling JS/JSON files, whereas PHP would cause a lot more serializing/deserializing.
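
A minimal sketch of that component, assuming Express and the generated diacritics.json (the data structure shown is an assumption):

    // server.js – sketch only
    const express = require('express');
    const data = require('./dist/diacritics.json'); // e.g. { "de": { ... }, "es": { ... } } – structure assumed

    const app = express();

    app.get('/', (req, res) => {
        let result = data;
        if (req.query.language) {
            // keep only the requested language, if it exists
            const lang = req.query.language.toLowerCase();
            result = data[lang] ? { [lang]: data[lang] } : {};
        }
        res.json(result);
    });

    app.listen(8080);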

Next Steps


This comment is updated continuously during the discussion

julkue commented 7 years ago

@Mottie Thanks for this information. We definitely need to investigate this for more languages.

I've thought about this again and came to the conclusion that even if the decompose property is unnecessary in almost every language (exceptions being e.g. German and Norwegian), the database still makes sense. Users can't use the ASCIIFolding class directly, as it's a Java class they can't integrate. Our project would make this mapping available to them. We're also providing metadata for all languages, which allows users to filter them by their needs, as well as processes to integrate the data into their projects.

julkue commented 7 years ago

First off, I haven't received an answer from the Lucene team regarding automatically generating the base property yet. Hopefully we'll have an answer soon.

Anyway, as soon as the API is merged, the next step is to implement a process that allows users to integrate the diacritics project. We have several kinds of projects:

I'd like to start by discussing JavaScript projects. We need an npm module that replaces placeholders with diacritics data. This module will use the API to fetch live data. There should be two possible placeholder types:

While a placeholder syntax like <% diacritics %> would make sense, it's probably not the best idea. Why? Because there might be projects using the source files in development, like mark.js. It uses the source files during development and only runs unit tests against the compiled files. If we used the above syntax, an error would be thrown. To avoid this, we need a placeholder syntax that can simply be replaced, but is also valid without the replacement. An example could be:

const x = [/* diacritics: /?language=DE */];

[/* diacritics: /?language=DE */] would be the placeholder. As the actual information is placed within a comment, this would be valid even without the replacement. diacritics would be the actual keyword here. Everything following the : would be an optional filter URL that is passed to the API.
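
A rough sketch of how the npm module could handle such placeholders (the function and parameter names are only illustrative):

    // Finds every "[/* diacritics: <filter> */]" placeholder, queries the API with the
    // filter URL and replaces the whole placeholder with the returned JSON.
    const placeholder = /\[\/\*\s*diacritics:\s*(\S+)\s*\*\/\]/g;

    async function replacePlaceholders(source, fetchFromApi) {
        let result = source;
        for (const [full, filter] of source.matchAll(placeholder)) {
            const data = await fetchFromApi(filter); // e.g. GET <api>/?language=DE
            result = result.replace(full, JSON.stringify(data));
        }
        return result;
    }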

This is just an idea, not set in stone. I'm open to other ideas. Anyone?

Okay, those projects that aren't using a build would need to create a module that overwrites a method in their project by using the npm module in a build. There won't be a way to use the diacritics project without this module (or without a build), as the data is fetched dynamically from the API.

@Mottie What do you think?

Mottie commented 7 years ago

Doesn't the API also need to indicate the type & format of the output?

/?diacritic=%C3%BC&output=base,decompose&format=string

I'm not yet clear on how we would get the API to only return the first equivalent, or a specific equivalent if there is one. Also, how to get only that specified equivalent's data (e.g. unicode).


In the case of the mark.js repo, if you added a placeholder for say u using /?base=u, we'd need the API to return a string of all equivalents[0].raw to create the desired output of uùúûüůū.
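
Roughly this kind of transformation (the response shape below is only an assumption):

    // A /?base=u response would list every diacritic whose base is "u";
    // joining each entry's equivalents[0].raw gives the character list.
    const response = {
        'ù': { equivalents: [{ raw: 'ù' }] },
        'ú': { equivalents: [{ raw: 'ú' }] }
        // ...
    };
    const chars = 'u' + Object.values(response)
        .map(d => d.equivalents[0].raw)
        .join(''); // => "uùú..."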

julkue commented 7 years ago

You're right, there should be some parameters to ignore some data, e.g. a value to ignore the equivalents or just some of the equivalents (by name, e.g. unicode). Ignoring either base or decompose makes no sense in my opinion, as both are optional and both are mapping information. Some diacritics have a base and no decompose, and vice versa.

In the case of the mark.js repo, if you added a placeholder for say u using /?base=u, we'd need the API to return a string of all equivalents[0].raw to create the desired output of uùúûüůū.

Yes, that format parameter wouldn't be part of the API in my opinion. This is something you need to specify in the placeholder, but it's handled by the npm module.
In the case of mark.js that would be an entire array, not just one limited to e.g. u.

julkue commented 7 years ago

I've thought about this again and making the format parameter part of the API has one benefit: It would allow access to these formats outside the npm module. This is especially helpful if we want to show the code on the website (the array or the entire method). Users could then just copy and paste the code into their applications – which would be another good solution for projects without a build. So I'm open to this option.

If we introduce an option to specify the output structure (non-JSON), then this shouldn't be a parameter (e.g. ?output=js-array) in my opinion. Everything under the / route currently generates JSON. So the cleanest thing would be to introduce a new route, e.g. /js-array/?language=DE, where js-array is the output structure and the parameters are just like for the / route.
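
As a purely illustrative example, such a route could respond with plain JavaScript instead of JSON:

    // GET /js-array/?language=DE (response values illustrative only)
    ["ü", "ö", "ä"]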

@Mottie What do you think?

Mottie commented 7 years ago

Sorry for not responding earlier!

Do you still think that this will be necessary? I think the npm module would be able to output any format you need. I am starting to think that adding a new route would probably complicate the use of the API.

As an aside, I have started to work out what the npm module will provide and I've gotten stuck on how to deal with characters that are not going to be included under any language... like what happens when someone tries to remove diacritics from a crazy string like Iлtèrnåtïonɑlíƶatï߀ԉ? So I think the solution would be to create an en entry in the database that covers all the non-language-specific diacritics. It's going to be huge.

julkue commented 7 years ago

Do you still think that this will be necessary? I think the npm module would be able to output any format you need. I am starting to think that adding a new route would probably complicate the use of the API.

I think it has one big advantage: It would allow copy and paste directly from the website. I'm currently imagining how our (upcoming) website could look: a website with a table full of diacritics, with their metadata and mapping information. You can filter and sort everything, and finally you can just click "get code", select a language and structure (e.g. JavaScript and object or array) and get the code. This could be done using an option of the API. If it were part of the npm module, we would either have to create redundant code (npm module and website) or just not implement such a button.

Interesting point in your second paragraph. Why would you call it "en"? And how would you map all kinds of Unicode characters?

Mottie commented 7 years ago

I think it has one big advantage: It would allow copy and paste directly from the website.

Ok, sounds good then!

Why would you call it "en"?

Well, English doesn't really include diacritics, and even when we do use them, we ignore them all the time. I didn't want to name it something like default and make an exception in the spec. So, the format will follow the spec like all the other languages.

One block of entries will remove all combining diacritics... so the base would be an empty string:

    "data": {
        // combining diacritics -> convert to empty string
        "\u0301": { "mapping": { "base": "" } },
        "\u0300": { "mapping": { "base": "" }},
        "\u0306": { "mapping": { "base": "" }},
        "\u0302": { "mapping": { "base": "" }},
        "\u030c": { "mapping": { "base": "" }},
        ...
    }

Then we could include the decomposing of other symbols, e.g. "①" into "(1)"...

julkue commented 7 years ago

@Mottie I think that including special characters that aren't diacritics makes sense (e.g. "①"), but we can't call them "diacritics".

I'd say we should decide whether we're going to include them depending on the effort. Is there any existing database, like the one for the HTML entities? If so, we can continue by creating a new file in the build folder and adding them to the generated diacritics.json. A new API option should also allow excluding them. If there's no such database and it's a lot of effort, I don't think we can continue with it, at least not at the current time. In my opinion we should focus on creating the npm module, hopefully before the new year. If it takes too much time, it may be better to discuss it later, when we have time. In any case, I'd personally find it confusing to name it "en" if English doesn't contain diacritics. I think the cleanest approach would be to just create a single JSON file directly in the src folder.

  1. Is there an existing database?
  2. How much time will it take to create that mapping information?
  3. In case there's no existing database: What do you think of the naming?

julkue commented 7 years ago

Btw.: Is the "Write" tab (textarea) also delayed for you while typing?

Mottie commented 7 years ago

I'm not having any issues with the textarea.

I started with a bunch of characters and plugged them into node-unidecode, which stripped out the diacritics, and then added them to the data set... although some results ended up as [?]. The list I was working on is nowhere near complete.
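
Roughly what that looks like (assuming the unidecode package as published on npm):

    const unidecode = require('unidecode');
    // strips the diacritics; as noted above, some characters end up as "[?]"
    unidecode('Iлtèrnåtïonɑlíƶatï߀ԉ');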

In the meantime, I'll put this part on hold and continue working on the npm module.

julkue commented 7 years ago

is nowhere near complete

How do you know that? And where did you get the data from?

julkue commented 7 years ago

Ping @Mottie. And what's the current status with the npm module spec?

Mottie commented 7 years ago

Hey @julmot!

I'll clean up what I have in the works and post it in about 4 hours (after I go to the gym)... it is still incomplete, but it'll give you an idea of where things are now.

Mottie commented 7 years ago

As I said, it's still a work-in-progress... https://github.com/diacritics/node-diacritics-transliterator/tree/initial

julkue commented 7 years ago

@Mottie Would you mind submitting a PR? This would allow us to have a conversation about it directly in the repository.

julkue commented 7 years ago

Finally, we're in the last phase and will be going live soon.