Closed — julkue closed this issue 7 years ago
@Mottie Thanks for this information. We definitely need to investigate this for more languages.
I've thought about this again and came to the conclusion that even if the `decompose` property is unnecessary in almost every language (German and Norwegian being notable exceptions), the database still makes sense. Users can't use the ASCIIFolding class, as it's a Java class they can't integrate directly. Our project would make it usable for them. We're also providing metadata for all languages, which allows users to filter diacritics by their needs, along with processes for integrating the data into their projects.
First off, I haven't received an answer from the Lucene team regarding automatically generating the `base` property yet. Hopefully we'll have an answer soon.
Anyway, as soon as the API is merged, the next step is to implement a process that allows users to integrate the diacritics project. We have several kinds of projects:
I'd like to start by discussing JavaScript projects. We need an npm module that replaces placeholders with diacritics data. This module will use the API to fetch live data. There should be two possible placeholder types:
While a placeholder syntax like `<% diacritics %>` would make sense, it's probably not the best idea. Why? Because there might be projects using the source files in development, like mark.js, which tests against the source files and only runs unit tests with compiled files. With the syntax above, an error would be thrown. To avoid this, we need a placeholder syntax that can simply be replaced but is also valid without the replacement. An example could be:
const x = [/* diacritics: /?language=DE */];
`[/* diacritics: /?language=DE */]` would be the placeholder. As the actual information is placed within a comment, this would be valid even without the replacement. `diacritics` would be the actual keyword here, and everything following the `:` would be an optional filter URL that is passed to the API.
This is just an idea, not set in stone. I'm open to other ideas. Anyone?
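To make the idea concrete, here is a minimal sketch of how such a replacement could work. Everything in it is hypothetical: the regex, the `fetchDiacritics` helper, and the dummy data are illustrations of the proposal, not part of any existing module.

```javascript
// Hypothetical sketch: replace diacritics placeholders in a source string.
// Without the replacement, the placeholder is simply an empty array literal,
// so the source file stays valid JavaScript.
const PLACEHOLDER = /\[\/\* diacritics:\s*(\S+)\s*\*\/\]/g;

// Stand-in for an API call; a real module would fetch live data from the API.
function fetchDiacritics(filterUrl) {
  // e.g. filterUrl === "/?language=DE"
  return ['ü', 'ö', 'ä', 'ß']; // dummy data for illustration only
}

function replacePlaceholders(source) {
  return source.replace(PLACEHOLDER, (match, filterUrl) =>
    JSON.stringify(fetchDiacritics(filterUrl))
  );
}

const input = 'const x = [/* diacritics: /?language=DE */];';
console.log(replacePlaceholders(input));
// → const x = ["ü","ö","ä","ß"];
```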
Okay, projects that aren't using a build would need to create a module that overwrites a method in their project by using the npm module in a build. There won't be a way to use the diacritics project without this module (or without a build), as the data is fetched dynamically from the API.
@Mottie What do you think?
Doesn't the API also need to indicate the type & format of the output?
/?diacritic=%C3%BC&output=base,decompose&format=string
- `output` would indicate which data entries to return
- `format` should be either a string, array or object

I'm not yet clear on how we would get the API to only return the first equivalent, or a specific equivalent if there is one. Also that specified equivalent's specific data (e.g. `unicode`).
In the case of the mark.js repo, if you added a placeholder for, say, `u` using `/?base=u`, we'd need the API to return a string of all `equivalents[0].raw` to create the desired output of `uùúûüůū`.
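Assuming the API's JSON for `/?base=u` carries an `equivalents` array per entry (the response shape below is an assumption based on this thread, not the actual API), building that string client-side could look like this:

```javascript
// Hypothetical /?base=u response fragment (shape assumed from this discussion).
const response = {
  'ù': { equivalents: [{ raw: 'ù' }, { raw: 'u\u0300' }] },
  'ú': { equivalents: [{ raw: 'ú' }, { raw: 'u\u0301' }] },
  'û': { equivalents: [{ raw: 'û' }] },
  'ü': { equivalents: [{ raw: 'ü' }] },
  'ů': { equivalents: [{ raw: 'ů' }] },
  'ū': { equivalents: [{ raw: 'ū' }] }
};

// Take equivalents[0].raw of every entry and prepend the base character.
function buildCharacterClass(base, data) {
  return base + Object.values(data)
    .map(entry => entry.equivalents[0].raw)
    .join('');
}

console.log(buildCharacterClass('u', response)); // → uùúûüůū
```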
You're right, there should be some parameters to ignore some data, e.g. a value to ignore equivalents, or just some of the equivalents (by name, e.g. `unicode`). Ignoring either `base` or `decompose` makes no sense in my opinion, as both are optional and both are mapping information. Some diacritics have a `base` and no `decompose`, and vice versa.
> In the case of the mark.js repo, if you added a placeholder for, say, `u` using `/?base=u`, we'd need the API to return a string of all `equivalents[0].raw` to create the desired output of `uùúûüůū`.
Yes, that `format` parameter wouldn't be part of the API in my opinion. This is something you'd specify in the placeholder, but it would be handled by the npm module. In the case of mark.js, that would be an entire array, not just one limited by e.g. `u`.
I've thought about this again, and making the `format` parameter part of the API has one benefit: it would allow access to these formats outside the npm module. This is especially helpful if we want to show the code on the website, whether the array or the entire method. Users could then just copy and paste the code into their applications, which would be another good solution for projects without a build. So I'm open to this option.
If we introduce an option to specify the output structure (non-JSON), then in my opinion this shouldn't be a parameter (e.g. `?output=js-array`). Everything under the `/` route currently generates JSON. So the cleanest thing would be to introduce a new route, e.g. `/js-array/?language=DE`, where `js-array` is the output structure and the parameters are just like those for the `/` route.
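To illustrate, such routes could reuse the same filtering logic and only swap the serializer at the end. This is a pure sketch under the assumptions of this thread (route names and output shape are not final), independent of any real HTTP framework:

```javascript
// Sketch: same filtered data, different serializer per route prefix.
const serializers = {
  // default "/" route: plain JSON
  '': data => JSON.stringify(data),
  // hypothetical "/js-array" route: a ready-to-paste JS array literal
  'js-array': data => 'var diacritics = ' + JSON.stringify(Object.keys(data)) + ';'
};

function serve(route, filteredData) {
  const serialize = serializers[route];
  if (!serialize) throw new Error('Unknown route: /' + route);
  return serialize(filteredData);
}

// e.g. the result of filtering with ?language=DE (dummy data)
const filtered = { 'ü': { mapping: { base: 'u' } } };
console.log(serve('js-array', filtered));
// → var diacritics = ["ü"];
```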
@Mottie What do you think?
Sorry for not responding earlier!
Do you still think that this will be necessary? I think the npm module would be able to output any format you need. I am starting to think that adding a new route would probably complicate the use of the API.
As an aside, I have started to work out what the npm module will provide, and I've gotten stuck on how to deal with characters that are not going to be included under any language... like what happens when someone tries to remove diacritics from a crazy string like Iлtèrnåtïonɑlíƶatï߀ԉ? So I think the solution would be to create an `en` entry in the database that covers all the non-language-specific diacritics. It's going to be huge.
> Do you still think that this will be necessary? I think the npm module would be able to output any format you need. I am starting to think that adding a new route would probably complicate the use of the API.
I think it has one big advantage: it would allow copy and paste directly from the website. I'm currently imagining how our (later-coming) website could look. I imagine a website with a table full of diacritics, with their metadata and mapping information. You can filter and sort everything, and finally you can just click "get code", select a language and structure (e.g. JavaScript and object or array), and get the code. This could be done using an option of the API. If it were part of the npm module, we would have to either create redundant code (npm module and website) or not implement such a button at all.
Interesting point in your second paragraph. Why would you call it "en"? And how would you map all kinds of Unicode characters?
> I think it has one big advantage: it would allow copy and paste directly from the website.
Ok, sounds good then!
> Why would you call it "en"?
Well, English doesn't really include diacritics, and even when we do use them, we ignore them all the time. I didn't want to name it something like `default` and make an exception in the spec. So the format will follow the spec like all the other languages.
One block of entries will remove all combining diacritics... so the base would be an empty string:

```js
"data": {
    // combining diacritics -> convert to empty string
    "\u0301": { "mapping": { "base": "" } },
    "\u0300": { "mapping": { "base": "" } },
    "\u0306": { "mapping": { "base": "" } },
    "\u0302": { "mapping": { "base": "" } },
    "\u030c": { "mapping": { "base": "" } },
    ...
}
```
Then we could include the decomposing of other symbols like `①` and `⑴` into `(1)`...
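For comparison, mapping every combining mark to an empty-string base is effectively what Unicode NFD decomposition plus removal of the combining range achieves in plain JavaScript (a known technique, shown here only to illustrate what those database entries amount to):

```javascript
// Decompose a string (NFD) and drop the combining-mark code points
// (U+0300–U+036F) — the same effect as mapping each mark to base "".
function stripCombining(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripCombining('ùúûüůū')); // → uuuuuu
console.log(stripCombining('Café'));   // → Cafe
```

Note this only handles combining marks: characters like `ƶ` or `߀` have no decomposition, which is exactly why the proposed entry would still need explicit mappings.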
@Mottie I think that including special characters that aren't diacritics makes sense (e.g. "①"), but we can't call them "diacritics".
I'd say we should decide whether to include them depending on the effort. Is there any existing database like the one for the HTML entities? If so, then we can continue by creating a new file in the build folder and adding them to the generated diacritics.json. A new API option should also allow excluding them. If there's no such database and it's a lot of effort, I don't think we can continue with it, at least not at the current time. In my opinion we should focus on creating the npm module, hopefully before New Year. If this takes too much time, it may be better to discuss it later when we have time. In that case I'd personally find it confusing to name it "en" when English doesn't contain diacritics. I think the cleanest solution would be to create a single JSON file directly in the src folder.
Btw: is the "Write" tab (textarea) also delayed for you while typing?
I'm not having any issues with the textarea.
I started with a bunch of characters and plugged them into node-unidecode, which stripped out the diacritics, and then added them to the data set... although some results ended up as `[?]`. The list I was working on is nowhere near complete.
In the meantime, I'll put this part on hold and continue working on the npm module.
> is nowhere near complete
How do you know that? And where did you get the data from?
Ping @Mottie. And what's the current status with the npm module spec?
Hey @julmot!
I'll clean up what I have in the works and post it in about 4 hours (after I go to the gym)... it is still incomplete, but it'll give you an idea of where things are now.
As I said, it's still a work-in-progress... https://github.com/diacritics/node-diacritics-transliterator/tree/initial
@Mottie Would you mind submitting a PR? That would allow us to have a conversation about it directly in the repository.
Finally, we're in the end phase and going live soon.
The purpose of this repository is to collect diacritics with their associated ASCII characters in a structured form. It should be the central place for various projects when it comes to diacritics mapping.
As there is no single, trustworthy and complete source, all information needs to be collected by users manually.
Example mapping:
User Requirements
Someone using diacritics mapping information.
It should be possible to:
Contributor Requirements
Someone providing diacritics mapping information.
Assuming every contributor has a GitHub account and is familiar with Git.
Providing information should be:
System Specification
There are two ways of realization:
Create a database in a third-party service that fits the user and contributor requirements.
Tested:
Because we're not aware of other third-party services that could fit the user and contributor requirements, we'll continue with the first option.
System Requirements
See the documentation and pull request.
Build & Distribution
Build
According to the contributor requirements, it should be possible to compile source files without requiring a Git clone. This means we can't require users to run e.g. `grunt dist` at the end, since this would require cloning, installing dependencies and running things. What we'll do instead is implement a build bot that runs our build on Travis CI and commits changes directly to a `dist` branch in this repository. Therefore, once you merge or commit something, the `dist` branch will be updated automatically. Some people are already doing this to update their `gh-pages` branch when something changes in the `master` branch (e.g. this script).
Since we'll use a server-side component to filter and serve the actual mapping information, we just need to generate one `diacritics.json` file containing all data. To make parsing easier and to encode diacritics as Unicode escape sequences in production, we're going to need a build that minifies the files and encodes diacritics. This should be done using Grunt.
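A build step along those lines could be as small as merging the per-language sources into one file and escaping non-ASCII characters. This is a sketch under the assumptions above; the source shape and file layout are illustrative, and a real build would read the files from disk:

```javascript
// Hypothetical build sketch: merge per-language JSON data into one
// minified diacritics.json string with diacritics encoded as \uXXXX escapes.
function buildDatabase(sources) {
  const json = JSON.stringify(sources); // JSON.stringify emits minified output
  // Encode non-ASCII characters as \uXXXX escapes, per the build requirement.
  return json.replace(/[\u0080-\uffff]/g,
    c => '\\u' + c.charCodeAt(0).toString(16).padStart(4, '0'));
}

// Dummy per-language sources for illustration.
const sources = {
  de: { 'ü': { mapping: { base: 'u', decompose: { value: 'ue' } } } },
  es: { 'ñ': { mapping: { base: 'n' } } }
};
console.log(buildDatabase(sources));
```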
Integrity
In order to ensure integrity and consistency we need the following in our build process:
Distribution
To provide diacritics mappings according to the User Requirements, it's necessary to run a custom server-side component that makes it possible to sort, limit and filter information and output it in different ways (e.g. as a JS object or array). This component should be realized using Node.js, as it's well suited to handling JS/JSON files, whereas PHP would involve a lot more serializing/deserializing.
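The filtering itself stays cheap in Node.js because the data remains a plain object in memory. A sketch of the filtering step alone, independent of the HTTP layer (the `language` parameter name follows this thread; the database shape is dummy data):

```javascript
// `database` stands in for the parsed diacritics.json.
const database = {
  de: { 'ü': { mapping: { base: 'u' } }, 'ö': { mapping: { base: 'o' } } },
  es: { 'ñ': { mapping: { base: 'n' } } }
};

// e.g. handleQuery({ language: 'DE' }) for the route /?language=DE
function handleQuery(params) {
  let result = database;
  if (params.language) {
    const lang = params.language.toLowerCase();
    result = { [lang]: database[lang] || {} };
  }
  return JSON.stringify(result);
}

console.log(handleQuery({ language: 'DE' }));
```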
Next Steps
- Create a `.md` file that specifies the entire database structure in detail

*This comment is updated continuously during the discussion.*