andrewrk / node-diacritics

remove diacritics from strings ("ascii folding") - Node.js module
MIT License

.find() method to retrieve all the group of diacritics from a specific char #20

Open DiegoZoracKy opened 8 years ago

DiegoZoracKy commented 8 years ago

Hi @andrewrk,

What do you think about this? Right now I'm facing a case where I need the group of all possible diacritics for a specific char. I remembered your great list of diacritics, and since your package is named 'diacritics' rather than something like 'remove-diacritics', I thought it would be better to extend it with one more method instead of creating another package.

I already created the new method:

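function findDiacritics(chr) {
  // Find the entry whose base equals chr or whose diacritics include chr
  var diacriticsFound = replacementList.find(o => o.base === chr || o.chars.indexOf(chr) >= 0);
  // Return the base followed by all of its diacritics, or null when chr is not in the list
  return diacriticsFound ? diacriticsFound.base + diacriticsFound.chars : null;
}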

If you think it is OK, I can send you a pull request.

thejoshwolfe commented 8 years ago

What about just exporting replacementList? Then you can do this search in your own code, or any other search you might want to do.

DiegoZoracKy commented 8 years ago

Exporting replacementList would be good too. But with just the list, I and other developers working on a similar case would each have to write this same code.

It's the same goal as the remove method: instead of just exposing the list, you created a method to help. So I thought it could be good to have this helper in the package. But it's OK if you don't agree. Do you think you will update it to export replacementList?

thejoshwolfe commented 8 years ago

I'll have to defer to @andrewrk on this, but in my own opinion, I have to admit, I don't really understand what the function is supposed to be used for. In particular, you lose some information when you concatenate 'AE' with "\u00C6\u01FC\u01E2". What are you going to do with the "group of all possible diacritics" when you get it? If I were going to write documentation for this function, I'd be at a loss to describe what it really does without just describing the code.

Can you give more information on the use case for this function?

DiegoZoracKy commented 8 years ago

To make a diacritic-insensitive RegExp. Example: I have a text that contains the word 'ação'. Assume we are handling some kind of search engine, where the input could be written correctly as 'ação' but could also have a typo like 'açao', 'acão', etc.

By having the group of diacritics I can easily create a RegExp like: /a[ccćĉċčçḉƈȼↄ][aaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ]o/i

thejoshwolfe commented 8 years ago

did you mean /[aaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][ccćĉċčçḉƈȼↄ][aaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][oⓞoòóôồốỗổõṍȭṏōṑṓŏȯȱöȫỏőǒȍȏơờớỡởợọộǫǭøǿꝋꝍɵɔᴑ]/i? It looks like the function is prepared to look up simple ascii characters as well (o.base == chr).

Isn't there a problem with multi-char diacritics like 'Æ'? Wouldn't the regex for "Cæsar" fail to match against the string "Caesar"?
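To make the multi-char concern concrete, here is a minimal sketch. It assumes a hypothetical two-entry excerpt of the list (sampleList, with contents invented for illustration) and a hypothetical helper toCharClassPattern that emits one character class per input character, as in the proposal above.

```javascript
// Hypothetical two-entry excerpt; the real replacementList is much longer.
const sampleList = [
  { base: 'a', chars: 'àáâãä' },
  { base: 'ae', chars: 'æǽǣ' },
];

// The lookup proposed earlier in this thread, run against the sample list.
function findDiacritics(chr) {
  const found = sampleList.find(o => o.base === chr || o.chars.includes(chr));
  return found ? found.base + found.chars : null;
}

// Hypothetical helper: one character class per input character.
function toCharClassPattern(str) {
  return Array.from(str, chr => `[${findDiacritics(chr) || chr}]`).join('');
}

// Anchored so that the whole target string has to match.
const pattern = new RegExp('^' + toCharClassPattern('Cæsar') + '$', 'i');

console.log(pattern.test('Cæsar'));  // true
console.log(pattern.test('Caesar')); // false
```

The class built for 'æ' is [aeæǽǣ], which consumes exactly one character, so the two-character spelling "ae" can never satisfy it.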

thejoshwolfe commented 8 years ago

how about this function:

function charToRegexPattern(chr) {
  for (var i = 0; i < replacementList.length; i++) {
    var replacement = replacementList[i];
    if (replacement.chars.indexOf(chr) === -1) continue;
    if (replacement.base.length > 1) {
      // allow the complete multi-char sequence or a literal diacritic character
      return '(?:' + replacement.base + '|[' + replacement.chars + '])';
    } else {
      // allow the ascii char or a literal diacritic character
      return '[' + replacement.base + replacement.chars + ']';
    }
  }
  // either already ascii or not a diacritic char
  return chr;
}

It's arguably less "general purpose", since it returns strings formatted for regex, but I think it's the only way to make it actually work for multi-char sequences, like "ae".
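A rough usage sketch of that function applied to a whole string. The two-entry replacementList below is a hypothetical excerpt, the function body is a condensed restatement of the one above, and stringToRegex is a helper name invented for this example.

```javascript
// Hypothetical excerpt of replacementList, for illustration only.
const replacementList = [
  { base: 'a', chars: 'àáâãä' },
  { base: 'ae', chars: 'æǽǣ' },
];

// Same logic as charToRegexPattern above, condensed.
function charToRegexPattern(chr) {
  for (const { base, chars } of replacementList) {
    if (!chars.includes(chr)) continue;
    return base.length > 1
      ? `(?:${base}|[${chars}])` // multi-char base: the whole sequence or one diacritic
      : `[${base}${chars}]`;     // single-char base: a plain character class
  }
  return chr; // either already ascii or not in the list
}

// Hypothetical helper: apply the lookup to every character of the input.
function stringToRegex(str) {
  const body = Array.from(str, c => charToRegexPattern(c)).join('');
  return new RegExp('^' + body + '$', 'i');
}

console.log(stringToRegex('Cæsar').source);         // ^C(?:ae|[æǽǣ])sar$
console.log(stringToRegex('Cæsar').test('Caesar')); // true
console.log(stringToRegex('Cæsar').test('Cæsar'));  // true
```

Only diacritic input is expanded here: the plain 'a' in "Cæsar" stays a literal 'a' in the pattern.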

DiegoZoracKy commented 8 years ago

Yes @thejoshwolfe, I meant exactly what you wrote in the first RegExp. I just kept mine short to give a simple example.

With the version I wrote, I would use it in a case like this:

function toRegExp(str){
    return RegExp(str.split('').map(chr => `[${diacritics.find(chr) || chr}]`).join(''), 'gi');
}

let str = 'acaoae1ae';
let strDiacritic = 'açãoae1æ';

// RegExp will be: /[aⓐaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][ccⓒćĉċčçḉƈȼꜿↄ][aⓐaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][oⓞoòóôồốỗổõṍȭṏōṑṓŏȯȱöȫỏőǒȍȏơờớỡởợọộǫǭøǿꝋꝍɵɔᴑ][aⓐaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][eⓔeèéêềếễểẽēḕḗĕėëẻěȅȇẹệȩḝęḙḛɇǝ][1][aeæǽǣ]/gi
// And "str" will match the RegExp built from "strDiacritic"
str.match(toRegExp(strDiacritic))

Note that the expected input can be either a diacritic or a base char, while your charToRegexPattern expects only a diacritic. The base char would never be "expanded", so it won't work in my example, where the input 'acao' should match 'ação'. I wouldn't be able to find out the possible diacritics for a base char.

And yes, this version does not handle a diacritic whose base has length > 1.
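One possible way to combine the two approaches, as a sketch only and not something either version in this thread implements: expand base chars as well as diacritics, and treat a multi-char base such as "ae" as a unit wherever it appears in the input. The sampleList contents and the names escapeRegExp, singleCharClass, toInsensitivePattern and matches are all assumptions made for this example.

```javascript
// Hypothetical excerpt of the list; entries and contents are assumptions.
const sampleList = [
  { base: 'a', chars: 'àáâãä' },
  { base: 'c', chars: 'çćč' },
  { base: 'o', chars: 'òóôõö' },
  { base: 'ae', chars: 'æǽǣ' },
];

// Escape characters that are special in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Class for one character: its single-char base plus all diacritic forms.
function singleCharClass(chr) {
  const entry = sampleList.find(
    o => o.base.length === 1 && (o.base === chr || o.chars.includes(chr))
  );
  return entry ? `[${entry.base}${entry.chars}]` : escapeRegExp(chr);
}

function toInsensitivePattern(str) {
  const multi = sampleList.filter(o => o.base.length > 1);
  let pattern = '';
  let i = 0;
  while (i < str.length) {
    // Case 1: a multi-char base such as "ae" starts at this position.
    const byBase = multi.find(o => str.startsWith(o.base, i));
    // Case 2: the current character is a diacritic with a multi-char base.
    const byChar = multi.find(o => o.chars.includes(str[i]));
    const entry = byBase || byChar;
    if (entry) {
      // Accept the spelled-out base (each char expanded) or a single diacritic.
      const spelledOut = Array.from(entry.base, c => singleCharClass(c)).join('');
      pattern += `(?:${spelledOut}|[${entry.chars}])`;
      i += byBase ? entry.base.length : 1;
    } else {
      pattern += singleCharClass(str[i]);
      i += 1;
    }
  }
  return pattern;
}

// Whole-string, case-insensitive comparison of an input against a target.
function matches(input, target) {
  return new RegExp('^' + toInsensitivePattern(input) + '$', 'i').test(target);
}

console.log(matches('acao', 'ação'));    // true
console.log(matches('ação', 'acao'));    // true
console.log(matches('Caesar', 'Cæsar')); // true
console.log(matches('Cæsar', 'Caesar')); // true
```

The sketch indexes the input by UTF-16 code unit and does not escape characters inside the generated classes, which is fine for the sample entries but would need care with a list containing characters such as ']' or '\'.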