bluelovers / ws-regexp


[@lazy-cjk/korean-romanize] Does not accept single character "words" #1

Closed Bas950 closed 3 days ago

Bas950 commented 3 weeks ago

Input: ㄱㄱㅎ

Error: Not a Hangul syllable: ㄱ

Expected: k k h or g g h or well something like this.

bluelovers commented 2 weeks ago

u can try new version now

Bas950 commented 2 weeks ago

Working with 50k+ lines of user input is always lovely xD

It still crashes on the following characters:

| Input | Output |
| --- | --- |
| ☆﹐﹑veevee˚ㆍ | RangeError: Not a Hangul syllable: ㆍ, index: -31347 should not >= 11172 |
| ㅔㅣ | RangeError: Not a Hangul syllable: ㅔ, index: -31404 should not >= 11172 |
| ㅑㅣㅗ므ㅛㅕㅇㅁ | RangeError: Not a Hangul syllable: ㅑ, index: -31407 should not >= 11172 |
| ㅇㅜㅇ | RangeError: Not a Hangul syllable: ㅜ, index: -31396 should not >= 11172 |
| ㅏㅏ | RangeError: Not a Hangul syllable: ㅏ, index: -31409 should not >= 11172 |
| Whiskers ˶˃ᆺ˂˶ | RangeError: Not a Hangul syllable: ᆺ, index: -39494 should not >= 11172 |

For the first one, I don't know if I just need to make an if case on my side, or it's something you can do on your side.

All the other errors I got seem to be fixed in the latest version.
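For anyone reading along, the negative indexes in those errors match each character's code point minus U+AC00, the start of the Unicode Hangul Syllables block (11172 precomposed syllables, U+AC00 to U+D7A3). Standalone jamo such as ㅔ or ᆺ sit in lower blocks, so the offset goes negative. A minimal sketch of that check follows; the helper names are hypothetical and this is not the library's actual code, only an illustration of the arithmetic.

```typescript
// Hangul Syllables block: U+AC00 (가) through U+D7A3 (힣), 11172 code points.
const HANGUL_BASE = 0xac00
const HANGUL_COUNT = 11172

// Hypothetical helper (not part of @lazy-cjk/korean-romanize):
// offset of the first code point of `char` from U+AC00.
function hangulSyllableIndex(char: string): number {
  const codePoint = char.codePointAt(0)
  if (codePoint === undefined)
    throw new RangeError('Expected a non-empty string')
  return codePoint - HANGUL_BASE
}

// Hypothetical helper: true only for precomposed syllables such as 므,
// false for standalone jamo such as ㅔ or ᆺ.
function isHangulSyllable(char: string): boolean {
  const index = hangulSyllableIndex(char)
  return index >= 0 && index < HANGUL_COUNT
}

console.log(hangulSyllableIndex('ㆍ')) // -31347
console.log(hangulSyllableIndex('ᆺ')) // -39494
console.log(isHangulSyllable('므')) // true
console.log(isHangulSyllable('ㅔ')) // false
```

A guard like `isHangulSyllable` is one way to do the "if case on my side" before calling a syllable-based romanizer.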

bluelovers commented 1 week ago

https://github.com/bluelovers/ws-regexp/issues/2#issuecomment-2324163187

bluelovers commented 1 week ago

https://github.com/bluelovers/ws-regexp/commit/7ff26d3ca5649c14491370d7cb4610bcc29900e1

Bas950 commented 1 week ago

7ff26d3

Heya, thanks for adding the options, I think I will be able to use stripUnSupported.

For context: my project needs to romanize many languages, including CJK, which is what I use your packages for, but they shouldn't all be Romanized at once as in your @lazy-cjk/slugify. Other functions run in between, and each romanization gets its own changelog.

So currently I use @lazy-cjk/korean-romanize's romanize function for Korean, and will test the stripUnSupported option in a bit. For Japanese, I use @lazy-cjk/japanese's romanize function, but that one removes all non-Japanese characters, which is not what I need. For Chinese, I currently use pinyin; if you have a better package for this, LET ME KNOW!

And as in your comment under the other issue, these Romanizations I need shouldn't touch any emojis or anything non-CJK. Those should stay untouched.

bluelovers commented 1 week ago

image

Bas950 commented 1 week ago

image

but they shouldn't all be Romanized at once as in your @lazy-cjk/slugify.

Won’t fit what I need, as I need them separately. Not CJK all at once. Cause I need to do Korean, do some changelog stuff, then Japanese, do some changelog stuff etc.
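To make the "one language at a time, with changelog stuff in between" requirement concrete, here is a rough sketch of the shape I mean. Everything in it is hypothetical: the `Step` and `ChangelogEntry` types, the `runSteps` helper, and the two placeholder romanizers, which only stand in for the real per-language functions.

```typescript
type Romanizer = (text: string) => string

interface Step {
  lang: string
  romanize: Romanizer
}

interface ChangelogEntry {
  lang: string
  before: string
  after: string
}

// Runs each romanizer in order and records one changelog entry per step
// that actually changed the text. Characters no step touches pass through.
function runSteps(text: string, steps: Step[]): { text: string, changelog: ChangelogEntry[] } {
  const changelog: ChangelogEntry[] = []
  let current = text
  for (const step of steps) {
    const after = step.romanize(current)
    if (after !== current)
      changelog.push({ lang: step.lang, before: current, after })
    current = after
  }
  return { text: current, changelog }
}

// Placeholder romanizers for illustration only; real code would call the
// per-language romanize functions here instead.
const steps: Step[] = [
  { lang: 'kor', romanize: s => s.split('한').join('han') },
  { lang: 'jpn', romanize: s => s.split('か').join('ka') },
]

const outcome = runSteps('한 か ☆', steps)
console.log(outcome.text) // han ka ☆
console.log(outcome.changelog.length) // 2
```

The point is that each step sees the output of the previous one and leaves everything it does not recognise alone, which a single all-in-one CJK call can't give me.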

Bas950 commented 1 week ago

I mean it's the right thing, I just need that function for each CJK language separately

Bas950 commented 1 week ago

FYI: I have fixed most of the issues I was having by doing the following:

Korean:

import { romanize } from '@lazy-cjk/korean-romanize'

const kor = {
  function: (string: string) => romanize(string, { stripUnSupported: true }),
}

Japanese:

import { hiraganaRegex, katakanaRegex, romanize } from '@lazy-cjk/japanese'

const japaneseTextRegex = new RegExp(`(?:(?:${katakanaRegex.source})|(?:${hiraganaRegex.source}))+`, 'gu')

const jpn = {
  function: (text: string) => {
    text = text.replace(japaneseTextRegex, (string) => {
      const romanized = romanize(string)

      if (romanized !== '')
        return romanized

      /* c8 ignore next */
      return string
    })

    // Shared helper, defined at the end of this comment
    text = removeChineseJapanesePunctuation(text)

    return text
  },
}

Chinese:

import { pinyin } from 'pinyin'

const cmn = {
  function: (string: string) => {
    const romanizedWord = pinyin(string, {
      style: pinyin.STYLE_NORMAL,
      segment: true,
    })

    let newWord = ''
    for (const words of romanizedWord) newWord += `${words[0]}`
    newWord = removeChineseJapanesePunctuation(newWord)
    return newWord.trim()
  },
}

Chinese and Japanese both use this function:

import { romanizePuncutuationTable } from '@lazy-cjk/japanese'

export function removeChineseJapanesePunctuation(text: string): string {
  for (const [hiragana, romanized] of Object.entries({
    ...romanizePuncutuationTable,
    ',': ',',
  })) {
    text = text.replaceAll(hiragana, romanized)
  }
  return text
}

bluelovers commented 1 week ago

https://github.com/bluelovers/ws-regexp/commits?author=bluelovers&since=2024-09-03&until=2024-09-03

Bas950 commented 1 week ago

https://github.com/bluelovers/ws-regexp/commits?author=bluelovers&since=2024-09-03&until=2024-09-03

Fixed my issues in Japanese, one little weird issue tho:

import { romanize } from '@lazy-cjk/japanese'

const result = romanize('(C)ookieッ', { ignoreUnSupported: true })
// Expected (C)ookie'
console.log(result) // (C)ōkie'

You don't get the issue when doing

import { hiraganaRegex, katakanaRegex, romanize } from '@lazy-cjk/japanese'

const japaneseTextRegex = new RegExp(`(?:(?:${katakanaRegex.source})|(?:${hiraganaRegex.source}))+`, 'gu')

const text = '(C)ookieッ';
const result = text.replace(japaneseTextRegex, string => romanize(string, { ignoreUnSupported: true }));

console.log(result) // (C)ookie'
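The workaround amounts to handing only runs of kana to the romanizer, so Latin text like "oo" never reaches it. A self-contained sketch of the same idea, without the library: it assumes Unicode script property escapes are an acceptable stand-in for hiraganaRegex and katakanaRegex (the library's regexes may cover a different set of characters), and `romanizeKanaRuns` plus the bracket-wrapping callback are hypothetical.

```typescript
// One or more consecutive Hiragana/Katakana characters.
// Built with the RegExp constructor, like the snippet above.
const kanaRun = new RegExp('[\\p{Script=Hiragana}\\p{Script=Katakana}]+', 'gu')

// Hypothetical helper: applies `romanizeRun` to each kana run only and
// leaves every other character exactly as it was.
function romanizeKanaRuns(text: string, romanizeRun: (run: string) => string): string {
  return text.replace(kanaRun, (run) => {
    const romanized = romanizeRun(run)
    // Fall back to the original run if the romanizer returns nothing
    return romanized !== '' ? romanized : run
  })
}

// Stand-in romanizer that just marks what it was given.
const marked = romanizeKanaRuns('(C)ookieッ', run => `[${run}]`)
console.log(marked) // (C)ookie[ッ]
```

With the real romanize as the callback, only ッ is passed in and "(C)ookie" is never seen by it.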

Also did you have any suggestions for Chinese, did you have a package for that in @lazy-cjk? Or should I continue using the pinyin npm package.

bluelovers commented 1 week ago

Also did you have any suggestions for Chinese, did you have a package for that in @lazy-cjk? Or should I continue using the pinyin npm package.

https://github.com/bluelovers/ws-regexp/blob/master/packages/%40lazy-cjk/slugify/lib/cjk/chinese.ts

Bas950 commented 3 days ago

Seems to be working well! Thanks for all the help!