PygmalionAI / data-toolbox

Our data munging code.
GNU Affero General Public License v3.0

feat: add normalization script #41

Closed AlpinDale closed 11 months ago

AlpinDale commented 11 months ago

This PR adds a script that normalizes text to Unicode Normalization Form C (NFC).

In NFC, characters are composed as much as possible. For example, in Unicode, an "e" with an acute accent (é) can be represented in two ways:

  1. As a single precomposed character (é): U+00E9
  2. As a combination of the letter "e" (U+0065) and the combining acute accent (U+0301)

The NFC script will transform the second form into the first, precomposed form.
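
For illustration, here is a minimal sketch of that transformation using Python's standard `unicodedata` module. The helper name `to_nfc` is hypothetical and is not necessarily how the script in this PR is written:

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return `text` in Normalization Form C (composed where possible)."""
    return unicodedata.normalize("NFC", text)


# Hypothetical sample inputs matching the two representations above.
decomposed = "e\u0301"  # "e" (U+0065) + combining acute accent (U+0301)
precomposed = "\u00e9"  # single precomposed character (U+00E9)

normalized = to_nfc(decomposed)

# The two-code-point sequence collapses to the single precomposed character.
print(len(decomposed), len(normalized))
print(normalized == precomposed)
```

Text that is already in NFC passes through unchanged, so the helper should be safe to run repeatedly over the same data.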

I'm not entirely familiar with how this repo is structured, so I've added it as a standalone script in the scripts/ directory, with its dependencies declared as optional dependencies in the pyproject file.

TearGosling commented 11 months ago

Added in the complete rewrite, meaning this PR is no longer necessary. Thanks for contributing!