daniel-sc / ng-extract-i18n-merge

Extract and merge i18n xliff translation files for angular projects.
MIT License
179 stars 19 forks source link

When collapsing whitespace retain things like no-break space #88

Closed michaelwildvarian closed 1 year ago

michaelwildvarian commented 1 year ago

When the source code contains "special" space characters, such as no-break space (\u00a0), the current regex (\s+) replaces everything with a single space character (\u0020).

By following an approach that resembles more closely to the whitespace-collapsing implemented by browsers, this can be improved (see e.g. https://developer.mozilla.org/en-US/docs/Web/CSS/white-space-collapse#collapsing_of_white_space).

This PR changes the doCollapseWhitespace() function to transform the input string in the following way:

  1. Sequences of tab characters are replaced by a single space.
  2. Sequences of segment breaks (here interpreted to be line breaks) are replaced by a single space.
  3. Space characters before and after a "space-like but not space" character (e.g. no-break space \u00a0 or zero-width-no-break-space \ufeff) are removed.
  4. Sequences of space characters are collapsed to a single space.

Using this method, the source string hello, \u00a0 world ! becomes hello,\u00a0world! and when being displayed in the browser will not break between hello, and world!.

The tests in builder.spec.ts have been extended to include strings with \u00a0.

daniel-sc commented 1 year ago

Hi @michaelwildvarian thanks for this elaborate PR!!

I have some general questions/remarks (probably due to lack of knowledge on my side!):

If you could address these points, I'd be grateful!

michaelwildvarian commented 1 year ago

Thanks for getting back to me.

  1. When extracting text from the source that contains \u00a0 (or   in an component template), then we'd like that to be retained because it's there for a reason. E.g. if we have values with units you don't want the unit to wrap to the next line. Our setup is Angular 17. I don't think that ViewEngine/Ivy has anything to do with our problem, it is literally the function I modified that indiscriminately collapses everything that matches \s+ into a single space character. When setting collapseWhitespace=false this effect doesn't happen, with the side-effect of having now excessive whitespace in the XLIFF files. Also, what the value of the CSS property white-space-collapse is should not factor in here.
  2. The whitespace collapsing implemented by this package is indeed useful, because when I disabled collapseWhitespace the extracted text did preserve the no-break space, but then contained excessive whitespace stemming from indentations. So I'd definitely want to keep it, just make it less eager to gobble everything up that looks like space.
  3. You're right, I just tried it with Chrome. I'm a too aggressive with the purging before and after no-space-whitespace. Reading the referenced MDN page something like Hello\u0020\u0020\u00a0\u0020\u0020World should be converted to Hello\u0020\u00a0\u0020World, leaving a single space before and after the no-break space. But then, I'd consider that to be an error on behalf of the creator of such a string, as having no-break space surrounded by normal space very much defeats the purpose of the no-break space. And it renders very ugly indeed.

So, should I adapt the collapsing to this effect?

daniel-sc commented 1 year ago

@michaelwildvarian

  1. Ok, I understand your problem and this indeed is a but that should be fixed (with this PR)!

  2. I tried to reproduce it, and for me the default settings with Angular 16.2 do collapse whitespaces in extracted i18n texts - e.g. the following template:

    <div i18n="@@WITH_SUPERFLOUS_WHITESPACES">
    some    text\u00a0after-non-break   \u00a0    after non-break with whitespace   &nbsp;     after NBSP with ws
    </div>

    leads to the following extraction (without any processing by this lib!):

    <segment>
    <source> some text\u00a0after-non-break \u00a0 after non-break with whitespace   after NBSP with ws </source>
    </segment>

    Note the &nbsp; is converted by angular to \u00a0. Is it possible you set preserverWhitespace: false in your comonent - see https://angular.io/api/core/Component#preserveWhitespaces ?

  3. Yes, I think just replacing whitespace wrt to the Browser interpretation would be best - e.g. replace(/(\n|\r| )+/g, ' ')

michaelwildvarian commented 1 year ago

Sorry for being AWOL, was pretty swamped by work...

Regarding 2: I've created a sample project and when using $localize in the TypeScript code of the component, and there it definitely doesn't "pre-collapse" space, but does so when it appears in the HTML template of the component.

Also, I'd really want to keep the collapseWhitespace enabled, because some of our strings are wrapped and indented by prettier in the TypeScript code, and that makes for some really ugly original strings in the XLIFF that the translators will hate us for.

Regarding 3: Should we through \t in there as well? And I'd make it into a character group, i.e. replace(/[\n\r\t ]+/g, ' ') to avoid making an unnecessary capturing group.