Whitespaces getting trimmed when I try to render from convertFromHtml

vikasrungta92 commented 3 years ago

Do you want to request a feature or report a bug?

Bug/Feature

What is the current behaviour?

My application depends heavily on DraftJS. I generate the editorState with the convertFromHTML API, passing in HTML text that has whitespace at both ends.

I observed the following behaviour (read each '-' as one whitespace character):

Passed HTML: ---This is Amazing.--- The text value in the content block is "---This is Amazing.---" => as expected. Demo: link
Passed HTML: ---This is Amazing.--- ("Amazing" is wrapped in an em tag) The text value in the content block is "---This is Amazing.---" => as expected. Demo: link
Passed HTML: ---This is Amazing.--- ("This" is wrapped in an em tag) The text value in the content block is "This is Amazing." => incorrect. Demo: link

That is, when an HTML element comes immediately after the whitespace, all of the whitespace is dropped.

Note: in the demo, one whitespace is dropped each time.

I am not sure why the whitespaces are getting dropped from HTML.

Scenarios Tested but no luck:

Replaced the leading and trailing whitespaces with &nbsp;
Used "white-space: pre" in CSS.
Used pre HTML tag.

Can anyone please tell me if I am missing something here? I need to retain this whitespace because the HTML being passed is critical content that we cannot drop.

Or, is there any workaround I can use to make it work?

Versions used:
Chrome: 88.0.4324.150
DraftJS: 0.10.x
Mac/Windows: latest

Thanks in advance.

vikasrungta92 commented 3 years ago

Any update on the issue please.

vikasrungta92 commented 3 years ago

Any update on this issue?

BertrandBordage commented 1 year ago

I created a workaround that may only work in certain cases. In my case it is for preserving leading spaces, specifically for Draftail by @thibaudcolas, an integration of Draft.js to Wagtail.

The workaround is to modify the HTML on the backend side:

Replace runs of mixed spaces, \n and non-breaking spaces with non-breaking spaces only.
Add a zero-width space before each group of non-breaking spaces. A zero-width space is not treated as a whitespace character, but it is invisible and takes up no width.
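To illustrate why the zero-width space helps, here is a small sketch that uses Python's own notion of whitespace. That Draft.js or the browser classify these characters the same way is an assumption, and the constant and variable names are made up for this example.

```python
# Illustration only: Python's whitespace rules stand in for whatever
# Draft.js or the browser actually apply (an assumption, not verified).
NBSP = '\u00A0'
ZERO_WIDTH_SPACE = '\u200B'

# A non-breaking space counts as whitespace, a zero-width space does not.
print(NBSP.isspace())              # True
print(ZERO_WIDTH_SPACE.isspace())  # False

plain = NBSP * 2 + 'This is Amazing.'
guarded = ZERO_WIDTH_SPACE + plain

# strip() eats the leading NBSPs unless a zero-width space shields them.
print(plain.strip() == 'This is Amazing.')  # True
print(guarded.strip() == guarded)           # True
```

Any trimming step that only looks for whitespace at the edges of a string stops at the zero-width space, so the non-breaking spaces behind it are left alone.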

In my case I only wanted to preserve leading whitespace that contains at least one NBSP, so I wrote this function, which I run every time I save HTML. It has to be adjusted to fit other use cases, but the idea of normalizing spaces and inserting a zero-width space could work for you too.

import re

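# A newline followed by indentation (spaces or tabs).
BREAKING_SPACES_AFTER_NEWLINE_RE = re.compile(r'\n([ \t]+)')
# Two or more spaces, optionally with a single HTML tag in between.
SUCCESSIVE_BREAKING_SPACES_RE = re.compile(r' +(<[^>]+?>)? +')
# A single space right after a self-closing <br/> or <br /> tag.
BREAKING_SPACE_AFTER_BR = re.compile(r'(<br\s*/>) ')

def normalize_whitespaces(html: str) -> str:
    # Rules taken from https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace
    # Drop each newline together with the indentation that follows it.
    html = BREAKING_SPACES_AFTER_NEWLINE_RE.sub(r'', html)
    # Treat the remaining tabs and newlines as plain spaces.
    html = html.replace('\t', ' ').replace('\n', ' ')
    # Collapse each run of spaces into one space, placed before the tag if any.
    html = SUCCESSIVE_BREAKING_SPACES_RE.sub(r' \1', html)
    # Remove the space that directly follows a line break tag.
    html = BREAKING_SPACE_AFTER_BR.sub(r'\1', html)
    return html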

# A regular space directly before a non-breaking space (NBSP or narrow NBSP).
LEADING_BREAKING_SPACE_RE = re.compile(r' (?=[\u00A0\u202F])')
# A regular space directly after a non-breaking space.
TRAILING_OR_NESTED_BREAKING_SPACE_RE = re.compile(r'(?<=[\u00A0\u202F]) ')
# A run of non-breaking spaces not already preceded by a zero-width space.
NBSP_GROUP_RE = re.compile(r'(?<![\u200B\u00A0\u202F])([\u00A0\u202F]+)')

def fix_draftail_leading_trailing_whitespaces(html: str) -> str:
    html = normalize_whitespaces(html)
    # We force successions of breaking & non-breaking spaces to be converted
    # into non-breaking spaces only, otherwise Draftail (maybe even Draft.js)
    # ignores all spaces.
    html = LEADING_BREAKING_SPACE_RE.sub('\u00A0', html)
    html = TRAILING_OR_NESTED_BREAKING_SPACE_RE.sub('\u00A0', html)

    # We add a zero-width space before groups of non-breaking spaces.
    # This is a workaround because Draftail (maybe even Draft.js) ignores
    # leading non-breaking spaces.
    # Zero-width spaces are not considered as spaces by most libraries.
    # This is why this trick works for preserving leading non-breaking spaces.
    # Those zero-width spaces can be removed from the data at any point,
    # they do not mean anything in terms of data.
    zero_width_space = '\u200B'
    return NBSP_GROUP_RE.sub(fr'{zero_width_space}\1', html)
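
The comments in the function note that the zero-width spaces carry no meaning and can be removed from the data at any point. A minimal sketch of that cleanup could look like the following. It assumes none of the U+200B characters in the data were put there on purpose, and strip_zero_width_spaces is a hypothetical helper name, not part of the workaround above.

```python
ZERO_WIDTH_SPACE = '\u200B'

def strip_zero_width_spaces(text: str) -> str:
    # Removes every U+200B, including any that were not added by the
    # workaround, so only use it where that is acceptable.
    return text.replace(ZERO_WIDTH_SPACE, '')

# The non-breaking spaces stay, only the zero-width space goes away.
cleaned = strip_zero_width_spaces('\u200B\u00A0\u00A0This is Amazing.')
print(cleaned == '\u00A0\u00A0This is Amazing.')  # True
```

This would typically run when the content leaves the editor pipeline, for example before exporting or indexing it.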

I could finally find a use for my knowledge in weird Unicode characters :laughing:

facebookarchive / draft-js

Whitespaces getting trimmed when I try to render from convertFromHtml #2830
