Prevent source content from getting entirely replaced by CKEditor5's normalized HTML output?

adamerose commented 1 month ago

I'm looking for advice on getting around the fact that CKEditor5 always normalizes source data.

The docs say that is core behavior that cannot be changed (link), but this is a big problem if the user or developer cares about the source, for example using CKE5 to edit pre-existing HTML or markdown content. My own use case is a VS Code extension for viewing and editing .md files.

Is there any guidance for use cases like this?

The problems are...

All non-renderable content is lost, including comments, <script>, <meta>, and <style>
Syntax gets changed, e.g. the tag <em> becomes <i>, and Markdown bullets - become *
All formatting and indentation gets lost

Is there any way CKEditor5 could be modified to optionally normalize only the chunks of source that were modified instead of the entire thing? If I were to solve this without rewriting the CKEditor5 internals, I think I would have to do something like...

Run the CKE normalization on both the original source and editor output
Perform a diff between them to identify blocks that actually changed
Merge those changes back into the original source

But that seems like a fragile solution, so I'm hoping for feedback.
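For illustration, here is a minimal sketch of those three steps at block level. It is only a sketch under assumptions: blocks are taken to be separated by blank lines, and `normalize` is a hypothetical stand-in for a round trip through the editor (it is not a CKEditor 5 API). Blocks that normalize to an empty string, such as comments, are treated as non-renderable and kept in place.

```typescript
// Hypothetical stand-in for a round trip through the editor; not a CKEditor 5 API.
type Normalize = (block: string) => string;

// Assumption: blocks are separated by one or more blank lines.
function splitBlocks(source: string): string[] {
  return source.split(/\n{2,}/).filter((block) => block.trim() !== "");
}

// Longest-common-subsequence table over two lists of blocks.
function lcsTable(a: string[], b: string[]): number[][] {
  const table: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    const row: number[] = [];
    for (let j = 0; j <= b.length; j++) row.push(0);
    table.push(row);
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      table[i][j] =
        a[i] === b[j]
          ? table[i + 1][j + 1] + 1
          : Math.max(table[i + 1][j], table[i][j + 1]);
    }
  }
  return table;
}

function mergePreservingSource(
  original: string,
  editorOutput: string,
  normalize: Normalize
): string {
  const originalBlocks = splitBlocks(original);
  // Step 1: normalize the original so it is comparable with the editor output.
  const normalizedOriginal = originalBlocks.map(normalize);
  const outputBlocks = splitBlocks(editorOutput);
  // Step 2: diff the two block lists.
  const table = lcsTable(normalizedOriginal, outputBlocks);

  // Step 3: walk the diff and keep the original text of unchanged blocks.
  const merged: string[] = [];
  let i = 0;
  let j = 0;
  while (j < outputBlocks.length) {
    if (i < originalBlocks.length && normalizedOriginal[i] === "") {
      merged.push(originalBlocks[i]); // non-renderable block: keep as-is
      i++;
    } else if (i < originalBlocks.length && normalizedOriginal[i] === outputBlocks[j]) {
      merged.push(originalBlocks[i]); // unchanged: keep the original syntax
      i++;
      j++;
    } else if (i < originalBlocks.length && table[i + 1][j] >= table[i][j + 1]) {
      i++; // original block was deleted or replaced
    } else {
      merged.push(outputBlocks[j]); // new or edited block: take the editor output
      j++;
    }
  }
  // Keep any trailing non-renderable blocks from the original.
  for (; i < originalBlocks.length; i++) {
    if (normalizedOriginal[i] === "") merged.push(originalBlocks[i]);
  }
  return merged.join("\n\n");
}
```

With a toy normalizer that strips comments and rewrites - bullets to *, fixing one typo in the last paragraph leaves the comment and the - bullets untouched. The sketch still rewrites the spacing between blocks to a single blank line and cannot preserve anything inside a block that was edited, which is part of why the approach feels fragile.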

Witoso commented 1 month ago

Hi! This is very unlikely to happen, I think, and creating something like this would be a huge task. Everything the editor parses is represented in an internal model structure. We don't have em or i; we have an italic attribute on a text node. All of the features operate on this abstract structure, and the output is just a translation of it into the desired format.
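To make that concrete, here is a toy sketch of the idea. It is not CKEditor 5's actual model or conversion API; `TextNode`, `parseInline`, and `serializeInline` are hypothetical names used only for illustration. Both <em> and <i> collapse into the same italic attribute on the way in, so the output step has nothing left to tell them apart.

```typescript
// Toy stand-in for the editor's model: text runs with a boolean attribute.
// Illustration only; this is not CKEditor 5's real model API.
interface TextNode {
  text: string;
  italic: boolean;
}

// Parsing: <em> and <i> both become a text node with italic set to true.
function parseInline(html: string): TextNode[] {
  const nodes: TextNode[] = [];
  const pattern = /<(em|i)>(.*?)<\/\1>|([^<]+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    if (match[1]) {
      nodes.push({ text: match[2], italic: true });
    } else {
      nodes.push({ text: match[3], italic: false });
    }
  }
  return nodes;
}

// Output: the model does not record which tag was used, so one tag is chosen.
function serializeInline(nodes: TextNode[]): string {
  return nodes
    .map((node) => (node.italic ? `<i>${node.text}</i>` : node.text))
    .join("");
}
```

In this sketch, `serializeInline(parseInline("a <em>b</em> c"))` and `serializeInline(parseInline("a <i>b</i> c"))` both give `a <i>b</i> c`.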

Run the CKE normalization on both the original source and editor output

Perform a diff between them to identify blocks that actually changed

Merge those changes back into the original source

This could be one of the solutions, but it would make the getData operation even heavier than it is today. Creating diffing and merging heuristics would also be challenging for sure.

All non-renderable content is lost, including comments, <script>, <meta>, and <style>

Have you tried features like HTML Comments, or Full Page? I'm not sure how they would behave with the markdown output TBH.

Syntax gets changed, e.g. the tag <em> becomes <i>, and Markdown bullets - become *

All formatting and indentation gets lost

Is it a case of always outputting what was input, or a matter of preference? Both the editor API and the markdown output could be configured in some way.

adamerose commented 3 weeks ago

Is it a case of always outputting what was input, or a matter of preference?

The former. For example when using my plugin to just fix a single typo in a README.md, the entire file gets modified in unrelated/destructive ways. Changing - to *, removal of comments, and autoformatting are examples of that.

Have you tried features like HTML Comments, or Full Page? I'm not sure how they would behave with the markdown output TBH.

Markdown comments still get lost with HTML Comments, and Full Page breaks the rendering, causing the entire source to render as a single paragraph element. The GeneralHtmlSupport feature also seems relevant.

This could be one of the solutions, but it would make the getData operation even heavier than it is today. Creating diffing and merging heuristics would also be challenging for sure.

After some more thought it seems like solutions would fall into these categories:

Modify CKE5 fundamentally to render from raw source instead of maintaining its own internal structure.
- I'm assuming this is not viable
Let CKE5 normalize the entire output, and only merge minimum required parts of that back into the input.
- My diffing idea is this
Make CKE5's internal model support all HTML/markdown blocks and properties, so the conversion into CKE5's internal model and back is not lossy.
- From what I can tell it looks like that's how the Full Page and HTML Comment plugins work

And I'm thinking it might make more sense to try improving on the last category instead of the diffing? I had some questions about this...

Could the problem of losing source formatting be solved by checking the leading space in front of each element when parsing input to create the internal model, and then just storing it as a property on the model node similar to how element properties like id, class, and data-* get saved under htmlPAttributes in the Model when the GeneralHtmlSupport plugin is enabled?
Could you similarly store details like the original tag type as an attribute? So <em> and <i> would both still become text with the italic: true attribute, but also have another attribute like tag: "em".
How would you approach getting this to work with markdown as well as HTML?
I noticed that the HTML Comments and Full Page plugins encode the contents and positions of comments and non-renderable HTML elements into the root element. I'm curious why that method was chosen instead of just adding invisible <comment> or <meta> nodes stored in the Model tree alongside other nodes like <paragraph>?
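To illustrate what I mean in the first two questions, here is a toy sketch. It does not use CKEditor 5's real model or the GeneralHtmlSupport API; `sourceTag` and `indent` are hypothetical attributes that exist only to remember the original syntax, and the parser only understands <em> and <i> (everything else is kept as plain text).

```typescript
// Toy model, not CKEditor 5's real API. `sourceTag` and `indent` are
// hypothetical attributes used only to remember the original syntax.
interface TextRun {
  text: string;
  italic: boolean;
  sourceTag?: "em" | "i";
}

interface Block {
  indent: string; // leading whitespace copied from the source line
  runs: TextRun[];
}

function parseLine(line: string): Block {
  const indentMatch = /^\s*/.exec(line);
  const indent = indentMatch ? indentMatch[0] : "";
  const body = line.slice(indent.length);
  const runs: TextRun[] = [];
  // A stray "<" that is not part of <em> or <i> is kept as plain text.
  const pattern = /<(em|i)>(.*?)<\/\1>|([^<]+|<)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(body)) !== null) {
    if (match[1]) {
      // Features would still see only `italic: true`; the tag is extra metadata.
      runs.push({ text: match[2], italic: true, sourceTag: match[1] as "em" | "i" });
    } else {
      runs.push({ text: match[3], italic: false });
    }
  }
  return { indent, runs };
}

function serializeLine(block: Block): string {
  const body = block.runs
    .map((run) => {
      if (!run.italic) return run.text;
      const tag = run.sourceTag || "i"; // fall back to a default for new content
      return `<${tag}>${run.text}</${tag}>`;
    })
    .join("");
  return block.indent + body;
}

function roundTrip(source: string): string {
  return source.split("\n").map(parseLine).map(serializeLine).join("\n");
}
```

In this toy version, unedited lines come back unchanged, including their indentation and the choice of <em> versus <i>, while text created in the editor would fall back to the default tag.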

ckeditor / ckeditor5

Prevent source content from getting entirely replaced by CKEditor5's normalized HTML output? #16834