Closed afbora closed 2 years ago
I noticed that the Sane class convert spaces in unicode 🤷♂️ My brain is burned with Sane/Dom/Parsley classes 🤯
@lukasbestle do you have an idea why spaces are converted?
The conversion most likely happens inside libxml
(PHP DOM
classes) between parsing and conversion back to a string.
Is the converted result incorrect? Or is it just about the unsaved changes loop?
Sounds to me like our setup (e.g. the Dom
class) has issues with encodings - like it receives A in encoding X, processes it and spits out B in encoding Y. This gets passed to the Panel and browser, where automatically it converts somehow to A in X again which is compared to B in Y - it differs, so we have the changes bar. And starting from top endlessly.
I don't know the Dom
class well, but I see some transformations and assumptions about very specific encodings:
https://github.com/getkirby/kirby/blob/develop/src/Toolkit/Dom.php#L73-L75
E.g. what irritates me there is that the comment says ISO-8859-1
is need, but the conversion is actually to HTML-ENTITIES
.
Maybe the cause is located somewhere around there?
E.g. what irritates me there is that the comment says
ISO-8859-1
is need, but the conversion is actually toHTML-ENTITIES
.
libxml
parses the HTML as ISO-8859-1
unless there is a <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
tag in the input HTML (the new shorter HTML5 variant of this is not supported). So unless we inject such a meta tag and remove it manually again, we need to input code in ISO-8859-1
to the loadHTML()
method. And that is done by converting everything to entities. As the entity names only use ASCII characters, the entities are valid in ISO-8859-1
and therefore parsed correctly. If we converted directly to ISO-8859-1
instead of to HTML-ENTITIES
, every UTF-8 char without an equivalent would be converted to a question mark, so content would get lost.
The "inject a meta tag" route would probably be the "better" solution, but also be a hack and I'm not sure if it will cause other issues, e.g. if there's already a meta tag in the input. So I decided for entity escaping as it's also suggested in the comments in the PHP docs.
@afbora I don't fully understand which part of your test case causes which issue. Could you convert your test case into a test that directly uses the Dom
class? I.e. an HTML string that is as short as possible but exhibits the wrong behavior when parsed and printed by the Dom
class.
If you want to load UTF-8 content, this has worked for us pretty well over the years:
$document->loadHTML('<?xml encoding="UTF-8">' . $html);
This is how we use it: https://github.com/hananils/kirby-tree-methods/blob/09d25aae7fadb403984c02282349b387299b4ea9/lib/tree.php#L39
Oh, that looks pretty nice. Thanks, I'll try it out!
@lukasbestle You can test following text on writer field:
FERNOX — jau 57 metus — patikimi, profesionalūs, aplinką tausojantys sprendimai jūsų šildymo sistemai
@lukasbestle FYI, also when you remove the Sane::sanitize()
, works great!
https://github.com/getkirby/kirby/blob/develop/config/fields/writer.php#L32
If I was able to get it right, @nilshoerrmann's suggestion didn't work :((
Saved without sane sanitize and no stucks ✅
<p>FERNOX — jau 57 metus — patikimi, profesionalūs, aplinką tausojantys sprendimai jūsų šildymo sistemai</p>
Saved with sane sanitize and stuck 💥
<p>FERNOX — jau 57 metus — patikimi, profesionalūs, aplinką tausojantys sprendimai jūsų šildymo sistemai</p>
I tried several options with the following result (simple test with just new Dom($string)->toString()
; input = expected result):
$this->doc->loadHTML($code)
--- Expected
+++ Actual
@@ @@
-'<html><body><p>TEST — jūsų šildymo sistemai</p></body></html>
+'<html><body><p>TEST — jūsų šildymo sistemai</p></body></html>
❌ Input is parsed as ISO-8859-1
, which completely destroys the Unicode multibyte sequences.
$this->doc->loadHTML(mb_convert_encoding($code, 'HTML-ENTITIES', 'UTF-8'))
--- Expected
+++ Actual
@@ @@
-'<html><body><p>TEST — jūsų šildymo sistemai</p></body></html>
+'<html><body><p>TEST — jūsų šildymo sistemai</p></body></html>
❌ As the input is encoded as entities, the output also is encoded as entities.
$this->doc->loadHTML('<?xml encoding="UTF-8">' . $code)
--- Expected
+++ Actual
@@ @@
-'<html><body><p>TEST — jūsų šildymo sistemai</p></body></html>
+'<?xml encoding="UTF-8"><html><body><p>TEST — jūsų šildymo sistemai</p></body></html>
❌ The input doesn't contain entities, but weirdly libxml
still converts everything to entities on output. Seems like it exports as ISO-8859-1
even though the XML declaration says the document is UTF-8
.
$this->doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $code)
--- Expected
+++ Actual
@@ @@
-'<html><body><p>TEST — jūsų šildymo sistemai</p></body></html>
+'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body><p>TEST — jūsų šildymo sistemai</p></body></html>
✅ Promising, but we need to remove the added meta tag without removing any other tag that was already in the input.
I'm onto something here. Let's hope it will work. :)
I think the issue is that the writer field's handling of the value/text conflicts with that of the sane class. But I couldn't figure out.
Ok, I'll try to explain the underlying problem a bit to keep track of it here.
As soon as content gets loaded from the server to fill the form, a copy of it is stored in Vuex as original form values.
Whenever a form input updates, the updated value is also stored in Vuex in a second "changes" object. The original values object and the changes object are then compared against each other to find out of something has changed. In this case, the orange bar shows up.
This is the basic logic behind the changes bar.
The Writer initializes ProseMirror under the hood. ProseMirror receives the HTML from the server and turns it into its own JSON document format. In order to create this format, it has to be fed with all kinds of rules about allowed block level elements and inline elements. It will automatically do all the hard work and filter out unwanted shit in the HTML – kill attributes, styles, unwanted tags, etc. It is very strict when performing this task.
That JSON document format can then be turned back into clean HTML again. But that HTML might differ from the original HTML the Writer received. For example, when you give the Writer this:
<p>Some <b>Text</b></p>
The output after the clean-up would be:
<p>Some <strong>Text</strong></p>
This is actually really nice, but it also leads to a big problem.
When the form is loaded, the original value in Vuex would be the string from the server. We already clean up the HTML there with our Sane class and the clean up process is already very very close to what ProseMirror is doing, but it will never be 100% the same unless we'd be able to let ProseMirror run on the server, which is unfortunately not possible.
So the original already differs from the initial value once ProseMirror has been initalized. ProseMirror therefor already sends an input event and the changes object contains a value that's no longer the same than the original. That's why the changes bar appears immediately and can also never be fully removed.
After writing this all down, I might have found a solution in the PR above.
✅
Related https://github.com/getkirby/kirby/pull/3758 PR and still continue unicode characters issue.
You can use following content to reproduce the issue:
Kirby 3.6.0-beta.3