Ayemae / Grawlix-Webcomic-CMS

PHP-based webcomic CMS

Using unicode emoji in blog posts deletes remaining text input? #30

Open ksangwin opened 4 days ago

ksangwin commented 4 days ago

Hey, I just edited a blog post by adding a short message to the start. After the first sentence, I input the Unicode emoji "🫠". It seemed to display correctly in the text field, but after submitting, I found that all text that came after that emoji was deleted (i.e., 99% of my blog post).

I suppose I'm not surprised that blog posts don't support emojis (although it would be a nice quality of life improvement if they did), but I'm a little miffed that I've lost 2 long blog posts this way. I kinda wish Grawlix wouldn't just drop the entire remainder of my post? I would have expected the emoji to display as an unprintable character or as &#129760; while still leaving the rest of the post intact.

At least, I'm assuming it's the emoji that's been the issue. When I view the post and go to edit it again, I see the last character in the Post content is the whitespace before the emoji.

Possibly unrelated, but I've noticed some of my longer blog titles have been getting truncated as well. I'm assuming I'm hitting a character limit? Or maybe I'm typing some special character that it doesn't like? Either way, that hasn't been a great user experience when I go to make an announcement and find half my headline is gone.

Would it be worthwhile to add some kind of preview feature for new blog posts? At least that way, I could see ahead of time if my inputs aren't going to be accepted and I'd have a chance to fix them before the post goes live.

eishiya commented 3 days ago

The problem seems to be that the database is set to use utf8mb3 encoding (it uses "utf8", which is an alias for "utf8mb3"), which stores a maximum of 3 bytes per character, but emoji and other characters outside the Basic Multilingual Plane require 4 bytes. We would need to change the database to use "utf8mb4" encoding instead if we want to support storing emoji in the database. A change to the database means a new firstrun script and a new upgrade script, so it's something we're putting off: it would be a significant breaking change.
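To make the byte counts concrete, here is a minimal sketch in Python (not Grawlix's PHP, and not code from the project). The helper names `utf8_len` and `first_four_byte_char` are made up for illustration. It shows that 🫠 (U+1FAE0) needs 4 bytes in UTF-8, and that cutting a post at the first such character leaves exactly the text up to the whitespace before the emoji, which is consistent with the truncation reported above.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def first_four_byte_char(text: str) -> int:
    """Index of the first character outside the Basic Multilingual Plane.

    Characters above U+FFFF need 4 bytes in UTF-8 and therefore do not fit
    in a 3-byte-per-character encoding. Returns -1 if there are none.
    """
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return i
    return -1


post = "Hello! \U0001FAE0 The rest of the post."

print(utf8_len("\u00e9"))      # 2  (é)
print(utf8_len("\u20ac"))      # 3  (€, still inside the BMP)
print(utf8_len("\U0001FAE0"))  # 4  (🫠, outside the BMP)

idx = first_four_byte_char(post)
print(repr(post[:idx]))        # 'Hello! '
```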

An alternative fix that doesn't require database changes would be to have Grawlix replace 4-byte characters with their corresponding HTML entities, e.g. &#129760; for 🫠 as you suggested. This would be pretty ugly when you want to edit the post (unless we resolve HTML entities back to characters, which is a whole other can of worms), but better than breaking things. On the other hand, it adds a lot of string processing that shouldn't really be necessary, and Grawlix already applies HTML input sanitisation...
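As a rough illustration of that entity-replacement idea, here is a sketch in Python rather than Grawlix's PHP. The function name `escape_astral` is hypothetical and nothing like it is confirmed to exist in Grawlix. It rewrites only the characters that need 4 bytes in UTF-8 as decimal numeric entities and leaves everything else alone, so the result fits in a 3-byte-per-character column.

```python
def escape_astral(text: str) -> str:
    """Replace characters above U+FFFF with decimal HTML entities.

    Everything inside the Basic Multilingual Plane is passed through
    unchanged, so ordinary accented letters and symbols are not touched.
    """
    return "".join(
        f"&#{ord(ch)};" if ord(ch) > 0xFFFF else ch
        for ch in text
    )


escaped = escape_astral("Melting \U0001FAE0 face, caf\u00e9")
print(escaped)  # Melting &#129760; face, café

# Every remaining character now needs at most 3 bytes in UTF-8.
print(max(len(ch.encode("utf-8")) for ch in escaped))  # 2
```

The reverse direction is the awkward part: a general-purpose unescape would also resolve entities the author typed on purpose (such as `&amp;`), so the edit form could not simply decode everything before showing the post again.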

utf8mb3 is deprecated, so a switch to utf8mb4 will be required eventually anyway. For that reason, I think a change to the database is the best route to take.

In the interim, you can use emoji as HTML entities yourself, BUT they'll be displayed as actual emoji when you edit the blog post, and unless you replace them with HTML entities again every time you edit, they'll get eaten 🙃

Regarding post titles: There is a 64-character limit, as that's the length of the field in the database. This would also require changing the database, so it's something we'd have to save for when we're ready to do a breaking change. Please consider opening an issue for that so we don't forget! In the interim, it would be nice to at least limit the text field in the admin panel to those same 64 characters (not as easy as you might think, again due to encoding issues - the database accepts a maximum of 64 bytes, which isn't the same as 64 characters). And a related tidbit, in case it's of interest: The blog post text is stored as type TEXT, which means it allows a maximum length of 65535 bytes (which again, may mean fewer characters than that).
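To show why a byte limit and a character limit differ, here is a small Python sketch (again, not Grawlix code). The helper `truncate_utf8` is hypothetical, and the 64-byte figure is taken from the description above as an assumption. It trims a string to a byte budget without cutting a multi-byte character in half.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text so its UTF-8 encoding fits in max_bytes.

    Slicing the encoded bytes may cut through a multi-byte character;
    decoding with errors="ignore" drops that incomplete trailing sequence.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


title = "\u00e9" * 40                   # 40 characters, 2 bytes each
print(len(title))                       # 40
print(len(title.encode("utf-8")))       # 80

short = truncate_utf8(title, 64)
print(len(short))                       # 32
print(len(short.encode("utf-8")))       # 64

# A cut that lands mid-character drops the partial character entirely.
odd = truncate_utf8("a" + title, 64)
print(len(odd), len(odd.encode("utf-8")))  # 32 63
```

A 40-character title made of 2-byte characters already exceeds a 64-byte budget, which is why a plain 64-character cap on the admin form's text field would not be enough by itself.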