Open chrisgherbert opened 6 years ago
I think so? I did the same collation on my mysql db, but I haven't noticed any messed up emojis. Can you give me an example of one of the tweets you noticed was broken? I'd like to check mine.
Sure, here are a couple examples:
author: 4EVER_SUSAN publish_date: 12/9/2015 21:26 content: �Today's the day! My limited edition @maccosmetics lipstick "Von Teese" is now on sale:… https://t.co/EjWdMcyNke
author: 6DRUZ publish_date: 11/3/2016 20:36 content: Live to learn, Mario �� Hardwork really pays off �� @MariaSharapova #inspiringchildren https://t.co/0z09uExQqZ
Really wish these tweets had IDs.
Ah yeah, I've got the same problem...
It might be an issue with the original data. Looks like it's the same in the csv? The first tweet is line 84788 in the first csv. It's just the unicode replacement character there too.
This isn't a MySQL or MariaDB issue, I'm facing the same problem with PostgreSQL. It's malformed characters.
Thanks for the help tracking the issue down @chrisgherbert . The issue there is the emoji is the 👏🏻 Clapping Hands: Light Skin Tone. You can figure that out with a hexdump on the chars from the tweet
That actually byte code f0 9f 91 8f f0 9f 8f bb
.
Looking at the bytes in the stream I see, 20 ef bf bd ef bf bd 20
You can see that's massively different which means this is likely another encoding error.
An important thing if you're struggling parsing these Unicode characters in my fork of the repository they're encoded as U+FFFD � REPLACEMENT CHARACTER
. This is because we can't do anything with these corrupt Unicode characters.
Has anyone gotten the emojis to work when the data is loaded into a MySQL or MariaDB database? I'm using utf8mb4 encoding and utf8mb4_unicode_ci collation, but only a small portion of the emojis are displaying properly for me.