fivethirtyeight / russian-troll-tweets

770 stars 215 forks source link

Emojis in MySQL/MariaDB #20

Open chrisgherbert opened 6 years ago

chrisgherbert commented 6 years ago

Has anyone gotten the emojis to work when the data is loaded into a MySQL or MariaDB database? I'm using utf8mb4 encoding and utf8mb4_unicode_ci collation, but only a small portion of the emojis are displaying properly for me.

georgedumontier commented 6 years ago

I think so? I did the same collation on my mysql db, but I haven't noticed any messed up emojis. Can you give me an example of one of the tweets you noticed was broken? I'd like to check mine.

chrisgherbert commented 6 years ago

Sure, here are a couple examples:

author: 4EVER_SUSAN publish_date: 12/9/2015 21:26 content: �Today's the day! My limited edition @maccosmetics lipstick "Von Teese" is now on sale:… https://t.co/EjWdMcyNke

author: 6DRUZ publish_date: 11/3/2016 20:36 content: Live to learn, Mario �� Hardwork really pays off �� @MariaSharapova #inspiringchildren https://t.co/0z09uExQqZ

Really wish these tweets had IDs.

georgedumontier commented 6 years ago

Ah yeah, I've got the same problem...

It might be an issue with the original data. Looks like it's the same in the csv? The first tweet is line 84788 in the first csv. It's just the unicode replacement character there too.

EvanCarroll commented 6 years ago

This isn't a MySQL or MariaDB issue, I'm facing the same problem with PostgreSQL. It's malformed characters.

EvanCarroll commented 6 years ago

Thanks for the help tracking the issue down @chrisgherbert . The issue there is the emoji is the 👏🏻 Clapping Hands: Light Skin Tone. You can figure that out with a hexdump on the chars from the tweet

That actually byte code f0 9f 91 8f f0 9f 8f bb.

Looking at the bytes in the stream I see, 20 ef bf bd ef bf bd 20 You can see that's massively different which means this is likely another encoding error.

EvanCarroll commented 6 years ago

An important thing if you're struggling parsing these Unicode characters in my fork of the repository they're encoded as U+FFFD � REPLACEMENT CHARACTER. This is because we can't do anything with these corrupt Unicode characters.