BryceStevenWilley / visioning_texts

A D3 project that locally visualizes your messages from Signal or Whatsapp
GNU General Public License v3.0

[BUG] Emoji parsing not working from Facebook json file #28

Open Gusman10000 opened 4 years ago

Gusman10000 commented 4 years ago

Describe the bug: Emojis don't appear to be imported properly when importing a Facebook .json message file; they appear as odd Unicode symbols instead.

To Reproduce

  1. Import a Facebook .json file and view the results. For me, nothing more needed to be done.

Expected behavior: The .json file is imported completely, with all symbols properly identified.

Screenshots: I've never sent this odd symbol (2nd down) in Messenger in my life. "It'd" has also been converted weirdly here: Capture


Additional context: Yesterday I was writing a parser in Python for these .json files to convert them into a WhatsApp text file, and I ran into this exact problem. Initially the code would convert the first byte of an emoji and ignore the rest.

In Python I found the fix for this would be:

    def fixup_string(text):
        return text.encode('latin1').decode('utf8')

I'm not well versed in JavaScript, so I'm not sure what the translation would be. The screenshot below shows a simple Python example of the issue, using content from a message I pulled from my .json, as well as the solution:

Capture2
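For readers without the screenshot, here is a minimal sketch of the round trip. It assumes (based on the symptoms in this report, not on any Facebook documentation) that the export writes each UTF-8 byte of a message as its own `\u00XX` escape; the sample message and variable names are made up for illustration.

```python
import json


def fixup_string(text):
    # Reinterpret each character as one Latin-1 byte, then decode the
    # resulting byte string as UTF-8.
    return text.encode("latin1").decode("utf8")


# Hypothetical message: a curly apostrophe (as in "It'd") plus an emoji.
original = "It\u2019d be great \U0001F600"

# Simulate the assumed export: UTF-8 bytes stored as Latin-1 code points,
# which json.dumps then writes as \u00XX escapes.
exported = json.dumps({"content": original.encode("utf8").decode("latin1")})

loaded = json.loads(exported)["content"]
print(repr(loaded))          # mojibake: several odd characters per symbol
print(fixup_string(loaded))  # the original message again
```

Plain ASCII text passes through `fixup_string` unchanged, so only the non-ASCII characters are affected.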

BryceStevenWilley commented 4 years ago

Interesting. I'll need a bit of time to get my own FB data to test this with. Thanks for the report and details though, it makes this a lot easier to approach.

htkcodes commented 4 years ago

Might be able to take a look at this. Going to try it

Gusman10000 commented 4 years ago

Did a little reading and found this method of encoding / decoding UTF-8 in JavaScript.

I'm not really sure what I'm doing in JavaScript, but I tried adding the decode code to the Facebook import function in math.js in a few spots, and got the emojis showing up by changing:

    'BODY': msg.content

to

    'BODY': decodeURIComponent(escape(msg.content))

This seems to work: the emojis now appear to register (I get the emoji map, and they appear in the word use difference part).

That said, I do get numbers and common symbols showing up in the word use difference, such as 6, 10, *, &, and 6:30. Are any of these meant to be filtered out? If not, then I think this change gets it working.
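As a rough illustration of why `decodeURIComponent(escape(...))` matches the Python fix: `escape` percent-encodes each character below 256 as a single `%XX` byte, and `decodeURIComponent` reads those bytes back as UTF-8. The sketch below mimics that in Python with `urllib.parse`. The function names `fixup_via_percent` and `fixup_or_keep` are hypothetical, and the fallback wrapper is an assumption about what an importer might want, not something taken from the project.

```python
from urllib.parse import quote, unquote


def fixup_string(text):
    return text.encode("latin1").decode("utf8")


def fixup_via_percent(text):
    # Step 1 (like escape): one %XX per character, treating it as a Latin-1 byte.
    percent = quote(text, safe="", encoding="latin1")
    # Step 2 (like decodeURIComponent): read the %XX bytes back as UTF-8.
    return unquote(percent, encoding="utf8", errors="strict")


def fixup_or_keep(text):
    # Hypothetical safety net: leave the text alone if it is not
    # Latin-1-encoded UTF-8 (for example, it is already correct).
    try:
        return fixup_string(text)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


sample = "\u00f0\u009f\u0098\u0080"  # the UTF-8 bytes F0 9F 98 80 of U+1F600
print(fixup_via_percent(sample) == fixup_string(sample))
print(fixup_or_keep("caf\u00e9"))  # not valid UTF-8 bytes, so kept as-is
```

The wrapper matters mostly for mixed input: a string that already contains a real emoji cannot be encoded as Latin-1, so the strict one-liner would raise instead of returning the text.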

BryceStevenWilley commented 4 years ago

Hey Gusman, I added your fix in https://github.com/BryceStevenWilley/visioning_texts/commit/167962724fe92d24c89ddac8b28eb0048ee96fab, thanks for the help! I'll double check that it works with my FB info, and close this issue when it does.

And at the moment, yeah, common numbers and symbols aren't filtered from the word difference. That's being tracked in #10.

BryceStevenWilley commented 4 years ago

Works for me, I've got some emojis in the emoji count!

htkcodes commented 4 years ago

Hi guys, this method doesn't work for all emojis. For example:

'\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00a7'

Hence why I didn't bring it up earlier.

BryceStevenWilley commented 4 years ago

'\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00a7'

Took me too long to figure out that this is 😘😘😍. Sorry for closing too soon @htkcodes.
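A quick way to inspect this string, sketched in Python: the Latin-1 to UTF-8 step itself goes through, but the resulting code points fall in a private-use range rather than the standard emoji block, which would explain why they don't render. The two-entry `LEGACY_TO_STANDARD` table is hypothetical and taken only from the reading of the string given above (😘😘😍); a real fix would need a complete mapping, which this thread does not provide.

```python
import unicodedata

raw = "\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00a7"

# Same fix as before: Latin-1 bytes reinterpreted as UTF-8.
decoded = raw.encode("latin1").decode("utf8")

code_points = [f"U+{ord(ch):X}" for ch in decoded]
categories = [unicodedata.category(ch) for ch in decoded]
print(code_points)  # ['U+FE32C', 'U+FE32C', 'U+FE327']
print(categories)   # ['Co', 'Co', 'Co'], i.e. private use

# Hypothetical mapping, based only on the interpretation in this thread.
LEGACY_TO_STANDARD = {
    0xFE32C: "\U0001F618",  # face blowing a kiss
    0xFE327: "\U0001F60D",  # smiling face with heart-eyes
}
print(decoded.translate(LEGACY_TO_STANDARD))
```

Checking for category `Co` after decoding is one possible way for an importer to flag messages that still need this kind of remapping.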