WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Get titles/text with emojis #1550

Open krysal opened 2 years ago

krysal commented 2 years ago

Current Situation

In the WordPress Photo Directory script, we're getting some HTML content that the lxml.html module is failing to parse correctly because it contains emojis.

Logs

[2022-05-20 16:47:49,268] {wordpress.py:258} WARNING - Can't save the image's title ('<p>Tomato Basil 🌿 Soup</p>
') due to 'utf-8' codec can't decode byte 0xf3 in position 61: unexpected end of data

...

[2022-05-20 16:48:02,313] {wordpress.py:258} WARNING - Can't save the image's title ('<p>Macro 🌸 in spring time.</p>
') due to 'utf-8' codec can't decode byte 0xf3 in position 33: unexpected end of data

...

[2022-05-20 16:49:52,761] {wordpress.py:252} WARNING - Can't save the image's title ('<p>Strawberry 🍓 Apples 🍏 Orange 🍊 Lemon 🍋</p>
') due to 'utf-8' codec can't decode byte 0xf3 in position 53: unexpected end of data

Benefit

It would be nice to save the original text (without HTML tags), which includes emojis 📥

Implementation
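
One possible direction, offered only as a minimal sketch: strip the tags while keeping the text nodes untouched, so emojis survive as ordinary Unicode characters. The example below uses the standard library's `html.parser` instead of `lxml.html`, so it is a swapped-in technique and not the script's current approach. The names `strip_html_tags` and `_TextExtractor` are hypothetical and do not exist in `wordpress.py`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags, emojis included.
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks).strip()


def strip_html_tags(raw_html: str) -> str:
    """Return the text content of an HTML fragment as a plain str."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered after the last tag
    return parser.text()


# Title taken from the first log entry above.
print(strip_html_tags("<p>Tomato Basil 🌿 Soup</p>\n"))  # Tomato Basil 🌿 Soup
```

Because the input is handled as `str` from start to finish, no byte-level decoding step is involved here. Whether the same holds for the data as it reaches the script is an assumption that would need checking against the actual API response.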

PrathamSoneja commented 2 years ago

Hey, I'd like to work on this issue. Can I get started?

krysal commented 2 years ago

@PrathamSoneja Sure! Thanks for working on the Openverse Catalog 😄 I'm assigning the issue to you. You can also contact us in the #openverse channel of the Making WordPress Slack if you have questions.

PrathamSoneja commented 2 years ago

@krysal Sure! Thank you