fdw / rofimoji

Emoji, unicode and general character picker for rofi and rofi-likes

https://github.com/fdw/rofimoji

MIT License

847 stars, 48 forks

Add HTML characters #195

Closed (raboof closed this 1 month ago)

raboof commented 1 month ago

like 'eacute' etc

I didn't know how to run the extractor so that bit is untested (the csv is from an earlier version of the script)

fdw commented 1 month ago

Hey, thank you very much! 🙂

I'll have a closer look later, but a quick question first: the idea is that I get the UTF-8 character from an HTML description, and not the other way round, right?

raboof commented 1 month ago

yes, exactly - these descriptions are formally known as 'Named character references' but that seemed confusing as well, so I shortened it to just 'characters'

fdw commented 1 month ago

Thanks, I like it 🙂

However, I've found https://html.spec.whatwg.org/entities.json , which is supposed to be the same data as JSON. I would prefer to parse that instead of working a large HTML file with a regex. Could you have a look at that, or should I?

Also, many "named character references" seem to be duplicated in that file, once with semicolon and once without. I'd say we strip the semicolon and de-duplicate the rest. What do you think?
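
A minimal sketch of that idea, assuming the JSON maps keys like `&eacute;` to objects with `codepoints` and `characters` fields. The sample data is hand-written in that shape rather than downloaded, and `dedupe_entities` is a hypothetical helper, not the project's extractor:

```python
import json

# Tiny hand-written sample in the shape of entities.json (not the full file).
SAMPLE = r"""
{
  "&eacute;": {"codepoints": [233], "characters": "\u00E9"},
  "&eacute": {"codepoints": [233], "characters": "\u00E9"},
  "&amp;": {"codepoints": [38], "characters": "&"},
  "&amp": {"codepoints": [38], "characters": "&"},
  "&rarr;": {"codepoints": [8594], "characters": "\u2192"}
}
"""


def dedupe_entities(raw):
    """Map each name (without '&' and ';') to its character string, once."""
    result = {}
    for key, value in raw.items():
        # "&eacute;" and "&eacute" both normalise to "eacute".
        name = key.lstrip("&").rstrip(";")
        # setdefault keeps the first entry seen and ignores the duplicate.
        result.setdefault(name, value["characters"])
    return result


entities = dedupe_entities(json.loads(SAMPLE))
print(entities)  # {'eacute': 'é', 'amp': '&', 'rarr': '→'}
```

Stripping first and de-duplicating afterwards means it does not matter which of the two spellings appears first in the file.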

raboof commented 1 month ago

> Thanks, I like it 🙂
>
> However, I've found https://html.spec.whatwg.org/entities.json , which is supposed to be the same data as JSON. I would prefer to parse that instead of working a large HTML file with a regex. Could you have a look at that, or should I?

Ah that's of course much nicer, will do. Do you have a hint on how to invoke the extractor? Python is not my native language ;)

> Also, many "named character references" seem to be duplicated in that file, once with semicolon and once without. I'd say we strip the semicolon and de-duplicate the rest. What do you think?

Yeah in the regex I just ignored the ones with the semicolon - I'll see what the json looks like

fdw commented 1 month ago

> Ah that's of course much nicer, will do. Do you have a hint on how to invoke the extractor? Python is not my native language ;)

I didn't notice; it looks fine as it is 😉 But I'll check it out when you're done and correct any mistakes that might be left.

raboof commented 1 month ago

> > Ah that's of course much nicer, will do. Do you have a hint on how to invoke the extractor? Python is not my native language ;)
>
> I didn't notice; it looks fine as it is 😉 But I'll check it out when you're done and correct any mistakes that might be left.

Interestingly I don't have much trouble with the coding part, and it's mainly the tooling part that keeps tripping me up. Figured out `python -m extractors` works :)

I updated the extractor to use the json, and noticed I wasn't treating multi-codepoint characters correctly yet. I fixed it by tweaking `Character`, so that might be worth a closer look.
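
To illustrate the multi-codepoint case: some named character references expand to more than one code point, so the result has to be treated as a string rather than a single code point. This is only a sketch with a hypothetical helper, `codepoints_to_text`. It does not show how `Character` works in the project.

```python
def codepoints_to_text(codepoints):
    """Join a list of integer code points into one string."""
    return "".join(chr(cp) for cp in codepoints)


# "&eacute;" is a single code point (U+00E9).
single = codepoints_to_text([233])

# "&NotEqualTilde;" is two code points (U+2242 followed by U+0338).
multi = codepoints_to_text([8770, 824])

print(len(single), len(multi))  # 1 2
```

Anything that assumes one code point per entry (for example a single `chr()` call) would lose the second code point for entries like the latter.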

fdw commented 1 month ago

Perfect, thank you 🙂