Closed raboof closed 1 month ago
Hey, thank you very much! 🙂
I'll have a closer look later, but a quick question first: this is meant so that I get the UTF-8 character from an HTML description, and not the other way round, right?
yes, exactly - these descriptions are formally known as 'Named character references' but that seemed confusing as well, so I shortened it to just 'characters'
Thanks, I like it 🙂
However, I've found https://html.spec.whatwg.org/entities.json, which is supposed to be the same data as JSON. I would prefer to parse that instead of working through a large HTML file with a regex. Could you have a look at that, or should I?
Also, many "named character references" seem to be duplicated in that file, once with semicolon and once without. I'd say we strip the semicolon and de-duplicate the rest. What do you think?
> Thanks, I like it 🙂
> However, I've found https://html.spec.whatwg.org/entities.json, which is supposed to be the same data as JSON. I would prefer to parse that instead of working through a large HTML file with a regex. Could you have a look at that, or should I?
Ah that's of course much nicer, will do. Do you have a hint on how to invoke the extractor? Python is not my native language ;)
> Also, many "named character references" seem to be duplicated in that file, once with semicolon and once without. I'd say we strip the semicolon and de-duplicate the rest. What do you think?
Yeah in the regex I just ignored the ones with the semicolon - I'll see what the json looks like
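For reference, the strip-and-de-duplicate idea discussed above could look roughly like the sketch below. It is only an illustration, not the extractor's actual code: the function name `named_characters` and the inline `SAMPLE` excerpt are made up for this example, and the excerpt merely mimics the shape of entities.json (keys with a leading `&` and an optional trailing `;`, values with `codepoints` and `characters`).

```python
import json

# Hypothetical excerpt in the shape of entities.json; the real file is much larger.
SAMPLE = json.loads(r"""
{
  "&AElig": {"codepoints": [198], "characters": "\u00C6"},
  "&AElig;": {"codepoints": [198], "characters": "\u00C6"},
  "&eacute": {"codepoints": [233], "characters": "\u00E9"},
  "&eacute;": {"codepoints": [233], "characters": "\u00E9"},
  "&NotEqualTilde;": {"codepoints": [8770, 824], "characters": "\u2242\u0338"}
}
""")


def named_characters(entities):
    """Map bare names (no '&', no ';') to their characters, de-duplicated."""
    result = {}
    for reference, entry in entities.items():
        # "&eacute;" and "&eacute" both become "eacute".
        name = reference.lstrip("&").rstrip(";")
        previous = result.setdefault(name, entry["characters"])
        # Assumption: both spellings of a name always agree; fail loudly if not.
        if previous != entry["characters"]:
            raise ValueError(f"conflicting characters for {name!r}")
    return result


if __name__ == "__main__":
    for name, characters in named_characters(SAMPLE).items():
        print(name, characters)
```

To run it against the real data, the parsed contents of entities.json would be passed in instead of `SAMPLE`.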
> Ah that's of course much nicer, will do. Do you have a hint on how to invoke the extractor? Python is not my native language ;)
I didn't notice; it looks fine as it is 😉 But I'll check it out when you're done and correct any mistakes that might be left.
Interestingly I don't have much trouble with the coding part, and it's mainly the tooling part that keeps tripping me up. Figured out that `python -m extractors` works :)
I updated the extractor to use the JSON, and noticed I wasn't treating multi-codepoint characters correctly yet. I fixed it by tweaking `Character`, so that might be worth a closer look.
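To illustrate what "multi-codepoint" means here: some named references expand to more than one code point, so the result has to be built from the whole `codepoints` list rather than a single `chr()` call. The sketch below is an assumption about the general approach and says nothing about how `Character` is actually implemented in this project; `from_codepoints` and `describe` are hypothetical helper names.

```python
def from_codepoints(codepoints):
    """Build the string for an entry from its full list of code points."""
    return "".join(chr(cp) for cp in codepoints)


def describe(characters):
    """Render each code point of a string in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in characters)


if __name__ == "__main__":
    # 'eacute' is a single code point; 'NotEqualTilde' is two.
    print(describe(from_codepoints([233])))
    print(describe(from_codepoints([8770, 824])))
```

A consumer that assumes exactly one code point per name would fail on the second case, which is why the two-element list is worth testing explicitly.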
Perfect, thank you 🙂
like 'eacute' etc
I didn't know how to run the extractor so that bit is untested (the csv is from an earlier version of the script)