LiFaytheGoblin / Gender-Equality-in-CS-Publications

Scripts I used for my analysis of gender equality in computer science publications.

Parsing XML file at start is not elegant #1

Open LiFaytheGoblin opened 5 years ago

LiFaytheGoblin commented 5 years ago

It would be more elegant if the script didn't need a dictionary to transform something like &ccedil; into an actual c-cedilla (ç), and the parser instead used the DTD file that comes with the XML file. It might work with something like:

from xml.sax.saxutils import unescape
unescape("&lt; &amp; &gt;")
# returns '< & >'
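
Note that xml.sax.saxutils.unescape only handles &amp;, &lt; and &gt; unless it is given an extra dictionary, so on its own it would not remove the need for the mapping. A possible alternative from the standard library is html.unescape, which knows the named HTML entities such as &ccedil;. This is only a sketch: the helper name decode_entities is made up for illustration, and it assumes the entities in the XML file are the usual HTML ones.

```python
import html


def decode_entities(text):
    """Replace named HTML entities (e.g. &ccedil;) with their characters.

    Hypothetical helper, not part of the existing scripts. It should be
    applied to extracted text values, not to raw XML markup, because it
    also decodes &lt; and &amp; and would break the markup.
    """
    return html.unescape(text)


name = decode_entities("Fran&ccedil;ois")
```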

Or maybe with lxml library: https://lxml.de/validation.html#id1
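
To illustrate the general idea of letting the parser resolve entities from a DTD, here is a small sketch using only the standard library. It is not the lxml approach itself: the sample document and the function first_author are made up, and the entity is declared in an internal DTD subset so that no external file is needed. For the real file with its separate DTD, lxml's etree.XMLParser(load_dtd=True) would probably be the option to try.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document for illustration. The entity is declared
# inside the DOCTYPE, so the parser can resolve it without a dictionary.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE dblp [
  <!ENTITY ccedil "&#231;">
]>
<dblp><author>Fran&ccedil;ois</author></dblp>"""


def first_author(xml_text):
    """Parse the document and return the text of the first author element."""
    root = ET.fromstring(xml_text)
    return root.find("author").text


author = first_author(SAMPLE)
```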

Or with BeautifulSoup: from bs4.dammit import EntitySubstitution, then EntitySubstitution.substitute_html

Tutorial: http://www2.hawaii.edu/~takebaya/cent110/xml_parse/xml_parse.html

More info: https://stackoverflow.com/questions/29799542/how-to-retain-quot-and-apos-while-parsing-xml-using-bs4-python