SuLab / WikidataIntegrator

A Wikidata Python module integrating the MediaWiki API and the Wikidata SPARQL endpoint
MIT License

HTML tags in titles #197

Open Adafede opened 1 year ago

Adafede commented 1 year ago

Hi,

The WD community has pointed out to me several times that I need to sanitize the titles I am importing through WikidataIntegrator (see https://www.wikidata.org/wiki/User_talk:AdrianoRutz). Before trying to fix this downstream, would there be a way to implement better title formatting directly here?

andrawaag commented 1 year ago

Are the HTML tags in the titles introduced by WDI, or do they come from the primary source? Do you have an example and a pointer to the source so I can try to reproduce the issue? I am still on the fence between simply adding a strip function and leaving it to the source to fix.

Adafede commented 1 year ago

An example would be https://doi.org/10.1002/ejoc.201402609 (https://www.wikidata.org/wiki/Q114865259). The HTML tags are already present in the source: https://api.crossref.org/v1/works/10.1002/ejoc.201402609

The problem is that Unicode replacements exist for some of these, in chemistry for example (₁₂₃), so ideally we would add them the way it is done for molecular formulas (https://www.wikidata.org/wiki/Property:P274). But the sub/sup tags are not limited to chemistry, and the current example also has <i> tags, which then lead to a missing space... a nightmare, I know.
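To make the idea concrete, here is a rough sketch of that approach on my side (just an illustration, not anything in WikidataIntegrator; the function name and the digit-only mappings are placeholders):

from bs4 import BeautifulSoup

# Unicode replacements for the characters that commonly appear in formulas
SUB = str.maketrans("0123456789+-", "₀₁₂₃₄₅₆₇₈₉₊₋")
SUP = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")

def clean_title(html_title):
    soup = BeautifulSoup(html_title, "lxml")
    # turn <sub>/<sup> content into Unicode, as for molecular formulas (P274)
    for tag in soup.find_all("sub"):
        tag.replace_with(tag.get_text().translate(SUB))
    for tag in soup.find_all("sup"):
        tag.replace_with(tag.get_text().translate(SUP))
    # every other tag (<i>, <b>, ...) is simply dropped, which is exactly where
    # the missing-space problem appears when the source itself omits the space
    return soup.get_text()

This handles CeCl<sub>3</sub>·7H<sub>2</sub>O → CeCl₃·7H₂O, but it still produces "ofN-Acylhydrazones" for the <i> case.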

I just thought that, given the users of WikidataIntegrator, it would be better to report this upstream than to do a

from bs4 import BeautifulSoup
cleantext = BeautifulSoup("Mild, Stereoselective, and Highly Efficient Synthesis of<i>N</i>-Acylhydrazones Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O in a Broad Range of Solvents", "lxml").text

on my side.

I understand (and probably share) your point of view, but then we should make it clear to the other wiki members...

andrawaag commented 1 year ago

Revisiting this after starting the discussion in the Telegram channel. I wonder whether, following your suggestion, changing https://github.com/SuLab/WikidataIntegrator/blob/main/wikidataintegrator/wdi_helpers/publication.py#L106 to

self.title = BeautifulSoup(title, "lxml").text

would fix this issue. I am currently travelling and want to give this a bit more attention, so I will dive in upon returning to the office by the end of this week.
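For the record, a quick standalone check of what that single line would produce for the example title above (ignoring the surrounding publication.py context, this is just the stripping behaviour):

from bs4 import BeautifulSoup

title = (
    "Mild, Stereoselective, and Highly Efficient Synthesis of<i>N</i>-Acylhydrazones "
    "Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O in a Broad Range of Solvents"
)
print(BeautifulSoup(title, "lxml").text)
# Mild, Stereoselective, and Highly Efficient Synthesis ofN-Acylhydrazones
# Mediated by CeCl3·7H2O in a Broad Range of Solvents

The tags themselves disappear, but the missing space ("ofN") and the plain digits remain, as noted above.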

Adafede commented 1 year ago

As discussed, I was also trying to strip HTML tags from Crossref titles, as requested by WD. Here are some challenging tests:

https://github.com/lotusnprod/lotus-wikidata-interact/blob/cfffc1e7c8f4210f9dd7fd506b14110da1ba1c1c/wdkt/src/test/kotlin/wd/WDArticleTest.kt#L32-L43

So far, I have not been able to pass all of these tests using simple JSoup cleaning.
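For anyone who wants to poke at this from the Python/WDI side, the same idea could be mirrored roughly like this (a pytest skeleton; the expected string is only my assumption of what WD would want for the example from this thread):

import pytest
from bs4 import BeautifulSoup

CASES = [
    (
        "Mild, Stereoselective, and Highly Efficient Synthesis of<i>N</i>-Acylhydrazones "
        "Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O in a Broad Range of Solvents",
        # assumed target: space restored around the italics, sub digits as Unicode
        "Mild, Stereoselective, and Highly Efficient Synthesis of N-Acylhydrazones "
        "Mediated by CeCl₃·7H₂O in a Broad Range of Solvents",
    ),
]

@pytest.mark.parametrize("raw, expected", CASES)
def test_clean_title(raw, expected):
    # plain tag stripping keeps "ofN" and the plain "3"/"2", so this assertion fails,
    # which is exactly the kind of case that makes these tests challenging
    assert BeautifulSoup(raw, "lxml").text == expected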