Open Adafede opened 1 year ago
Are the HTML titles introduced by the wdi, or are they sourced from a primary source? Do you have an example and a pointer to the source to see if I can reproduce the issue? I am still on the fence about simply adding a strip function or leaving it to the source to fix.
An example would be https://doi.org/10.1002/ejoc.201402609 (https://www.wikidata.org/wiki/Q114865259). It is in the source with the HTML tags: https://api.crossref.org/v1/works/10.1002/ejoc.201402609
The problem is that Unicode replacements exist for some cases, for example subscript digits in chemistry (₁₂₃). The ideal solution would be to use them, as is done for molecular formulas (https://www.wikidata.org/wiki/Property:P274), but sub/sup tags are not limited to formulas, and the current example also has <i> tags, which then lead to a missing space... a nightmare, I know.
I just thought that, given the number of WikidataIntegrator users, it would be better to report this upstream than to do a
cleantext = BeautifulSoup("Mild, Stereoselective, and Highly Efficient Synthesis of<i>N</i>-Acylhydrazones Mediated by CeCl<sub>3</sub>·7H<sub>2</sub>O in a Broad Range of Solvents", "lxml").text
on my side.
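To make the sub/sup point above concrete, here is a minimal sketch that strips tags while mapping digits inside <sub>/<sup> to their Unicode equivalents. It deliberately swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies; `TitleCleaner` and `clean_title` are hypothetical names, not part of WikidataIntegrator, and only digits are mapped (anything else inside sub/sup is kept as-is).

```python
from html.parser import HTMLParser

# Digit-only mappings; other characters inside <sub>/<sup> are left unchanged.
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


class TitleCleaner(HTMLParser):
    """Hypothetical helper: drops all tags, converts sub/sup digits to Unicode."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []  # collected text fragments
        self.stack = []  # currently open sub/sup tags

    def handle_starttag(self, tag, attrs):
        if tag in ("sub", "sup"):
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open sub/sup.
        if tag in ("sub", "sup") and self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            table = SUBSCRIPTS if self.stack[-1] == "sub" else SUPERSCRIPTS
            data = data.translate(table)
        self.parts.append(data)


def clean_title(title: str) -> str:
    """Return the title with tags removed (assumed helper, for illustration)."""
    parser = TitleCleaner()
    parser.feed(title)
    parser.close()
    return "".join(parser.parts)
```

On the title from this issue, the sketch keeps the formula readable (CeCl₃·7H₂O) but does not repair the missing space around the <i> tag, since that space is already absent in the Crossref data; "ofN-Acylhydrazones" would still need separate handling.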
I understand (and probably share) your point of view, but we should then make it understandable to other wiki members...
Revisiting this after starting the discussion in the Telegram channel. I wonder whether, following your suggestion, changing https://github.com/SuLab/WikidataIntegrator/blob/main/wikidataintegrator/wdi_helpers/publication.py#L106 to
self.title = BeautifulSoup(title, "lxml").text
would not fix this issue. I am currently travelling and I want to give it a bit more attention, but will dive in upon returning to the office by the end of this week.
As discussed, I was also trying to clean Crossref titles from html tags as requested by WD. Here are some challenging tests:
For now, I could not successfully clean all tests using simple JSoup cleaning.
Hi,
The WD community has told me several times that I need to sanitize the titles I import through WikidataIntegrator (see https://www.wikidata.org/wiki/User_talk:AdrianoRutz). Before I try to fix it downstream, would there be a way to implement this here directly so that titles are formatted better?