WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

OJS title extraction may differ between ":" and " - " #2420

Closed fnielsen closed 3 months ago

fnielsen commented 5 months ago

Describe the bug

OJS title extraction may differ between ":" and " - ". Scraping https://tidsskrift.dk/samfundsokonomen/article/view/143319 should give a title with ":" according to the metadata on the page, but the generated quickstatements contain a form of dash: "Civilsamfund og velfærdsstat – konflikt, samarbejde eller begge dele?"

This is also the case with https://tidsskrift.dk/samfundsokonomen/article/view/143319

This means that the article is not matched, even though it is already in Wikidata, e.g., https://scholia.toolforge.org/work/Q123561283

To Reproduce

Steps to reproduce the behavior:

  1. python -m scholia.scrape.ojs issue-url-to-quickstatements https://tidsskrift.dk/samfundsokonomen/issue/view/10780
  2. LAST Len "Civilsamfund og velfærdsstat – konflikt, samarbejde eller begge dele?"

Expected behavior

Should identify https://scholia.toolforge.org/work/Q123561283
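One way to make the match robust to this difference is to compare titles on a normalized key that treats ": " and a spaced dash as the same title/subtitle separator. The following is a minimal sketch only: `normalize_title` and `same_title` are hypothetical helper names, not functions from Scholia, and the choice of separators to fold together is an assumption.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Return a comparison key for a title (hypothetical helper).

    Assumption: ": " and a dash surrounded by spaces (hyphen, en dash or
    em dash) both act as the title/subtitle separator and can be folded
    into a single space for matching purposes.
    """
    key = unicodedata.normalize("NFKC", title).casefold()
    # Fold ": " and " - " / " \u2013 " / " \u2014 " into one space.
    # Hyphens inside words (no surrounding spaces) are left alone.
    key = re.sub(r"\s*:\s+|\s+[-\u2013\u2014]\s+", " ", key)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", key).strip()


def same_title(a: str, b: str) -> bool:
    """Compare two titles on their normalized keys."""
    return normalize_title(a) == normalize_title(b)


colon_form = "Civilsamfund og velfærdsstat: konflikt, samarbejde eller begge dele?"
dash_form = "Civilsamfund og velfærdsstat \u2013 konflikt, samarbejde eller begge dele?"
print(same_title(colon_form, dash_form))
```

Such a key would only be used for the lookup; the label written to the quickstatements would still need to be one of the two original forms.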

faresh9 commented 5 months ago

How do we know that https://scholia.toolforge.org/work/Q123561283 is identified?

fnielsen commented 5 months ago

How do we know that https://scholia.toolforge.org/work/Q123561283 is identified?

It seems that the title is displayed/set differently on the OJS pages.

faresh9 commented 5 months ago

So should it be "Civilsamfund og velfærdsstat: konflikt, samarbejde eller begge dele?" instead? What should the output look like? And once this is solved, should the identified articles have no entries in the output? I assume that if an article is identified it will be commented out and put at the end of the output, but I don't know.

fnielsen commented 5 months ago

Good question. The PDF has the dash while the meta information has a colon. It is unclear to me how I end up scraping the dash version...
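To see where each title variant comes from, one could list every title candidate found in the article page's HTML together with its source. The sketch below uses only the standard library. It assumes the page exposes the title in `citation_title` / `DC.Title` meta tags and in an `<h1>` heading; the `SAMPLE` markup is made up for illustration and is not copied from tidsskrift.dk, and `TitleCandidates` / `title_candidates` are hypothetical names, not part of `scholia.scrape.ojs`.

```python
from html.parser import HTMLParser


class TitleCandidates(HTMLParser):
    """Collect title strings from selected <meta> tags and <h1> headings."""

    # Assumed meta tag names, compared in lower case.
    META_NAMES = {"citation_title", "dc.title"}

    def __init__(self):
        super().__init__()
        self.candidates = []  # list of (source, title) pairs
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = attrs.get("name") or ""
            content = attrs.get("content")
            if name.lower() in self.META_NAMES and content:
                self.candidates.append((name, content.strip()))
        elif tag == "h1":
            self._in_h1 = True
            self._h1_parts = []

    def handle_data(self, data):
        if self._in_h1:
            self._h1_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            # Join the text pieces and collapse whitespace.
            text = " ".join("".join(self._h1_parts).split())
            if text:
                self.candidates.append(("h1", text))


def title_candidates(html: str):
    """Return (source, title) pairs in document order."""
    parser = TitleCandidates()
    parser.feed(html)
    parser.close()
    return parser.candidates


# Made-up markup showing a colon in the metadata and a dash in the heading.
SAMPLE = """
<html><head>
<meta name="citation_title" content="Title: subtitle">
<meta name="DC.Title" content="Title: subtitle"/>
</head><body><h1 class="page_title">Title \u2013 subtitle</h1></body></html>
"""

for source, title in title_candidates(SAMPLE):
    print(source, repr(title))
```

Running this against the real article page would show which element, if any, carries the dash form; the thread does not establish that.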

faresh9 commented 5 months ago

Is this a problem with all articles that contain a dash or a colon?

fnielsen commented 5 months ago

Is this a problem with all articles that contain a dash or a colon?

No, I do not think so.