WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

CEURWS scraper fails on empty page name #2395

Open fnielsen opened 6 months ago

fnielsen commented 6 months ago

Describe the bug

CEURWS scraper fails on empty page name

To Reproduce

Steps to reproduce the behavior:

  1. python -m scholia.scrape.ceurws proceedings-url-to-quickstatements https://ceur-ws.org/Vol-3592/
  2. Generates 'LAST P304 " "' for https://ceur-ws.org/Vol-3592/paper9.pdf

Expected behavior

No page number should probably be generated here.
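The expected behavior could be sketched roughly as below. This is only an illustration, not the actual Scholia code: the function name `pages_statement` is hypothetical, and the space-separated output format simply mirrors the line quoted in step 2 above. The idea is to strip whitespace from the scraped page value and emit no P304 statement when nothing is left.

```python
def pages_statement(pages):
    """Return a P304 (pages) line, or None if there is no usable page value.

    Hypothetical helper for illustration; not part of scholia.scrape.ceurws.
    """
    if pages is None:
        return None
    # A value such as ' ' (a single space) carries no page information.
    cleaned = pages.strip()
    if not cleaned:
        return None
    # Format assumed from the output quoted in the report: LAST P304 "<pages>"
    return f'LAST P304 "{cleaned}"'


# Example: only the entry with a real page range yields a statement.
scraped = [" ", "1-10", None]
statements = [s for s in map(pages_statement, scraped) if s is not None]
print(statements)
```

With a check like this, a paper such as Vol-3592/paper9 would simply get no page statement instead of an empty one.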

WolfgangFahl commented 6 months ago

http://ceurspt.wikidata.dbis.rwth-aachen.de/Vol-3592/paper9.html

https://www.wikidata.org/wiki/Property:P304 - pages

I also get an empty cvb.pages with our parser ... see http://ceurspt.wikidata.dbis.rwth-aachen.de/Vol-3592/paper9.json or http://ceurspt.wikidata.dbis.rwth-aachen.de/Vol-3592/paper9.yaml

cvb.authors: Fernando Zhapa-Camacho,Robert Hoehndorf
cvb.fail: null
cvb.id: Vol-3592/paper9
cvb.pages: ' '
cvb.pdf_name: paper9.pdf
cvb.title: Evaluating Different Methods for Semantic Reasoning Over Ontologies
cvb.vol_number: 3592
spt.description: null
spt.html_url: /Vol-3592/paper9.html
spt.id: Vol-3592/paper9
spt.pdfUrl: !!python/object/new:linkml_runtime.utils.metamodelcore.URI
  args:
  - https://ceur-ws.org/Vol-3592/paper9.pdf
  state:
    _len: null
    _s: null
spt.session: null
spt.title: Evaluating Different Methods for Semantic Reasoning Over Ontologies
spt.volume:
  acronym: Scholarly QALD 2023 and SemREC 2023
  date: '2023-12-14'
  dblp: null
  description: proceedings from  Scholarly QALD 2023 and SemREC 2023
  k10plus: null
  number: 3592
  title: Joint Proceedings of Scholarly QALD 2023 and SemREC 2023
  url: null
  urn: urn:nbn:de:0074-3592-5
  wikidataid: Q123966140
spt.wikidataid: null
version.cm_url: https://github.com/ceurws/ceur-spt
version.version: 0.0.7

fnielsen commented 6 months ago

I guess we must continuously be attentive to oddities.