cern-sis / issues-scoap3


OUP: check that artid is harvested correctly #240

Open pamfilos opened 7 months ago

pamfilos commented 7 months ago

There are articles (e.g. article 10195) whose publication info has the article_id value in the page_number field.

So in the XML we have: <elocation-id>043C01</elocation-id>

But instead of mapping elocation-id to artid, we map it to the page number field.
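To make the intended mapping concrete, here is a minimal sketch, not the actual hepcrawl code. The function name `extract_publication_info` and the output keys `page_start` and `page_end` are assumptions made for illustration. It also assumes the JATS elements `elocation-id`, `fpage` and `lpage` appear without an XML namespace.

```python
import xml.etree.ElementTree as ET


def extract_publication_info(article_xml: str) -> dict:
    """Hypothetical helper: build publication info from a JATS article string.

    elocation-id goes to artid; only fpage/lpage feed the page fields.
    """
    root = ET.fromstring(article_xml)
    info = {}

    # Map the electronic location identifier to the article ID.
    elocation = root.findtext(".//elocation-id")
    if elocation and elocation.strip():
        info["artid"] = elocation.strip()

    # Page fields (assumed key names) come only from real page elements.
    first_page = root.findtext(".//fpage")
    if first_page and first_page.strip():
        info["page_start"] = first_page.strip()

    last_page = root.findtext(".//lpage")
    if last_page and last_page.strip():
        info["page_end"] = last_page.strip()

    return info


sample = """<article><front><article-meta>
<elocation-id>043C01</elocation-id>
</article-meta></front></article>"""

result = extract_publication_info(sample)
print(result)
```

With this separation, a record that only carries an elocation-id ends up with an artid and no page fields at all, which is the behaviour this issue asks for.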

ErnestaP commented 7 months ago

Do we want to do something with this article? Errors in older articles are hard to trace. Either the way article values are parsed changed over time (sadly, I sometimes had to adapt the code to those changes), or the code itself was not written correctly. This article was updated in 2018, so the page_start issue would have had to be solved at that time. There is no page_start in the current OUP parser: https://github.com/SCOAP3/hepcrawl/blob/master/hepcrawl/extractors/oup_parser.py

The same problem also appears in another article from the same harvesting period: https://repo.scoap3.org/records/10194

I don't think it makes sense to try to understand the parsing logic from before 2020, or even 2021. There were (and still are) many bugs and mistakes in the code that are hard to catch. It is crucial to have new (from 2021) articles parsed correctly.

https://repo.scoap3.org/records/60075 (2023)
https://repo.scoap3.org/records/75414 (2022)
https://repo.scoap3.org/records/68393 (2021)
https://repo.scoap3.org/records/68393 (2020)

Also, the publisher has a rather odd value: Oxford University Press/Physical Society of Japan, :

Screenshot 2023-11-15 at 12 16 30

Now we have a mapping for it: https://github.com/SCOAP3/scoap3-next/blob/master/scoap3/config.py#L568