Open pamfilos opened 7 months ago
Do we want to do something with this article? Errors in older articles are hard to trace: either the paths used to parse article values changed over time (sadly, I sometimes had to adapt the code to those changes), or the code itself was not written correctly. This article was updated in 2018, so the `page_start` issue would have had to be solved back then. There is no `page_start` in the current OUP parser: https://github.com/SCOAP3/hepcrawl/blob/master/hepcrawl/extractors/oup_parser.py
We can see the same problem in another article from the same harvesting period: https://repo.scoap3.org/records/10194. I see little point in trying to understand the parsing logic from before 2020, or even 2021. The code had (and still has) many bugs and mistakes that are hard to catch. What is crucial is that new articles (from 2021 onward) are parsed correctly. https://repo.scoap3.org/records/60075 (2023), https://repo.scoap3.org/records/75414 (2022), https://repo.scoap3.org/records/68393 (2021), https://repo.scoap3.org/records/68393 (2020)
Also, the publisher has a rather odd value: Oxford University Press/Physical Society of Japan.
Now we have a mapping for it: https://github.com/SCOAP3/scoap3-next/blob/master/scoap3/config.py#L568
There are articles (e.g. article 10195) that have the `article_id` value in the `page_number` field of the publication info. So in the XML we have:
<elocation-id>043C01</elocation-id>
But instead of mapping `elocation-id` to `artid`, we map it to the page number field.
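
A minimal sketch of the intended mapping, for illustration only. It is not the hepcrawl implementation: the function name `extract_publication_info`, the `<article-meta>` wrapper in the sample, and the use of `<fpage>` as the source of `page_start` are assumptions; only `<elocation-id>`, `artid` and `page_start` come from this thread.

```python
import xml.etree.ElementTree as ET


def extract_publication_info(article_meta_xml):
    """Hypothetical helper: build publication info from a JATS-like fragment."""
    root = ET.fromstring(article_meta_xml)
    info = {}

    # An electronic location ID is an article identifier, not a page number.
    elocation_id = root.findtext(".//elocation-id")
    if elocation_id:
        info["artid"] = elocation_id.strip()

    # Assumption: only a real first-page element should fill page_start.
    fpage = root.findtext(".//fpage")
    if fpage:
        info["page_start"] = fpage.strip()

    return info


# Sample fragment wrapping the element quoted in this issue.
sample = "<article-meta><elocation-id>043C01</elocation-id></article-meta>"
result = extract_publication_info(sample)
print(result)
```

With this separation, a record that only carries an `<elocation-id>` ends up with an `artid` and no page fields at all, instead of a page number that is really an article ID.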