cern-sis / issues-scoap3

APS articles not harvested in the repo #259

Closed agentilb closed 5 months ago

agentilb commented 6 months ago

It seems there was the same problem as for Inspire with APS. We need to reharvest all those articles. (see attached).

APS HEP November 2023 Articles Missing from CERN Repository 12042023[57].xlsx

ErnestaP commented 6 months ago

Hi @agentilb, any update on this? :)

ErnestaP commented 6 months ago

I harvested articles from 2023-11-01 to 2023-11-30 on the 6th of Dec

agentilb commented 6 months ago

Hi @ErnestaP,

It seems that 3 articles from the list are still missing:

10.1103/PhysRevD.108.092007
10.1103/PhysRevLett.131.221401
10.1103/PhysRevLett.131.221802

Could you check why?

ErnestaP commented 6 months ago
  1. 10.1103/PhysRevD.108.092007: it was halted because of duplicated affiliations (the fix was implemented the day after), so I will harvest it again: https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fflt0_21%3D2&id=04c54912-9424-11ee-8202-526712898810
  2. 10.1103/PhysRevLett.131.221401: can you please verify the date? The date in the xlsx file is 2023-11-27; however, the article is not in the API: https://harvest.aps.org/v2/journals/articles?from=2023-11-27&until=2023-11-27
  3. 10.1103/PhysRevLett.131.221802: going by the date in the xlsx file (2023-11-29), this article is not in the API either: https://harvest.aps.org/v2/journals/articles?from=2023-11-29&until=2023-11-29

agentilb commented 6 months ago

For 2 and 3, the dates are the publication dates, and they seem to be correct. Is it possible to check the API for the days after, to check if the articles are not there?
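A minimal sketch of how the following days could be checked, assuming the `from`/`until` query parameters behave as in the links quoted above. The helper name `aps_query_urls` and the `extra_days` parameter are made up for illustration; the sketch only builds the URLs and does not call the API.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Endpoint as it appears in the links quoted in this thread.
APS_ARTICLES_URL = "https://harvest.aps.org/v2/journals/articles"


def aps_query_urls(published: date, extra_days: int = 3) -> list[str]:
    """Build one single-day query URL for the publication date and each following day.

    Hypothetical helper: it only formats URLs, it does not fetch anything.
    """
    urls = []
    for offset in range(extra_days + 1):
        day = (published + timedelta(days=offset)).isoformat()
        query = urlencode({"from": day, "until": day})
        urls.append(f"{APS_ARTICLES_URL}?{query}")
    return urls


# Publication date of 10.1103/PhysRevLett.131.221401 according to the xlsx file.
for url in aps_query_urls(date(2023, 11, 27), extra_days=2):
    print(url)
```

Each printed URL can then be opened by hand to see whether the DOI shows up on a later day than its publication date.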

ErnestaP commented 6 months ago

sure, I will keep you informed :)

agentilb commented 6 months ago

Could you also try to reharvest these articles? They had the duplicate affiliation issue, but APS should have corrected them by now.

10.1103/PhysRevLett.131.091802 -> 29 August 2023
10.1103/PhysRevLett.131.071901 -> 14 August 2023
10.1103/PhysRevLett.131.091901 -> 29 August 2023
10.1103/PhysRevD.108.012021 -> 28 July 2023
10.1103/PhysRevD.108.012023 -> 28 July 2023
10.1103/PhysRevLett.131.111802 -> 13 September 2023
10.1103/PhysRevD.108.012007 -> 14 July 2023
10.1103/PhysRevD.108.012008 -> 13 July 2023

(I have added the publication date).

Thanks in advance!

ErnestaP commented 6 months ago

10.1103/PhysRevD.108.092007 - in the repo: https://repo.scoap3.org/records/82263

10.1103/PhysRevLett.131.221401 - in the repo: https://repo.scoap3.org/records/82262

10.1103/PhysRevLett.131.221802 - I was not able to find it :/ I will try again after checking the other articles.

ErnestaP commented 6 months ago

Halted:

10.1103/PhysRevLett.131.111802 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D1%26flt0_21%3D2&id=1a802c32-9e69-11ee-a824-029ec3c926e4

10.1103/PhysRevD.108.012007 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D2%26flt0_21%3D2&id=199c756e-9e69-11ee-89d9-8625598c2545

10.1103/PhysRevD.108.012008 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D2%26flt0_21%3D2&id=1826b21c-9e69-11ee-aaec-663d42099f7a

I was not able to find:

10.1103/PhysRevLett.131.071901 -> 14 August 2023
10.1103/PhysRevLett.131.091901 -> 29 August 2023
10.1103/PhysRevD.108.012021 -> 28 July 2023
10.1103/PhysRevD.108.012023 -> 28 July 2023
10.1103/PhysRevLett.131.111802 -> 13 September 2023

agentilb commented 6 months ago

The first 3 articles are now halted because of the arXiv category, together with the 250 others that were just reharvested. Do you think it will be possible to clean this up? This time, I cannot do it manually... It is weird, though, that this problem seems to appear mostly for APS articles.

agentilb commented 6 months ago

The ones you cannot find were actually already halted. Does that help to find them again?

10.1103/PhysRevLett.131.071901 -> 14 August 2023 (2023-10-21 00:30:24.429612)
10.1103/PhysRevLett.131.091901 -> 29 August 2023 (2023-10-21 00:30:24.224975)
10.1103/PhysRevD.108.012021 -> 28 July 2023 (2023-09-22 00:30:18.400387)
10.1103/PhysRevD.108.012023 -> 28 July 2023 (2023-09-22 00:30:17.894814)
10.1103/PhysRevLett.131.091802 -> 29 August 2023 (2023-10-21 00:30:24.771379)

ErnestaP commented 6 months ago

can you please add the links to the halted articles? Maybe they can be fixed manually?

ErnestaP commented 6 months ago

Yes, the halted articles are all APS. I had to harvest by the date range of the articles you sent me before, from July 13 to September 13; that's why there are so many. I cannot do much about them now, I am afraid.
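For illustration, the harvest window follows directly from the publication dates listed earlier in the thread. This is only a sketch: the dates are copied from that list and rewritten in ISO format, and `harvest_window` is a hypothetical helper, not part of the harvesting code.

```python
from datetime import date

# DOIs and publication dates from the list posted earlier in this thread,
# rewritten in ISO format.
ARTICLES = {
    "10.1103/PhysRevLett.131.091802": "2023-08-29",
    "10.1103/PhysRevLett.131.071901": "2023-08-14",
    "10.1103/PhysRevLett.131.091901": "2023-08-29",
    "10.1103/PhysRevD.108.012021": "2023-07-28",
    "10.1103/PhysRevD.108.012023": "2023-07-28",
    "10.1103/PhysRevLett.131.111802": "2023-09-13",
    "10.1103/PhysRevD.108.012007": "2023-07-14",
    "10.1103/PhysRevD.108.012008": "2023-07-13",
}


def harvest_window(articles: dict[str, str]) -> tuple[date, date]:
    """Return the earliest and latest publication date among the given articles."""
    dates = [date.fromisoformat(value) for value in articles.values()]
    return min(dates), max(dates)


start, end = harvest_window(ARTICLES)
print(start.isoformat(), end.isoformat())  # 2023-07-13 2023-09-13
```

Harvesting by that single window picks up every APS article published in those two months, not only the eight listed, which is where the extra halted workflows come from.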

agentilb commented 6 months ago

I have tried to modify this one:

10.1103/PhysRevLett.131.091802 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=06b41ce8-6fa9-11ee-9bfc-6280630c062e

Now this is in ERROR mode...

Here are the other links:

10.1103/PhysRevD.108.012023 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=349ff9b4-58df-11ee-b586-aa2b8beba377

10.1103/PhysRevD.108.012021 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=34eced00-58df-11ee-963b-e28678992ea2

10.1103/PhysRevLett.131.091901 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=066044ba-6fa9-11ee-aa90-f67e7dd5bb4f

10.1103/PhysRevLett.131.071901 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=067ff328-6fa9-11ee-b33e-eeb6bc60d2db

agentilb commented 6 months ago

> Yes, the halted articles are all APS. I had to harvest by the date range of the articles you sent me before, from July 13 to September 13; that's why there are so many. I cannot do much about them now, I am afraid.

Most of those were already ok in the repo. Can we clean the halted records that were already in the repo?

ErnestaP commented 6 months ago

Looks like the issue is always with the same author:

10.1103/PhysRevLett.131.091802 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=06b41ce8-6fa9-11ee-9bfc-6280630c062e --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevLett.131.091802

10.1103/PhysRevD.108.012023 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=349ff9b4-58df-11ee-b586-aa2b8beba377 --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevD.108.012023

10.1103/PhysRevD.108.012021 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=34eced00-58df-11ee-963b-e28678992ea2 --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevD.108.012021

10.1103/PhysRevLett.131.091901 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=066044ba-6fa9-11ee-aa90-f67e7dd5bb4f --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevLett.131.091901

10.1103/PhysRevLett.131.071901 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=067ff328-6fa9-11ee-b33e-eeb6bc60d2db --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevLett.131.071901
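A rough sketch of how such authors could be spotted in the harvested JSON before the workflow halts. The field names `authors` and `affiliationIds` are assumptions about the APS payload (the thread only shows that the author object lacks an affiliation ID), and the second author in the sample is invented for contrast.

```python
def authors_without_affiliation(article: dict, key: str = "affiliationIds") -> list[str]:
    """Return the names of authors whose entry has no (or an empty) affiliation list.

    `key` and the top-level "authors" field are assumed names, not confirmed
    against the APS API.
    """
    return [
        author.get("name", "<unnamed>")
        for author in article.get("authors", [])
        if not author.get(key)
    ]


sample = {
    "authors": [
        # Author object as quoted in this thread: no affiliation ID at all.
        {"type": "Person", "name": "A. Bizzeti", "firstname": "A.", "surname": "Bizzeti"},
        # Hypothetical author with an affiliation, for contrast.
        {"type": "Person", "name": "B. Example", "affiliationIds": ["a1"]},
    ]
}

print(authors_without_affiliation(sample))  # ['A. Bizzeti']
```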

ErnestaP commented 6 months ago

the APS Halted articles are cleaned :)

agentilb commented 6 months ago

Thanks a lot Ernesta. It seems there is still this one to be cleaned: https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fflt0_21%3D2&id=154eff4a-9e69-11ee-9651-46aa8dc5dde2 The article is in the repo.

For the other articles listed above, did you try to reharvest them or not?

ErnestaP commented 6 months ago

I think you misunderstood me: none of these articles has an affiliationID for the author {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"}:

10.1103/PhysRevLett.131.091802 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=06b41ce8-6fa9-11ee-9bfc-6280630c062e --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevLett.131.091802

10.1103/PhysRevD.108.012023 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=349ff9b4-58df-11ee-b586-aa2b8beba377 --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevD.108.012023

10.1103/PhysRevD.108.012021 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=34eced00-58df-11ee-963b-e28678992ea2 --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevD.108.012021

10.1103/PhysRevLett.131.091901 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=066044ba-6fa9-11ee-aa90-f67e7dd5bb4f --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevLett.131.091901

10.1103/PhysRevLett.131.071901 https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fpage%3D13%26flt0_21%3D2&id=067ff328-6fa9-11ee-b33e-eeb6bc60d2db --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"} ---original data from APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevLett.131.071901

The one you just gave me has the same issue: https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fflt0_21%3D2&id=154eff4a-9e69-11ee-9651-46aa8dc5dde2

They don't need to be cleaned. APS has to be contacted so that these articles get fixed.

ErnestaP commented 6 months ago

> For the other articles listed above, did you try to reharvest them or not?

No, we found another way: I restarted the workflows of these articles.

agentilb commented 6 months ago

Ah yes, sorry. In this case, it actually means that the author really doesn't have an affiliation. It happens from time to time. We had a case recently, but I'm not sure how we handled it. Can we force the workflow?

agentilb commented 6 months ago

And for this one: https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fflt0_21%3D2&id=154eff4a-9e69-11ee-9651-46aa8dc5dde2

It is already in the repo (with no affiliation for this author): https://repo.scoap3.org/records/80076

ErnestaP commented 6 months ago

Interesting, I cannot tell how this article even appears in the repo. I will ask Harris tomorrow, maybe he has an idea

ErnestaP commented 6 months ago

I found a few more from the task https://github.com/cern-sis/issues-scoap3/issues/187. I restarted the workflows for these articles; they were in an error state because of duplicated affiliations. The duplication error is fixed, but a missing-affiliation error is now present. Same issue, same author --- there is no affiliationID: {"type":"Person","name":"A. Bizzeti","firstname":"A.","surname":"Bizzeti"}

https://repo.scoap3.org/admin/workflow/edit/?id=9be85834-21dd-11ee-93d5-c20792d59997
https://repo.scoap3.org/admin/workflow/edit/?id=1a87662e-2da7-11ee-9614-4ee3c1f0aa14
https://repo.scoap3.org/admin/workflow/edit/?id=19d9aaf2-2da7-11ee-b33e-eeb6bc60d2db

agentilb commented 6 months ago

Can we force the harvest for those articles? There is nothing we can do for authors with no affiliation.

drjova commented 6 months ago

@agentilb this will break the new data model as we don't accept authors without affiliations. I can see 4 options:

  1. Delete the author
  2. Don't accept these articles
  3. Add a placeholder, but this will make a mess of our data
  4. Relax the data model

Let me know what you think.

agentilb commented 6 months ago

1 and 2 are off the table. 3 is an option but, indeed, it is risky for the metadata. That leaves 4. But we should probably still monitor this, to distinguish cases where there is a problem in the metadata from the genuine ones. I'm not sure how to handle it though. I'm sure we already have this in the repo, so the data model allowed it before: https://github.com/cern-sis/issues-scoap3/issues/172

drjova commented 6 months ago

Yes, they were allowed for unknown reasons. Some of the ~90 records are pretty old (2015, 2016, etc.), and my only guess is that the validation step was skipped manually or the schema changed afterwards.

As far as I understand, these cases are valid and we should allow them, right? So I would suggest modifying the (new) data model and adding a compliance check to mark the articles without affiliations. WDYT?

agentilb commented 6 months ago

It is a good idea. It could be a compliance criterion, so we can find these articles and check whether each case is valid or not. But the overall compliance status of the article should be ok even if this criterion is not met; otherwise those articles will never be compliant.
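To make the idea concrete, here is a minimal sketch of a criterion that is reported but does not affect the overall status. Everything in it is hypothetical: the record layout (`doi`, `authors`, `affiliations`), the check names, and the `blocking` flag are illustrative assumptions, not the actual SCOAP3 data model or compliance code.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    blocking: bool  # only blocking checks count towards the overall status


def check_has_doi(record: dict) -> CheckResult:
    # Example of an ordinary, blocking criterion (assumed for illustration).
    return CheckResult("has_doi", passed=bool(record.get("doi")), blocking=True)


def check_author_affiliations(record: dict) -> CheckResult:
    # Non-blocking: flags authors without affiliations for manual review.
    missing = [a for a in record.get("authors", []) if not a.get("affiliations")]
    return CheckResult("authors_have_affiliations", passed=not missing, blocking=False)


def evaluate(record: dict) -> tuple[bool, list[str]]:
    """Return (overall compliance, names of failed non-blocking checks to review)."""
    results = [check_has_doi(record), check_author_affiliations(record)]
    compliant = all(r.passed for r in results if r.blocking)
    to_review = [r.name for r in results if not r.passed and not r.blocking]
    return compliant, to_review


record = {"doi": "10.1103/PhysRevLett.131.091802", "authors": [{"name": "A. Bizzeti"}]}
print(evaluate(record))  # (True, ['authors_have_affiliations'])
```

With this shape, an article whose author genuinely has no affiliation stays compliant overall but still shows up in the list of records to look at.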

drjova commented 6 months ago

ok thanks, @agentilb could you please include it in your feedback for the new system?

@ErnestaP let's skip the validation step for now and allow the record to be created

ErnestaP commented 6 months ago

Some of the articles, after skipping to the next step, jumped back to the double affiliation issue, and when I restarted the workflow they jumped back to the author affiliation issue. I am afraid we need to reharvest them in order to get a single affiliation instead of the duplication, and then skip the validation.

ErnestaP commented 6 months ago

These are in the repo:

10.1103/PhysRevLett.131.091802 https://repo.scoap3.org/records/82317

10.1103/PhysRevLett.131.091901 https://repo.scoap3.org/records/82318

10.1103/PhysRevLett.131.071901 https://repo.scoap3.org/records/82319

10.1103/PhysRevD.108.012023 https://repo.scoap3.org/records/82321

10.1103/PhysRevD.108.012021 https://repo.scoap3.org/records/82322

ErnestaP commented 6 months ago

Also, articles from the previously mentioned task are in the repo:

10.1103/PhysRevD.108.012007 https://repo.scoap3.org/records/82285

10.1103/PhysRevD.108.012008 https://repo.scoap3.org/records/82283

10.1103/PhysRevD.108.012021 https://repo.scoap3.org/records/82322

10.1103/PhysRevD.108.012023 https://repo.scoap3.org/records/82321

ErnestaP commented 6 months ago

Can we close the issue regarding APS?

@agentilb, Elsevier articles without affiliations were found as well, harvested today. Can you please check? https://github.com/cern-sis/issues-scoap3/issues/268

agentilb commented 6 months ago

I realise there is a last article that seems to be missing in the repo: 10.1103/PhysRevLett.131.061901 but the publisher claims it was corrected. Could you please try to re-harvest it as you did for the others? Then, we should be ok with APS!

agentilb commented 6 months ago

It is currently in halted mode: https://repo.scoap3.org/admin/workflow/details/?url=%2Fadmin%2Fworkflow%2F%3Fflt0_21%3D2&id=eda60a28-9423-11ee-b746-2ef1f4d60b68

ErnestaP commented 5 months ago

@agentilb I managed to get the article into the repo: https://repo.scoap3.org/records/82573

agentilb commented 5 months ago

I think everything is in order with this ticket.