cern-sis / issues-scoap3


repo: check record with extra file (without extension) #238

Open pamfilos opened 7 months ago

pamfilos commented 7 months ago

https://repo.scoap3.org/records/17163

ErnestaP commented 7 months ago

It looks like the wrong file was downloaded, and that this was not intentional. Hindawi records come in through the API, which means the publisher did not upload this file; we downloaded it ourselves.

The file without an extension is the HTML page of the article. It can be viewed at the following URL: https://www.hindawi.com/journals/ahep/2016/9258106/ The live page most likely differs from the copy in our repo, since the one we downloaded is from 2019.
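One way to confirm what an extensionless file actually contains is to look at its first bytes. The following is a minimal sketch, not code from the SCOAP3 repo: `sniff_file_kind` is a hypothetical helper, and the signatures it checks (`%PDF-`, `<?xml`, `<!doctype html`, `<html`) are only the common leading markers, so unusual files will come back as `unknown`.

```python
def sniff_file_kind(head: bytes) -> str:
    """Guess the kind of file from its first bytes (hypothetical helper)."""
    # Drop a UTF-8 byte order mark if present, then leading whitespace.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    start = head.lstrip().lower()

    if start.startswith(b"%pdf-"):
        return "pdf"
    if start.startswith(b"<?xml"):
        return "xml"
    if start.startswith(b"<!doctype html") or start.startswith(b"<html"):
        return "html"
    return "unknown"


def sniff_path(path: str, size: int = 512) -> str:
    """Read the first `size` bytes of a file on disk and classify them."""
    with open(path, "rb") as handle:
        return sniff_file_kind(handle.read(size))
```

Running `sniff_path` over the files attached to a record would flag any attachment whose content is an HTML page rather than the expected PDF or XML.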

The URL for the files looks very similar; only the subdomain is different (downloads): https://downloads.hindawi.com/journals/ahep/2016/9258106.pdf

To me, it looks like the URL used to download the files was incorrect: someone used https://www.hindawi.com/journals/ahep/2016/9258106/ instead of

https://downloads.hindawi.com/journals/ahep/2016/9258106.pdf or https://downloads.hindawi.com/journals/ahep/2016/9258106.xml
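The relationship between the two kinds of URL can be sketched as a small function. This is an illustration only: the pattern (swap the host for `downloads.hindawi.com`, drop the trailing slash, append the extension) is inferred from the example URLs in this comment and is not a documented Hindawi API, and `hindawi_download_urls` is a hypothetical name.

```python
from urllib.parse import urlparse

# Assumption: the downloads host serves files at the same path as the
# article landing page, with the file extension appended.
DOWNLOADS_HOST = "downloads.hindawi.com"


def hindawi_download_urls(landing_url: str) -> dict:
    """Map an article landing page URL to its presumed PDF and XML URLs."""
    path = urlparse(landing_url).path.rstrip("/")
    return {
        ext: f"https://{DOWNLOADS_HOST}{path}.{ext}"
        for ext in ("pdf", "xml")
    }


urls = hindawi_download_urls("https://www.hindawi.com/journals/ahep/2016/9258106/")
print(urls["pdf"])
print(urls["xml"])
```

For the record in this issue, the function yields the two `downloads.hindawi.com` URLs listed above.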


When and why? I suspect it happened 6 years ago, in this commit: https://github.com/SCOAP3/scoap3-next/blob/612d69f4dd40aadee6a26c158ad8d4f813e1fc2a/scoap3/modules/workflows/workflows/articles_upload.py#L214-L231

Only later was a step added that builds the correct structure for attaching files: https://github.com/SCOAP3/scoap3-next/blob/b4703326a6041a371c9ab56fa7539709897653ec/scoap3/modules/workflows/workflows/articles_upload.py#L314-L338