cern-sis / issues-scoap3


Add option to harvest article by DOI #277

Open ErnestaP opened 5 months ago

ErnestaP commented 5 months ago

Add the option to harvest and re-harvest an article by DOI. In the old SCOAP3 we quite often face situations where a specific article needs to be harvested or re-harvested, but this option is not supported. The Hindawi and APS APIs offer a way to get an article by DOI. The situation is more complicated for publishers harvested from FTP/SFTP: their articles could be re-harvested by DOI, but not harvested by DOI in the first place, because they arrive inside zip archives; when we unzip them, each article is saved in a separate file.

ErnestaP commented 5 months ago

Details and examples.

APIs for harvesting by DOI (you only need to pass the DOI in the URL):

- Hindawi: https://www.hindawi.com/oai-pmh/oai.aspx?verb=getrecord&identifier=oai:hindawi.com:10.1155/2023/8127604&metadataprefix=oai_dc
- APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevLett.131.231901
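The two API-based publishers can be sketched as simple URL builders. The endpoint shapes come from the examples above; the function names are ours, not part of any existing SCOAP3 code:

```python
from urllib.parse import quote

def hindawi_url(doi: str) -> str:
    """Build the Hindawi OAI-PMH GetRecord URL for a given DOI."""
    return (
        "https://www.hindawi.com/oai-pmh/oai.aspx"
        "?verb=getrecord"
        f"&identifier=oai:hindawi.com:{quote(doi, safe='/')}"
        "&metadataprefix=oai_dc"
    )

def aps_url(doi: str) -> str:
    """Build the APS harvest URL for a given DOI."""
    return f"https://harvest.aps.org/v2/journals/articles/{doi}"
```

Either URL can then be fetched with the harvester's usual HTTP client; no archive handling is needed for these two publishers.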

Elsevier: harvested from SFTP. The files there are zip and tar archives. Harvesting by DOI: IF THE ARTICLE IS ON SFTP, we can read the content of the zip/tar and take only the article we need. This should not be difficult, since Elsevier provides a mapping of where the articles are located inside the zip/tar files. IF THE ARTICLE IS NOT ON SFTP (older zips/tars are deleted), we can re-process articles that we already have in our s3 but that, for some reason, are not in the repo. We need to verify whether the naming of the saved articles reflects (or can reflect) the DOI.
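The "article is still on SFTP" path could look like the sketch below: given the publisher's mapping from DOI to the article's path inside the archive (represented here as a plain dict, which is an assumption about how the parsed mapping would be held in memory), we extract only the one member we need instead of unpacking the whole archive:

```python
import io
import zipfile

def extract_article(archive_bytes: bytes, doi: str, mapping: dict) -> bytes:
    """Return the raw content of a single article from a zip archive.

    `mapping` maps DOI -> member path inside the archive; a KeyError
    means the DOI is not in this archive, so the caller should try the
    next one (or fall back to re-processing from s3).
    """
    member = mapping[doi]
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(member)
```

The same idea works for tar files via `tarfile.TarFile.extractfile`; only the archive-reading call changes.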

OUP: harvested from FTP. The files there are zip archives. They should be deleted from the SFTP after harvesting, because OUP uploads updates under the same names, so the new files would overwrite the old ones with the changes (new articles, updates of previous articles, etc.). Harvesting by DOI: we should re-process articles that we already have in our s3 but that, for some reason, are not in the repo, since the archives are deleted from the SFTP after the first harvest. If the articles were never harvested before, they should still be on the SFTP; if they are not there, ask OUP to upload them.
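The OUP decision flow above reduces to three outcomes depending on where the article can still be found. A minimal sketch (names are hypothetical, not existing SCOAP3 code):

```python
from enum import Enum

class Action(Enum):
    REPROCESS_FROM_S3 = "re-process the copy already stored in s3"
    HARVEST_FROM_SFTP = "harvest the zip still present on the SFTP"
    ASK_PUBLISHER = "ask OUP to re-upload the archive"

def oup_reharvest_action(in_s3: bool, in_sftp: bool) -> Action:
    """Pick the re-harvest path for one DOI per the rules above."""
    if in_s3:
        return Action.REPROCESS_FROM_S3
    if in_sftp:
        return Action.HARVEST_FROM_SFTP
    return Action.ASK_PUBLISHER
```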

IOP: harvested from SFTP. The files there are zip archives. Harvesting by DOI: IF THE ARTICLE IS ON SFTP, then, like Elsevier, IOP also lists the locations of all files in a mapping; this time, however, the mapping is a txt file. We can read the mapping and download only the articles we need. IF THE ARTICLE IS NOT ON SFTP (older zips/tars are deleted), we can re-process articles that we already have in our s3. We need to verify whether the naming of the saved articles reflects (or can reflect) the DOI.
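Reading the IOP txt mapping could be a small parser. The exact layout of the file is not shown in this issue, so the tab-separated `doi<TAB>path-in-archive` format below is purely an assumption for illustration:

```python
def parse_iop_mapping(text: str) -> dict:
    """Parse a txt mapping of DOI -> location inside the zip.

    Assumed format (one entry per line): "doi<TAB>path-in-archive".
    Blank lines are skipped.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        doi, path = line.split("\t", 1)
        mapping[doi] = path
    return mapping

def locate_article(mapping_text: str, doi: str):
    """Return the archive path for a DOI, or None if it is not listed."""
    return parse_iop_mapping(mapping_text).get(doi)
```

Once the path is known, only that archive (and only that member of it) needs to be downloaded, as in the Elsevier case.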

Springer: harvested from SFTP. Harvesting by DOI: Springer doesn't provide any mapping. If we don't have the article at all, we will need to harvest from the Springer SFTP all the zips that are not yet in our s3. If we already have the article in s3 but, for some reason, it is not in the repo, we can re-process it.
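Since Springer has no mapping, the "never harvested" case comes down to a set difference between the archive names on the SFTP and those already in s3. A minimal sketch, assuming we can list archive names on both sides:

```python
def zips_to_fetch(sftp_names: set, s3_names: set) -> set:
    """Archives present on the Springer SFTP but missing from our s3.

    Each of these must be downloaded and unpacked, then the articles
    checked for the wanted DOI, because no DOI-to-archive mapping exists.
    """
    return sftp_names - s3_names
```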