dainst / ariadne-portal

MIT License

ADS Duplicates #185

Open hew503 opened 8 years ago

hew503 commented 8 years ago

There are duplicates within the ArchSearch: ADS catalogue resource. The duplication doesn't seem to be within our metadata, but it's hard to show that, since the upload was so massive it had to be divided into more than a dozen files in MORe. Not everything is duplicated, so perhaps there was overlap when it was divided? That's all I can think of.

Here is an example:

http://portal.ariadne-infrastructure.eu/page/10354002 http://portal.ariadne-infrastructure.eu/page/11460856

Is it possible to search for duplicate Original IDs within the ArchSearch: ADS catalogue resource, and fix it that way?
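One way to check would be to scan all the batch files for repeated identifiers before re-ingesting. A minimal sketch in Python, assuming each export is an XML file whose records carry an `identifier` element (the tag name and the sample data below are assumptions for illustration, not the actual ADS schema):

```python
import xml.etree.ElementTree as ET
from collections import Counter

def collect_ids(xml_text, id_tag="identifier"):
    """Collect every identifier value from one export file (tag name is an assumption)."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(id_tag) if el.text]

def find_duplicates(batches):
    """Return identifiers that appear more than once across all batch files."""
    counts = Counter(i for batch in batches for i in collect_ids(batch))
    return sorted(i for i, n in counts.items() if n > 1)

# Two illustrative batch files with one overlapping record (made-up IDs).
batch_a = "<records><record><identifier>ADS-0001</identifier></record></records>"
batch_b = ("<records><record><identifier>ADS-0001</identifier></record>"
           "<record><identifier>ADS-0002</identifier></record></records>")

print(find_duplicates([batch_a, batch_b]))  # → ['ADS-0001']
```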

cheers

Holly

eafiontzi commented 8 years ago

Hello,

After thorough checking, I found the two XML files that were harvested to create the two different packages containing the two resources you mention. (You can see them here: http://more.locloud.eu:8080/objects/3025/11460856/Archsearch/content and http://more.locloud.eu:8080/objects/2858/10354002/Archsearch/content.) The specific resource exists in both files, which is why you see it twice. I can send you these two files if you want.

hew503 commented 8 years ago

Hi Eleni

Probably the best thing to do is to check with Dimitra-Nefeli on this. My understanding is that we sent the ArchSearch metadata to ATHENA as an XML file, but at 1.3 million records, ATHENA broke the file up into multiple uploads so that MORe could handle it. I'm not sure how that was done, or why it resulted in duplication, but presumably that's where the problem is. I'm not sure whether it's easier for ATHENA to go back to the original XML we sent and try to break it up without causing duplication, or to try to remove the duplicates, but as ADS wasn't involved in the upload, we don't really know how things were done in MORe.

Many thanks

Holly

eafiontzi commented 8 years ago

Hello again,

Unfortunately, Dimitra-Nefeli has left DCU (Leonidas l.papachristopoulos@dcu.gr is in charge of the content now), but she informed me that we received the data already broken into folders. I am forwarding you the corresponding email. Maybe the duplicates were caused during the OAI-PMH harvesting of ALL sets when the folders were created?
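That hypothesis is plausible: in OAI-PMH, a record may belong to more than one set, so harvesting every set and concatenating the results yields one copy per set membership. A small sketch of the effect and a fix that keeps only the first copy per OAI identifier (the set contents and record fields here are hypothetical):

```python
# Hypothetical per-set harvests; the same record can belong to two OAI-PMH sets.
set_a = [{"id": "oai:example:rec-1", "title": "Resource X"}]
set_b = [{"id": "oai:example:rec-1", "title": "Resource X"},
         {"id": "oai:example:rec-2", "title": "Resource Y"}]

def merge_harvests(*harvests):
    """Merge per-set harvests, keeping only the first copy of each OAI identifier."""
    seen, merged = set(), []
    for harvest in harvests:
        for rec in harvest:
            if rec["id"] not in seen:
                seen.add(rec["id"])
                merged.append(rec)
    return merged

# Naive concatenation yields 3 records; deduplicated merge yields 2.
print(len(set_a + set_b))             # → 3
print(len(merge_harvests(set_a, set_b)))  # → 2
```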