Orbis-Cascade-Alliance / harvester

XForms-based OAI-PMH harvester for Orbis Cascade. Metadata are transformed into RDF and posted into a triplestore for access from finding aids.

De-dupe language values #40

Closed adehner closed 7 years ago

adehner commented 7 years ago

After transforming language names to ISO 639-2 codes, please de-dupe the language values within each record.

See harvester specs:

[image: screen shot 2017-04-14 at 4 11 39 pm]

Duplicate language values can be seen in the following sets:

http://content.wwu.edu/oai/oai.php?verb=ListRecords&metadataPrefix=oai_dc&set=walc

http://cedar.wwu.edu/do/oai/?verb=ListRecords&set=publication:klipsun_magazine&metadataPrefix=simple-dublin-core

@jallibunn

ewg118 commented 7 years ago

Does it matter if there's no actual duplication once harvested (only in the preview)?

adehner commented 7 years ago

Yes, the preview should be as accurate as possible.

ewg118 commented 7 years ago

Okay, I have changed the code so that all possible language values (whether in multiple dc:language elements or as multiple labels separated by ; within one or more elements) are normalized first, and duplicates are then weeded out before the unique dcterms:language values are generated.
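
For readers following along, here is a rough Python sketch of the order of operations described above: split, normalize, then de-dupe. It is not the harvester's actual code (which is XForms-based). The `LANGUAGE_CODES` table and the `normalize_languages` function are hypothetical names made up for illustration, the table is deliberately tiny, and silently skipping unrecognized labels is an assumption rather than documented harvester behavior.

```python
# Hypothetical, deliberately tiny lookup table for illustration only;
# a real mapping would need to cover the full ISO 639-2 code list.
LANGUAGE_CODES = {
    "english": "eng",
    "en": "eng",
    "eng": "eng",
    "spanish": "spa",
    "es": "spa",
    "spa": "spa",
}


def normalize_languages(dc_language_values):
    """Return unique ISO 639-2 codes for one record, in first-seen order.

    dc_language_values holds the text of each dc:language element; any
    single element may carry several labels separated by ';'.
    """
    codes = []
    seen = set()
    for value in dc_language_values:
        for label in value.split(";"):
            key = label.strip().lower()
            if not key:
                continue  # ignore empty pieces such as a trailing ';'
            code = LANGUAGE_CODES.get(key)
            if code is None:
                continue  # assumption: unrecognized labels are skipped
            if code not in seen:
                # De-dupe after normalization, so "English" and "eng" collapse.
                seen.add(code)
                codes.append(code)
    return codes


# Three dc:language elements that all reduce to two distinct codes.
print(normalize_languages(["English; Spanish", "eng", "en"]))
```

De-duping after normalization rather than before is the point of the change: "English", "en", and "eng" are distinct strings but map to the same code, so comparing raw values would let duplicates through.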