VTUL / vtechworks

DSpace at Virginia Tech
http://vtechworks.lib.vt.edu

Decide if any collections should not be harvested by DCG #712

Closed alawvt closed 3 years ago

alawvt commented 4 years ago

Extends #710. Decide whether any collections should not be harvested by OCLC DCG for Discovery.

gailmmac commented 4 years ago

Why not harvest everything in VTechWorks?

alawvt commented 4 years ago

That's easy: we are unanimous on harvesting all collections.

alawvt commented 4 years ago

Consider creating a record blocking list to not harvest dc.type=abstract or dc.type=citation (perhaps for certain collections like SANREM).
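
To make the blocking idea concrete, here is a minimal Python sketch of the rule. It is not DCG configuration or any DCG API; the record shape (a dict with `dc.type` and `collection` keys), the function name `is_blocked`, and the sample records are all assumptions made for illustration.

```python
# Hypothetical record shape: each harvested record is a dict with
# "dc.type" and "collection" keys. None of this is DCG's actual API.
BLOCKED_TYPES = {"abstract", "citation"}


def is_blocked(record, blocked_types=BLOCKED_TYPES, only_collections=None):
    """Return True if the record should be skipped during harvest."""
    record_type = record.get("dc.type", "").strip().lower()
    if record_type not in blocked_types:
        return False
    # With no collection restriction, the block applies everywhere.
    if only_collections is None:
        return True
    # Otherwise block only within the named collections (e.g. SANREM).
    return record.get("collection") in only_collections


# Made-up sample records, for illustration only.
records = [
    {"id": "a", "dc.type": "Abstract", "collection": "SANREM"},
    {"id": "b", "dc.type": "Article", "collection": "SANREM"},
    {"id": "c", "dc.type": "Citation", "collection": "Other"},
]

# Restrict the block to SANREM: the citation in "Other" is still kept.
kept_ids = [r["id"] for r in records if not is_blocked(r, only_collections={"SANREM"})]
```

Passing `only_collections=None` would apply the same type block to every collection instead of only the listed ones.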

alawvt commented 4 years ago

We have just learned that the matching DCG uses to prevent duplicate records from being harvested treats the same item mapped to two collections as different records.

One way to minimize duplicate records is to not harvest collections where everything is mapped. Candidates are:

@cecross1 or @pyc1, would you comment on these collections or list other collections where everything is mapped?

Another strategy would be to add an administrative field that notes the mapping of an item, but that would add quite a bit of extra work.
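
The duplicate problem described above can be illustrated with a small Python sketch. It assumes each harvested record carries a stable item identifier that is the same in every collection the item appears in; that field, the collection names, and the function `dedupe_by_identifier` are hypothetical and do not describe how DCG actually matches records.

```python
def dedupe_by_identifier(harvested):
    """Keep the first record seen for each item identifier."""
    seen = set()
    unique = []
    for record in harvested:
        if record["identifier"] in seen:
            continue  # same item already harvested via another collection
        seen.add(record["identifier"])
        unique.append(record)
    return unique


# Made-up harvest: item-1 is owned by one collection and mapped into another.
harvested = [
    {"identifier": "item-1", "collection": "Owning collection"},
    {"identifier": "item-1", "collection": "Mapped collection"},
    {"identifier": "item-2", "collection": "Owning collection"},
]

# Matching on (collection, identifier) sees three distinct records.
per_collection_keys = {(r["collection"], r["identifier"]) for r in harvested}

# Matching on the identifier alone sees two.
unique = dedupe_by_identifier(harvested)
```

Since we cannot change how DCG matches, skipping collections where everything is mapped gets a similar result from the harvesting side.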

cecross1 commented 4 years ago

@alawvt @pyc1 I agree with this list. Nothing else comes to mind at the moment.

kdweeks commented 4 years ago

@alawvt The only things that come to mind offhand are the Administrative and Test collections.

alawvt commented 4 years ago

Our strategy for mapping collections in DCG is to map each top-level community, except the VTechWorks Archive collection, which contains the SWORD collections and the VTechWorks Administration community. To exclude the mapped ETD collections, we can block items whose dc.type is thesis or dissertation from their corresponding top-level communities. This will allow us to avoid having to map all the individual collections in those two top-level communities.
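
As a rough sketch of that strategy in Python: the function `build_harvest_plan`, the plan structure, and the placeholder names "Community A" and "Community B" are assumptions for illustration only. The real setup is done in the DCG interface, and the thread does not name the two communities involved.

```python
def build_harvest_plan(top_level_communities, excluded, blocked_types_by_community):
    """Return one plan entry per harvested top-level community."""
    plan = []
    for community in top_level_communities:
        if community in excluded:
            continue  # e.g. the archive that holds SWORD and administration
        blocked = blocked_types_by_community.get(community, ())
        plan.append({"community": community, "blocked_types": sorted(blocked)})
    return plan


# Placeholder community names; only "VTechWorks Archive" comes from the thread.
communities = ["Community A", "Community B", "VTechWorks Archive"]

plan = build_harvest_plan(
    communities,
    excluded={"VTechWorks Archive"},
    # Block ETD types only where the ETD collections are mapped in.
    blocked_types_by_community={"Community A": {"thesis", "dissertation"}},
)
```

Keeping the type block per community means theses and dissertations are still harvested from wherever they are owned, and are only skipped where they are mapped.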