bibliographic post-processing improvement

Suggestion from Susan to reduce the number of duplicate Works/Expressions

If after processing we end up with

Work Title A and Author B with Date C and URI D, plus associated Expression (Record 1)

Work Title A and Author B with a Date later than Date C, and URI E, plus associated Expression (Record 2)

Then replace URI E with URI D in all triples

And delete the Work with URI E

And so on for any additional Works whose author/title match those of URI D
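The steps above can be sketched roughly as follows. This is only an illustration over plain (subject, predicate, object) tuples, not the actual extraction code: the predicate names (`type`, `title`, `author`, `date`) are hypothetical placeholders, each Work is assumed to have a single author, and dates are assumed to be strings that sort chronologically (e.g. four-digit years or ISO dates). The duplicate Work's own triples are dropped rather than rewritten, so the surviving Work keeps only its original date.

```python
from collections import defaultdict

# Hypothetical predicate names; the real ones depend on the extraction output.
TYPE, TITLE, AUTHOR, DATE = "type", "title", "author", "date"


def merge_duplicate_works(triples):
    """Collapse Works sharing title and author onto the earliest-dated URI."""
    works = {s for s, p, o in triples if p == TYPE and o == "Work"}
    title, author, date = {}, {}, {}
    for s, p, o in triples:
        if s not in works:
            continue
        if p == TITLE:
            title[s] = o
        elif p == AUTHOR:
            author[s] = o
        elif p == DATE:
            date[s] = o

    # Group Work URIs by (title, author).
    groups = defaultdict(list)
    for uri in works:
        if uri in title and uri in author and uri in date:
            groups[(title[uri], author[uri])].append(uri)

    # Map every later Work in a group to the earliest one.
    replace = {}
    for uris in groups.values():
        uris.sort(key=lambda u: date[u])
        for dup in uris[1:]:
            replace[dup] = uris[0]

    merged = set()
    for s, p, o in triples:
        if s in replace:
            continue  # delete the duplicate Work's own triples
        merged.add((s, p, replace.get(o, o)))  # repoint references to it
    return merged
```

With the two records above, an Expression that realized URI E ends up realizing URI D, and no triple mentions URI E any more.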

This will likely cause too few Works to be created in some cases (e.g. poets who repeatedly published volumes titled Poems, which can only be distinguished by year). So we might want to exclude certain titles, such as Poems, Collected Poems, Works, Complete Works, Collected Works, Essays, Collected Essays, Prose Works, Collected Prose. This list is just off the top of my head; grabbing the most frequently recurring words in titles may help us decide on additional ones.
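A minimal sketch of the exclusion idea, in case it helps the discussion. The set below contains only the titles listed above; the function names are made up for illustration, and the word count is a naive whitespace split rather than anything tuned to our data.

```python
from collections import Counter

# Titles from the suggestion above, lower-cased for comparison.
GENERIC_TITLES = {
    "poems", "collected poems", "works", "complete works", "collected works",
    "essays", "collected essays", "prose works", "collected prose",
}


def is_generic(title):
    """True if the title is on the exclusion list (case-insensitive)."""
    return title.strip().lower() in GENERIC_TITLES


def most_common_title_words(titles, n=10):
    """Count words across titles to surface further exclusion candidates."""
    words = Counter()
    for t in titles:
        words.update(w.strip(".,;:!?\"'()").lower() for w in t.split())
    words.pop("", None)  # discard tokens that were punctuation only
    return words.most_common(n)
```

Works whose title passes `is_generic` would simply be skipped by the merge step; the frequency list is meant to be read by a person, not applied automatically.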

Does this seem feasible? (I have to admit that one thing I can't get my head around is how we deal with diffs as the files for these things change.)

Looks more feasible than altering the conversion process since the current scripts are a placeholder until CWRC has its new schema in place instead of MODS.

This will likely be a fairly slow process, since we have so many records. There would also be quite a few triples to delete, as every Work/Expression has title and timespan triples associated with it.

Potentially: I think we could keep the Expression from Record 2 and have it realize the Work from Record 1. But would that be too much duplication, since you'd still have fairly similar Expressions? Expressions might carry different edition information, which could make a difference.

Related questions:

[ ] What if the date is identical?
[ ] Multiple authors listed in one vs the other?
[ ] What happens to genres attached to Record 2? Will they get merged?
[ ] How will this impact Writing extraction, when an entry references a merged record? Doing lookups with every mention of a work would be expensive.

Next steps:

[ ] Sample queries to get at similar Works (to determine how many records this could reduce)
[ ] further discussion about the above questions and results.
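
As a starting point for the first step, here is a rough sketch of the kind of count we'd want, written against plain (uri, title, author) tuples rather than the triple store. The tuple shape and the function name are assumptions for illustration; matching here is exact after trimming and lower-casing, which may be stricter or looser than what we eventually decide on.

```python
from collections import defaultdict


def duplicate_report(works):
    """works: iterable of (uri, title, author) tuples.

    Returns (groups with more than one Work, number of Works a merge would remove).
    """
    groups = defaultdict(list)
    for uri, title, author in works:
        key = (title.strip().lower(), author.strip().lower())
        groups[key].append(uri)
    dupes = {k: v for k, v in groups.items() if len(v) > 1}
    removable = sum(len(v) - 1 for v in dupes.values())
    return dupes, removable
```

The second return value is the number of Works that would disappear if every group were collapsed to one, which is the figure the first checkbox asks for.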

If the date and the publisher are identical then I think we can safely create a single Work and a single Expression for both Record 1 and Record 2.
If there are multiple authors then it is probably safest to have multiple Works and multiple Expressions.
- We should test my assumptions against some examples of this before deciding, but I expect the most likely case here is that the additional author(s) will be in an editor role, or have written an introduction to the other work.
- If that is the case, then it would be best to create a Work for both and to create a link between them to indicate the relationship through the FRBRoo R2 derivative relationship
- The newer work should have the genres of the older work plus the genres of the newer work, but the genres of the older work should be left as is.
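
To make the rules above easier to test against examples, here is one way they could be written down. This is a sketch of my reading of them, not settled policy: records are plain dicts with hypothetical keys (`authors`, `date`, `publisher`, `genres`), the two records are assumed to already match on title, and the returned labels are made up.

```python
def decide(older, newer):
    """Pick an action for two records that already match on title."""
    if set(older["authors"]) != set(newer["authors"]):
        # Keep both Works and link them (e.g. via FRBRoo R2).
        return "link_as_derivative"
    if older["date"] == newer["date"] and older["publisher"] == newer["publisher"]:
        # Same date and publisher: one Work and one Expression.
        return "merge_work_and_expression"
    # Same author, different date or publisher: the original proposal.
    return "merge_work_only"


def inherit_genres(older, newer):
    """Give the newer Work the union of both genre sets; leave the older as is."""
    newer["genres"] = set(older["genres"]) | set(newer["genres"])
```

`inherit_genres` is meant for the linked (derivative) case only; in the merged cases there is just one Work left to carry genres.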
We probably need to discuss the impact on writing extraction. If it will be too costly to do this at the extraction phase, then perhaps this could be better handled with cleanup in RS, based on a report of similar entities. Or by having a phase in which we use VERSD on the subset of bibliographic records that are similar, and then do a find and replace of URIs related to merged entities across the entire dataset.

A question of timing: is it best to do this sooner or to wait on the firming up of the CWRC 2.0 biblio schema?

cwrc / RDF-extraction

bibliographic post-processing improvement #49