cwrc / RDF-extraction


bibliographic post-processing improvement #49

Open alliyya opened 1 year ago

alliyya commented 1 year ago

Suggestion from Susan to reduce the number of duplicate Works/Expressions:

  • If after processing we end up with
    • Work Title A and Author B with Date C and URI D, plus associated Expression (Record 1)
    • Work Title A and Author B with a Date later than Date C, and URI E, plus associated Expression (Record 2)
  • Then replace URI E with URI D in all triples
  • And delete the Work with URI E
  • And so on for any additional Works whose author/title match those of URI D
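The steps above could be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes Works are available as plain dicts with hypothetical keys `uri`, `title`, `author` and `date` (with dates that sort correctly, e.g. years or ISO strings), and that triples are simple `(subject, predicate, object)` tuples. It also assumes one reading of "replace, then delete": triples whose subject is the later Work are dropped, and any remaining reference to it is pointed at the earliest Work.

```python
from collections import defaultdict

def merge_duplicate_works(works, triples):
    """Collapse Works that share title and author onto the earliest one.

    works   -- list of dicts with hypothetical keys: uri, title, author, date
    triples -- iterable of (subject, predicate, object) tuples
    Returns (merged_triples, replacements), where replacements maps each
    dropped Work URI to the URI that replaces it.
    """
    # Group Works by a normalized (title, author) key.
    groups = defaultdict(list)
    for work in works:
        key = (work["title"].strip().lower(), work["author"].strip().lower())
        groups[key].append(work)

    # Within each group, the Work with the earliest date is kept.
    replacements = {}
    for group in groups.values():
        canonical = min(group, key=lambda w: w["date"])
        for work in group:
            if work["uri"] != canonical["uri"]:
                replacements[work["uri"]] = canonical["uri"]

    merged = set()
    for s, p, o in triples:
        if s in replacements:
            # Assumption: the later Work's own triples (title, timespan, ...)
            # are deleted rather than copied onto the kept Work.
            continue
        # Any reference to a dropped Work now points at the kept Work.
        merged.add((s, p, replacements.get(o, o)))
    return merged, replacements
```

With this reading, the Expression from Record 2 survives and ends up pointing at the Work from Record 1, since only triples with the later Work as subject are removed.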

This will likely cause too few Works to be created in some cases (e.g. poets who repeatedly published volumes titled Poems, which can only be distinguished by year). So we might want to exclude certain titles, such as Poems, Collected Poems, Works, Complete Works, Collected Works, Essays, Collected Essays, Prose Works, Collected Prose. Grabbing the most frequently recurring words in titles may help us decide on additional ones; this list is just off the top of my head.
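A minimal sketch of the exclusion idea, assuming titles are available as plain strings. The names `GENERIC_TITLES`, `is_generic_title` and `most_common_title_words` are hypothetical, and the set contains only the titles listed above; the word count is a crude way to surface further candidates for manual review.

```python
import re
from collections import Counter

# Titles listed in this issue as too generic to merge on (starting point only).
GENERIC_TITLES = {
    "poems", "collected poems", "works", "complete works", "collected works",
    "essays", "collected essays", "prose works", "collected prose",
}

def normalize_title(title):
    """Lowercase, collapse whitespace, and trim trailing punctuation."""
    collapsed = re.sub(r"\s+", " ", title).strip()
    return collapsed.strip(".,;:").lower()

def is_generic_title(title):
    """True if the Work should be skipped by the duplicate merge."""
    return normalize_title(title) in GENERIC_TITLES

def most_common_title_words(titles, n=20):
    """Return the n most frequent words across all titles, with counts."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z]+", title.lower()))
    return counts.most_common(n)
```

Works for which `is_generic_title` returns True would simply be left out of the merge step, so each year's Poems stays a separate Work.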

Does this seem feasible? (I have to admit that one thing I can't get my head around is how we deal with diffs as the files for these things change.)

Looks more feasible than altering the conversion process since the current scripts are a placeholder until CWRC has its new schema in place instead of MODS.

This will likely be a bit of a slow process: we have many records, and every Work/Expression has title and timespan triples associated with it, so there would be quite a few triples to delete.

Potentially: I think we could keep the expression from record 2 and have that realize the work from record 1. But would that be too much duplication as you'd still have fairly similar expressions? Expressions might have different edition information associated that might make a difference.

related questions:

Next steps:

SusanBrown commented 1 year ago

A question of timing: is it best to do this sooner or to wait on the firming up of the CWRC 2.0 biblio schema?