gbv / subjects-api

JSKOS Concept Occurrences Provider implementation
https://coli-conc.gbv.de/subjects/
MIT License

Daily updates #24

Closed nichtich closed 1 year ago

nichtich commented 2 years ago

Related to #17, there should also be an update script that can handle partial updates. The update can be a .tsv or .tsv.gz file like the full dump, but it may also include rows with an empty vocabulary (just the PPN) to indicate that a record's subject indexing was removed:

awk -F'\t' '{print $1}' update.tsv | sort -u > ppns # list affected records (one PPN per line)
# TODO: remove rows with PPN in file ppns
awk -F'\t' '$2 != ""' update.tsv > import.tsv # drop rows with PPN only (records without subject indexing)
# TODO: import import.tsv into database without purging database
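The two TODO steps above could be sketched roughly as follows, using Python's built-in sqlite3 module. The table name `subjects`, its columns `ppn`, `voc`, `notation`, and the function name `apply_partial_update` are assumptions made for illustration; the actual database schema is not shown in this thread.

```python
import sqlite3


def apply_partial_update(conn, lines):
    """Apply tab-separated update lines to a (hypothetical) subjects table without purging it."""
    parsed = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # Every PPN mentioned in the update is affected, with or without subjects.
    ppns = {fields[0] for fields in parsed}
    # Only rows with a non-empty vocabulary column carry subject indexing.
    rows = [
        (f[0], f[1], f[2] if len(f) > 2 else "")
        for f in parsed
        if len(f) > 1 and f[1]
    ]
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany("DELETE FROM subjects WHERE ppn = ?", [(p,) for p in ppns])
        conn.executemany(
            "INSERT INTO subjects (ppn, voc, notation) VALUES (?, ?, ?)", rows
        )


# Usage with an in-memory database (assumed schema and sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (ppn TEXT, voc TEXT, notation TEXT)")
conn.executemany(
    "INSERT INTO subjects VALUES (?, ?, ?)",
    [("12345", "ddc", "700"), ("67890", "bk", "11.11"), ("99999", "rvk", "AB 100")],
)
apply_partial_update(conn, ["12345\trvk\tXY 333\n", "12345\tbk\t33.33\n", "67890\n"])
```

Running the delete and the insert in one transaction means a failed import leaves the previous state of the affected records untouched.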

Alternatively, keep a full dump as a file and apply the update to that file to get an updated full dump (this may even be faster, depending on the size of the updates).
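A minimal sketch of the dump-file variant, assuming both files are read as lists of tab-separated lines. The function name `apply_update_to_dump` is hypothetical, and for a large dump the lines would more likely be streamed than held in memory.

```python
def apply_update_to_dump(dump_lines, update_lines):
    """Return the lines of an updated full dump (illustrative helper, not part of the project)."""

    def ppn_of(line):
        return line.split("\t", 1)[0].strip()

    def has_subject(line):
        fields = line.rstrip("\n").split("\t")
        return len(fields) > 1 and fields[1] != ""

    # PPNs touched by the update, including those that only signal removal.
    affected = {ppn_of(line) for line in update_lines if line.strip()}
    # Drop all old rows of affected records, keep everything else.
    kept = [line for line in dump_lines if ppn_of(line) not in affected]
    # Append the new rows; PPN-only rows are removal markers and are not copied.
    added = [line for line in update_lines if has_subject(line)]
    return kept + added


dump = ["11111\tbk\t10.00\n", "12345\tddc\t700\n", "67890\tbk\t11.11\n"]
update = ["12345\trvk\tXY 333\n", "67890\n"]
new_dump = apply_update_to_dump(dump, update)
```

The result is no longer sorted by PPN, but rows of the same PPN stay adjacent as long as both inputs were grouped by PPN.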

Use case: There are daily jobs at the K10plus CBS database that pass updated records to LBS and to the K10plus central Solr index.

nichtich commented 2 years ago

TSV files are always grouped by PPN. The set of rows for each PPN is either in the known format, e.g.:

12345   rvk      XY 333
12345   bk       33.33

resulting in rows [{voc: "rvk", notation: "XY 333"}, {voc: "bk", notation: "33.33"}], or it is just one row with empty voc and notation to only delete the record (rows = []):

12345

See the updateRecord method in the SQLite backend (dev branch), which is meant to be passed this parsed TSV data.
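Because the file is grouped by PPN, the parsing described above can be done in a single pass. The following is only a sketch: `parse_update` is a hypothetical name, and the dictionaries mirror the `voc`/`notation` row shape shown in the example.

```python
from itertools import groupby


def parse_update(lines):
    """Yield (ppn, rows) for each consecutive group of lines sharing a PPN."""
    records = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for ppn, group in groupby(records, key=lambda fields: fields[0]):
        # A PPN-only line contributes no row, so a deleted record yields rows == [].
        rows = [
            {"voc": f[1], "notation": f[2] if len(f) > 2 else ""}
            for f in group
            if len(f) > 1 and f[1]
        ]
        yield ppn, rows


example = ["12345\trvk\tXY 333\n", "12345\tbk\t33.33\n", "67890\n"]
parsed = list(parse_update(example))
```

Each `(ppn, rows)` pair is the kind of value a per-record update method could receive. Note that `groupby` only merges adjacent lines, so it relies on the grouping guarantee stated above.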

stefandesu commented 2 years ago

So the next step would be to add an update script that calls methods in the SQLite backend, and that also allows both partial and full updates? Something like:

# partial update by default
./bin/import update.tsv
# full update with flag
./bin/import --full subjects.tsv

Full updates would clear the whole table instead of deleting records for single PPNs, so we would likely need an additional method in the backend.
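As a rough illustration of the partial/full distinction, here is a Python sketch with argparse. `MemoryBackend`, `clear`, `update_record` and `run_import` are hypothetical stand-ins, not the project's actual backend API; only the `--full` flag is taken from the proposal above.

```python
import argparse


class MemoryBackend:
    """Hypothetical stand-in for the SQLite backend; keeps rows per PPN in a dict."""

    def __init__(self):
        self.records = {}

    def clear(self):
        # The "additional method" a full update would need: drop everything.
        self.records.clear()

    def update_record(self, ppn, rows):
        # Replace the rows of one PPN; an empty list removes the record.
        if rows:
            self.records[ppn] = rows
        else:
            self.records.pop(ppn, None)


def build_parser():
    parser = argparse.ArgumentParser(prog="import")
    parser.add_argument("file", help="TSV file to import")
    parser.add_argument("--full", action="store_true",
                        help="clear all records before importing")
    return parser


def run_import(backend, records, full=False):
    """Import (ppn, rows) pairs; a full import clears the backend first."""
    if full:
        backend.clear()
    for ppn, rows in records:
        backend.update_record(ppn, rows)


args = build_parser().parse_args(["--full", "subjects.tsv"])
```

With this split, the import loop is identical for both modes and only the initial clearing differs.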

It also needs a --modified flag for #25 and has to update the modified metadata in the database.

stefandesu commented 2 years ago

@nichtich I feel like partial imports are not yet 100% clear. My suggestion for the TSV format for partial import would be this:

12345

= delete all records for PPN 12345

12345   rvk

= delete all RVK records for PPN 12345

12345   rvk      XY 333

= add record for PPN 12345 (but do not delete anything)

For example, if the update were to 1) remove the existing DDC record, 2) replace the one existing RVK record, and 3) add an additional BK record, it would look like this:

12345   ddc
12345   rvk
12345   rvk      XY 333
12345   bk       33.33

Or would you prefer to do it differently? I think this would cover all cases, even though removing a single record would mean that all other records for that PPN/vocabulary would need to be listed again. (I think in your case, removing a single record would mean that ALL other records for that PPN, regardless of vocabulary, would need to be listed again.)
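To make the three line forms of this proposal concrete, here is a small Python sketch that maps each line to an operation. The operation names (`delete_all`, `delete_voc`, `add`) and the function `to_operations` are made up for illustration and do not correspond to anything in the code base.

```python
def to_operations(lines):
    """Translate tab-separated update lines into (operation, ...) tuples."""
    ops = []
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        fields += [""] * (3 - len(fields))  # pad missing columns with empty strings
        ppn, voc, notation = fields[:3]
        if not ppn:
            continue  # skip blank lines
        if not voc:
            ops.append(("delete_all", ppn))       # PPN only
        elif not notation:
            ops.append(("delete_voc", ppn, voc))  # PPN and vocabulary
        else:
            ops.append(("add", ppn, voc, notation))
    return ops


update = ["12345\tddc\n", "12345\trvk\n", "12345\trvk\tXY 333\n", "12345\tbk\t33.33\n"]
operations = to_operations(update)
```

Under this reading the order of lines matters: the deletion line for a vocabulary has to come before the rows that re-add records for it.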

stefandesu commented 2 years ago

There's now a basic working implementation of the import script. It will be finished in #27.

nichtich commented 1 year ago

This is not part of the software but of its deployment and configuration, so I am closing this issue.