gbv / k10plus-subjects

Subject analysis of records in K10plus catalogue

Filter out invalid notations #2

Closed nichtich closed 2 years ago

nichtich commented 2 years ago

Notations given in PICA contain errors, so it is better to filter out syntactically invalid notations. Keeping them in the full TSV files (e.g. rvk.tsv) is useful, but further processing should remove them:
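A minimal Python sketch of such a filter, under stated assumptions: the rows are assumed to have two columns (record identifier, notation), which is a guess at the TSV layout and not confirmed here. The function name `filter_valid` and the sample rows are made up for illustration. The pattern is the BK one quoted later in this thread, with the dot escaped.

```python
import re

# BK syntax pattern as quoted in this thread, with the dot escaped.
BK_PATTERN = re.compile(r"^[0-9][0-9]\.[0-9][0-9]$")

def filter_valid(rows, pattern, column=1):
    """Yield only rows whose notation in `column` matches `pattern`.

    The column position is an assumption about the TSV layout.
    """
    for row in rows:
        if len(row) > column and pattern.fullmatch(row[column]):
            yield row

# Invented sample rows: (record id, notation)
rows = [
    ["id1", "54.32"],
    ["id2", "54,32"],  # comma instead of dot: syntactically invalid
    ["id3", "5432"],   # missing dot: syntactically invalid
]
valid = list(filter_valid(rows, BK_PATTERN))
```

The full TSV file stays untouched; only the downstream step sees the filtered rows.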

We may add additional cleanup rules, e.g. removing / from DDC notations or normalizing whitespace.
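The two rules mentioned here could be sketched as follows. `clean_ddc` is a hypothetical helper name, and "normalize whitespace" is interpreted as collapsing runs of whitespace and trimming both ends, which is an assumption.

```python
import re

def clean_ddc(notation):
    """Remove slashes and normalize whitespace in a DDC notation."""
    notation = notation.replace("/", "")
    # Collapse runs of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", notation).strip()

cleaned = [clean_ddc(n) for n in ["616.89/14", "  004 ", "616.8914"]]
```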

nichtich commented 2 years ago

Note that cleanup can produce additional duplicate notations, so the order of processing matters: deduplication should run after cleanup.
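A small illustration of why the order matters, using invented sample notations and hypothetical helpers `clean` and `unique`:

```python
def clean(notation):
    """Drop slashes and collapse/trim whitespace (illustrative rule set)."""
    return " ".join(notation.replace("/", "").split())

def unique(items):
    """Remove duplicates while preserving first-seen order."""
    return list(dict.fromkeys(items))

raw = ["616.89/14", "616.8914", " 616.8914"]

# Deduplicating first misses duplicates that only appear after cleanup.
dedupe_then_clean = [clean(n) for n in unique(raw)]

# Cleaning first lets deduplication catch them.
clean_then_dedupe = unique(clean(n) for n in raw)
```

With these inputs the first variant still contains the same notation three times, while the second reduces it to a single entry.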

nichtich commented 2 years ago

Cleanup may also include looking up whether a notation actually exists, e.g. not every string matching ^[0-9][0-9]\.[0-9][0-9]$ is a valid BK notation. As each vocabulary has its own content and rules, it is best to create a service for each vocabulary.
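One possible shape for such a per-vocabulary check, as an in-memory stand-in for a real service. Everything here is hypothetical: the class `VocabularyChecker`, the registry `checkers`, and the set of known notations, which is a placeholder and not actual BK content.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class VocabularyChecker:
    """Syntax check plus existence lookup for one vocabulary."""
    pattern: re.Pattern
    known: frozenset = frozenset()

    def is_wellformed(self, notation):
        return self.pattern.fullmatch(notation) is not None

    def exists(self, notation):
        # A notation must be well-formed and listed in the vocabulary.
        return self.is_wellformed(notation) and notation in self.known

# One checker per vocabulary; the known set is placeholder data.
checkers = {
    "bk": VocabularyChecker(
        pattern=re.compile(r"^[0-9][0-9]\.[0-9][0-9]$"),
        known=frozenset({"54.32"}),
    ),
}

bk = checkers["bk"]
wellformed_only = bk.is_wellformed("99.99") and not bk.exists("99.99")
```

Keeping the pattern and the lookup together per vocabulary means each one can carry its own rules without affecting the others.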