gbv / k10plus-subjects

Subject analysis of records in K10plus catalogue

Provide cleaned subjects dumps #5

Closed: nichtich closed this issue 2 years ago

nichtich commented 2 years ago

We should provide the generated file subjects.tsv as another research data publication. Requires #2 and #3.

stefandesu commented 2 years ago

This would also make it easier to build an occurrences API on top of that (because the subjects.tsv file could be obtained directly without downloading a 14 GB data set).
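Something like the following would probably be enough for a start, assuming subjects.tsv is tab-separated with PPN, vocabulary and notation columns (the layout implied by the commands below), and with bk 54.65 used only as a placeholder notation:

# occurrences: how many distinct records carry a given notation
awk -F'\t' '$2 == "bk" && $3 == "54.65" {print $1}' subjects.tsv | sort -u | wc -l

# co-occurrences: which other notations are assigned to those records
awk -F'\t' '$2 == "bk" && $3 == "54.65" {print $1}' subjects.tsv | sort -u > ppns.txt
awk -F'\t' 'NR==FNR {ppn[$1]; next} $1 in ppn {print $2 "\t" $3}' ppns.txt subjects.tsv | sort | uniq -c | sort -rn | head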

nichtich commented 2 years ago

The reduced-subjects dump (only notations matching notationPattern) from 2022-06-30 is 1.8 GB, with 23,306,480 records and 77,705,044 notations in total:

# size of the dump file
ls -lh reduced-subjects.tsv | awk '{print $5}'
# number of distinct records (first column; assumes the dump is grouped by record identifier)
awk '{print $1}' reduced-subjects.tsv | uniq | wc -l
# total number of notations (one per line)
wc -l reduced-subjects.tsv

The number of notations per vocabulary is as follows:

$ awk '{print $2}' reduced-subjects.tsv | perl -lnE '$h{$_}++; END{printf "$_\t%9d\n", $h{$_} for keys %h}' | sort -nk2
asb       5340
stw     104118
nlm     129289
ssd     153242
kab     159543
sfb     432141
sdnb   4593798
ddc    9248794
rvk   10172838
bk    13321229
gnd   39384712
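For comparison, the same per-vocabulary counts can be reproduced with plain sort and uniq, which is easier to read but has to sort all 77 million lines first (uniq -c prints the count before the vocabulary name):

$ awk '{print $2}' reduced-subjects.tsv | sort | uniq -c | sort -n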

I'll first do #1 before releasing the data as a dump.

nichtich commented 2 years ago

Done, see https://doi.org/10.5281/zenodo.7016625