Closed nichtich closed 2 years ago
This would also make it easier to build an occurrences API on top of that (because the subjects.tsv
file could be obtained directly without downloading a 14 GB data set).
The reduced-subjects dump (only notations matching `notationPattern`) of 2022-06-30 is 1.8 GB, with 23,306,480 records having 77,705,044 notations in total.
```shell
ls -lh reduced-subjects.tsv | awk '{print $5}'        # file size
awk '{print $1}' reduced-subjects.tsv | uniq | wc -l  # records (file is sorted by record ID)
wc -l reduced-subjects.tsv                            # notations (one per line)
```
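The record and notation counts above can also be obtained in a single pass with awk (a sketch; it assumes, as above, a tab-separated file sorted by record ID in column 1, so a change in column 1 marks a new record):

```shell
# One pass over the TSV: NR counts lines (= notations), and each
# change of column 1 increments the record counter.
awk -F'\t' '
  $1 != prev { records++; prev = $1 }
  END { printf "%d records, %d notations\n", records, NR }
' reduced-subjects.tsv
```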
The number of notations per vocabulary is as follows:
```shell
$ awk '{print $2}' reduced-subjects.tsv | perl -lnE '$h{$_}++; END{printf "$_\t%9d\n", $h{$_} for keys %h}' | sort -nk2
asb       5340
stw     104118
nlm     129289
ssd     153242
kab     159543
sfb     432141
sdnb   4593798
ddc    9248794
rvk   10172838
bk    13321229
gnd   39384712
```
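The same tally can be produced without Perl, using awk alone (a sketch equivalent to the pipeline above; column 2 is assumed to hold the vocabulary notation prefix):

```shell
# Count notations per vocabulary (column 2), print "vocab<TAB>count",
# then sort numerically by count, ascending.
awk -F'\t' '{ count[$2]++ } END { for (v in count) printf "%s\t%9d\n", v, count[v] }' \
  reduced-subjects.tsv | sort -nk2
```

Note that awk's `for (v in count)` iterates in arbitrary order, which is why the trailing `sort` is still needed.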
I'll first do #1 before releasing the data as a dump.
Done, see https://doi.org/10.5281/zenodo.7016625
We should provide the generated file `subjects.tsv` as another research data publication. Requires #2 and #3.