Open dodeeric opened 2 weeks ago
$ jq '.[].url' commons-Category_Engravings_by_Dodeeric-swp.json
$ jq '.[].url' *.json
$ jq '.[].url' *.json | wc -l
1985
A) for commons: scrape by Category
==> list all the categories:
ls | grep commons-Category | cut -c18-170 | cut -d"." -f1 | sed "s/-swp//g" > ../../list-commons-categories.res
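The pipeline above depends on a fixed column offset (cut -c18-170) and cuts each name at its first period, so a category name containing a "." would be truncated. A minimal sketch of the same extraction using shell parameter expansion, assuming the files are named commons-Category_<name>-swp.json; category_names is a hypothetical helper, not something from these notes:

```shell
# Sketch: derive category names from commons-Category_<name>-swp.json file names.
# category_names is a hypothetical helper; the naming pattern is an assumption
# based on the single file name shown in these notes.
category_names() {
    dir="${1:-.}"
    for path in "$dir"/commons-Category_*.json; do
        [ -e "$path" ] || continue          # no match: the glob stays literal, skip it
        name="${path##*/}"                  # drop the directory part
        name="${name#commons-Category_}"    # drop the fixed prefix
        name="${name%.json}"                # drop the extension only
        name="${name%-swp}"                 # drop the trailing -swp marker, if present
        printf '%s\n' "$name"
    done
}
```

Possible usage, run from the directory holding the JSON files: category_names . > ../../list-commons-categories.res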
B) for all the other urls: scrape by URLs
list of urls per type:
==> jq '.[].url' balat.json | cut -d'"' -f2 ===> 230 urls
==> jq '.[].url' belgica.json | cut -d'"' -f2 ===> 52 urls
==> jq '.[].url' www.europeana*.json | cut -d'"' -f2 ===> 234 urls
$ jq '.[].url' balat.json | wc -l
230
$ jq '.[].url' belgica.json | wc -l
52
$ jq '.[].url' www.euro.json | wc -l
234
$ jq '.[].url' commons.json | wc -l
1467
===
Total: 1983 (1985 in the admin list because there are two opac files, which we can omit)
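A quick arithmetic check of that total, as a sketch. The four counts are copied from the wc -l results above; the variable names are only illustrative:

```shell
# Per-source URL counts, copied from the wc -l results in these notes.
balat=230
belgica=52
europeana=234
commons=1467

total=$((balat + belgica + europeana + commons))
echo "Total: $total"                          # Total: 1983

# The admin list reports 1985; the gap should match the two opac files.
admin_total=1985
echo "Difference: $((admin_total - total))"   # Difference: 2
```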
Number of JSON files: 343
Number of web pages: 1985
Number of PDF files: 1
Number of PDF pages: 2
Number of web and pdf pages: 1987
The Chroma vector DB is located on myvm2.edocloud.be:8000.
DB size: 49800.0 KB
-rw-rw-r-- 1 dodeeric dodeeric  4340 Jun 28 14:27 list-commons-categories.res ==> 104 categories
-rw-rw-r-- 1 dodeeric dodeeric 15856 Jun 28 14:32 list-europeana-urls.res
-rw-rw-r-- 1 dodeeric dodeeric  9355 Jun 28 14:31 list-irpa-urls.res
-rw-rw-r-- 1 dodeeric dodeeric  2600 Jun 28 14:31 list-kbr-urls.res
get the urls:
europeana: retrieve url field for each json file
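One possible way to do this, as a sketch rather than the exact command used: jq's -r (raw output) option prints the strings without quotes, so the extra cut -d'"' -f2 step is not needed. extract_urls is a hypothetical helper, and it assumes each file holds a JSON array of objects with a "url" key, as the '.[].url' filter in these notes implies:

```shell
# Sketch: print the url field of every entry in the given JSON files, one per line.
# extract_urls is a hypothetical helper; the array-of-objects layout is an
# assumption based on the '.[].url' filter used in these notes.
extract_urls() {
    # -r emits raw strings, so the surrounding double quotes are not printed.
    jq -r '.[].url' "$@"
}
```

Possible usage: extract_urls www.europeana*.json > list-europeana-urls.res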