DivePlus / diveplusdata

repository for datasets for the DIVE Plus demonstrator

diveplusdata

Repository of datasets for the DIVE Plus demonstrator. See http://diveproject.beeldengeluid.nl. We currently have partial data from four collections and three vocabularies, all in RDF Turtle.

Open Images

This consists of the video collection itself as well as the GTAA thesaurus.

Open Images collection: originally from openbeelden.nl (3000 videos), published under a Creative Commons Attribution-ShareAlike license (CC-by-SA). The subset in the DIVE triple store is all videos from Open Images, as found in an existing RDF conversion. For 510 videos, crowdsourced enrichments have been added that relate the videos to entities, events, persons, and places.

GTAA (Gemeenschappelijke Thesaurus Audiovisuele Archieven, the Common Thesaurus for Audiovisual Archives): originally from http://gtaa.beeldengeluid.nl. Entities from the KB collection are also matched to this thesaurus.

All Open Images content can be found in the oi_dive folder.

Radio bulletins from KB

Originally from http://www.delpher.nl/nl/radiobulletins. The collection contains 1.5 million items. ANP has made the objects (JPGs, OCRs, ALTOs) in this set available under a CC-BY-NC-ND 3.0 license.

These are all news bulletins between 1937 (start of collection) and May 1955 (when my script crashed initially).

OLD: The subset in the DIVE triple store is 2210 digitized typescripts (radio news scripts, to be read during news broadcasts) from the period 1937-1984. These were chosen to match the Openbeelden subset by re-using terms found there for the search request.

Amsterdam Museum

Originally from https://bitbucket.org/biktorrr/amlod (thousands of objects), which was derived from the public collection website (www.amsterdammuseum.nl/collectie/). The data is published under a Creative Commons Attribution license (CC-by).

The collection and vocabularies are found in the am_dive folder.

Collection subset: all objects from the data dump are now included. In total, there are 73,447 objects with 5.7 million triples.

Amsterdam Museum Thesaurus: the thesaurus used to annotate AM objects. It is (partially) aligned with GTAA.

Tropenmuseum

Originally from the old 2004 data we converted for the eculture project (eculture.multimedian.nl). This consists of some 70,000 objects. It is most likely the same as http://www.opencultuurdata.nl/wiki/tropenmuseum/. That dataset is licensed under CC-by-SA 3.0.

The following changes were made: a) new image locations were added, based on the updated wereldculturen.nl API (image URIs are based on work IDs); and b) the namespace of the kit triples was changed from hash to slash URIs, and whitespace was removed from URIs.
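The hash-to-slash change and the whitespace cleanup can be pictured with a small sketch. This is not the script that was actually used for the conversion: the hash_to_slash function and the example.org URI are hypothetical, and the sketch assumes that whitespace is simply deleted rather than replaced by another character.

```python
import re


def hash_to_slash(uri: str) -> str:
    """Remove all whitespace from a URI and turn its last '#' into '/'."""
    # Assumption: whitespace is dropped entirely, not replaced by e.g. '_'.
    cleaned = re.sub(r"\s+", "", uri)
    base, sep, fragment = cleaned.rpartition("#")
    if not sep:
        # No '#' present, so the URI is already a slash URI.
        return cleaned
    # Avoid a double slash when the base already ends with '/'.
    return f"{base.rstrip('/')}/{fragment}"


# Hypothetical URI used only for illustration.
print(hash_to_slash("http://example.org/kit#Object 123"))
```

A URI that contains neither a hash nor whitespace passes through unchanged, so applying the function twice gives the same result as applying it once.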

The collection and vocabularies are found in the kit_dive folder.

Collection subset: all objects are now in the dataset. This adds up to 78,270 objects and 1.9 million triples.

SVCN: the thesaurus used to annotate Tropenmuseum objects. It is (partially) aligned with GTAA. There is also a list of actors (unaligned for now).