dzieciou / tree-labeller

Helps label training data using taxonomy information.
BSD 3-Clause "New" or "Revised" License

Create middle size dataset for annotation #32

Open dzieciou opened 1 year ago

dzieciou commented 1 year ago

Perhaps it should come from a different domain, because we might be too biased toward the current one.

Ideas for datasets:

Sources:

dzieciou commented 1 year ago

Loading the data into Virtuoso (based on https://docs.openlinksw.com/virtuoso/rdfperfloading/):

DB.DBA.LD_DIR ('/usr/share/virtuoso-opensource-7/vad', '%.nt', 'https://bnb.data.bl.uk');
DB.DBA.rdf_loader_run ();

SPARQL query to get the dataset:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
prefix blt: <http://www.bl.uk/schemas/bibliographic/blterms#>

select
   ?book
   ?title
   (group_concat(distinct ?ddc_label; separator=';') as ?ddcs)
   (group_concat(distinct ?lchs_label; separator=';') as ?lchss)
   (min(?author) as ?first_author)
   (min(?abstract) as ?abstract2)
where  
{
   ?book <http://purl.org/dc/terms/subject> ?ddc.
   ?book <http://purl.org/dc/terms/subject> ?lchs.
   ?ddc rdf:type <http://www.bl.uk/schemas/bibliographic/blterms#TopicDDC>.
   ?ddc rdfs:label ?ddc_label.
   ?lchs rdf:type <http://www.bl.uk/schemas/bibliographic/blterms#TopicLCSH>.
   ?lchs rdfs:label ?lchs_label.
   ?book rdf:type <http://schema.org/Book> .
   ?book <http://purl.org/dc/terms/creator> ?creator .
   ?creator <http://schema.org/name> ?author .
   ?book <http://purl.org/dc/terms/title> ?title .
   ?book <http://purl.org/dc/terms/abstract> ?abstract . 

}
group by ?book ?title
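
One possible way to run a query like the one above from a script and save the result is sketched below. It is only a sketch: the endpoint URL `http://localhost:8890/sparql` is an assumption (adjust it to the actual Virtuoso instance), and `build_request` and `fetch_tsv` are hypothetical helper names, not part of tree-labeller. The request follows the SPARQL 1.1 protocol (URL-encoded POST) and asks for tab-separated results through the `Accept` header.

```python
import urllib.parse
import urllib.request

# Assumption: a local Virtuoso instance with its SPARQL endpoint at this URL.
ENDPOINT = "http://localhost:8890/sparql"


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a SPARQL protocol POST request that asks for TSV results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,  # passing data makes urllib send a POST
        headers={"Accept": "text/tab-separated-values"},
    )


def fetch_tsv(query: str, out_path: str, endpoint: str = ENDPOINT) -> None:
    """Send the query and write the raw response body to out_path."""
    with urllib.request.urlopen(build_request(query, endpoint)) as resp:
        with open(out_path, "wb") as out:
            out.write(resp.read())
```

Calling `fetch_tsv(query_text, "books.tsv")`, with `query_text` holding the query above, would write the server's response to `books.tsv`. The network call is kept in its own function so the request can be inspected without a running server.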

Sample result:

books.tsv.zip
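
Because the query joins DDC and LCSH labels with `;`, those columns have to be split again before use. Below is a minimal sketch of that step. It assumes the file is plain tab-separated text with a header row named after the query variables (`book`, `title`, `ddcs`, `lchss`, `first_author`, `abstract2`). The layout of the attached file has not been checked, and `split_labels` and `read_books` are hypothetical names.

```python
import csv
import io

SEPARATOR = ";"  # same separator as in the group_concat calls of the query


def split_labels(cell: str) -> list[str]:
    """Split one group_concat cell into labels, dropping empty pieces."""
    return [part.strip() for part in cell.split(SEPARATOR) if part.strip()]


def read_books(handle):
    """Yield one dict per book with 'ddcs' and 'lchss' expanded to lists."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        row["ddcs"] = split_labels(row["ddcs"])
        row["lchss"] = split_labels(row["lchss"])
        yield row


# Made-up row for illustration only; it is not taken from books.tsv.
sample = (
    "book\ttitle\tddcs\tlchss\tfirst_author\tabstract2\n"
    "http://example.org/b1\tExample title\t510;004\t"
    "Mathematics;Computers\tDoe, Jane\tAn abstract.\n"
)
books = list(read_books(io.StringIO(sample)))
```

A label that itself contains `;` would be split incorrectly, so a separator that cannot occur in the labels may be a safer choice in the query.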