Closed: elrayle closed 5 years ago
@eichmann See if you can download the files using these links. If you do not have the right permissions I will ask the Casalini famiglia if they can add you, as this may be the easiest way to get the files to you. I think you should only need D2 and D3, but D1 is Stanford's basic set of triples. Explore the data and see what you think:
"D1_RDF_SVDE_URIs" stanford_nt
(our catalog converted to Bibframe 2.0, with links to the Share-VDE Cluster Knowledge Base):
https://www.dropbox.com/sh/l3kh8fgblcymplm/AAAfKvaI7WYtKHkjhupsGeHQa?dl=0
"D2_RDF_KnowlegeBase" SVDE-Phase2-Deliverable2
(the Cluster Knowledge Base itself):
https://www.dropbox.com/sh/cw5cgspkgm72xwf/AACTgRSpe7LXOkTik9qYD3kVa?dl=0
"D3_RDF_external_URIs" SVDE-Phase2-Deliverable3_Stanford
(our catalog converted to Bibframe 2.0 quad store with external URIs included):
https://www.dropbox.com/sh/zo388dunl4gqmb2/AAC00syLbjZUymPuEeOvzOxna?dl=0
@jgreben Successfully downloaded and decompressed the files. I'm currently building a triplestore from deliverable 3...
@jgreben No joy: INFO Load: /Volumes/Pegasus3/LD4L/vde/stanford_3/stanford_06.nq -- 2018/09/19 15:37:55 CDT ... ERROR [line: 18081849, col: 172] Illegal character in IRI (codepoint 0x7D, '}'): http://id.loc.gov/vocabulary/geographicAreas/n-cn---[}]...
This is similar to the issues I ran into with the LoC data from this summer. Apache Jena appears to be a stickler regarding the character specs.
@eichmann looks like you split the file into smaller sets? If so, and if there are particular files that will not load because of character issues, perhaps you can just skip those for now. As long as we have a decent body of sample data to lookup, that will probably work for us.
If most of the files are not loadable because of character issues, maybe split them up into smaller chunks and do some remediation with sed or something like that? Or let me know and I will communicate that to Tiziana and see if they can remediate and upload a new compressed file. It would also be helpful to have a summary of the bad character types to send them.
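A summary of the bad character types could be produced with something like the sketch below. It is only a rough pass, not a real N-Quads parser: it strips quoted literals, treats whatever sits between `<` and `>` as an IRI, and counts characters that the N-Triples/N-Quads grammar does not allow inside an IRI. The function name `bad_iri_chars` and the sample IRIs in the usage note are made up for illustration.

```python
import re
from collections import Counter

# Characters the N-Triples/N-Quads grammar forbids inside <...> IRIs:
# control characters and space (U+0000-U+0020) plus < > " { } | ^ ` \
ILLEGAL = set('<>"{}|^`\\') | {chr(c) for c in range(0x21)}

LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')  # quoted literal, escapes allowed
IRI = re.compile(r'<([^>]*)>')              # rough IRI token match


def bad_iri_chars(lines):
    """Count illegal characters found inside IRIs, keyed by character."""
    counts = Counter()
    for line in lines:
        # Drop literals first so angle brackets inside strings are ignored.
        stripped = LITERAL.sub('""', line)
        for iri in IRI.findall(stripped):
            counts.update(ch for ch in iri if ch in ILLEGAL)
    return counts
```

Passing an open file handle for one of the .nq files (for example `bad_iri_chars(open(path, encoding="utf-8"))`) would give a per-character tally that can be sent along with the request for remediation.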
@jgreben each of the deliverable files is a compressed tar file. #3 contains 43 individual nq files, for instance. I'm running a validation check on the #3 files right now.
@jgreben I'm getting ready to redo the triplestore - does it make sense to generate a single one for all (valid) data from the three deliverables, or to generate 3 - one for each of the deliverables?
@eichmann One triplestore would be fine, as long as the cluster links are also returned along with the QA query for an entity. If deliverable 3 also includes the cluster kb links, I'm not sure we need both 1 and 3.
I've scrubbed & patched the ill-formed triples from deliverables 2 & 3 of the Stanford data and am building a combined triplestore now. I do have one build from 3 with files 6, 9 and 16_01 left out, but sounds like, given @jgreben's comment, that 3 by itself isn't of much use, and 1 appears to be an early version of 3. @jgreben can you confirm my hypothesis about 1 & 3?
Standard batch interface is now available as stanford_share_vde_work_batch.jsp and stanford_share_vde_instance_batch.jsp. We'll need to discuss result contents with the group. Some hits yield 4 triples, some 14k triples...
Waiting on data from Casalini
From Dave in Slack...
UCSD SHARE data are up and running - two query interfaces: work and instance. This is basically a clone of the Stanford SHARE configuration.
@eichmann Can you provide...
Pending completion of Cornell data ingest. At that point, the QA config will be created.
Connection points for partial content of Cornell works and instances are at
http://services.ld4l.org/ld4l_services/cornell_share_vde_work_batch.jsp
http://services.ld4l.org/ld4l_services/cornell_share_vde_instance_batch.jsp
with standard arguments. Individual term results are available at
http://services.ld4l.org/ld4l_services/cornell_share_vde_work_lookup.jsp
http://services.ld4l.org/ld4l_services/cornell_share_vde_instance_lookup.jsp
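For anyone scripting against the batch endpoints, the URL can be assembled as in this small sketch. The helper name `batch_url` is hypothetical, and the naming pattern is inferred from the Cornell and Stanford file names mentioned in this thread; the `query` and `maxRecords` parameters are the ones used in the curl example here.

```python
from urllib.parse import urlencode

BASE = "http://services.ld4l.org/ld4l_services"


def batch_url(source, entity, query, max_records):
    """Build a batch search URL, e.g. source='cornell', entity='work'."""
    # Assumed naming pattern: <source>_share_vde_<entity>_batch.jsp
    page = f"{source}_share_vde_{entity}_batch.jsp"
    params = urlencode({"query": query, "maxRecords": max_records})
    return f"{BASE}/{page}?{params}"
```

`batch_url("cornell", "work", "twain", 2)` yields the same URL as the curl call in this comment.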
curl 'http://services.ld4l.org/ld4l_services/cornell_share_vde_work_batch.jsp?query=twain&maxRecords=2'
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> <http://vivoweb.org/ontology/core#rank> "1" .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#authoritativeLabel Interpersonal relations--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/Topic/8c1b435c-f54c-391e-92cd-8715f823e53d .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Topic .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/2000/01/rdf-schema#label Interpersonal relations--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#authoritativeLabel Marriage--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/Topic/2e9f9323-3580-3c88-a416-786b61de35a7 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Topic .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/2000/01/rdf-schema#label Marriage--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#authoritativeLabel Young adults--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/Topic/1c4d2a95-e0a9-3492-a7cd-a96743cf3e81 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Topic .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/2000/01/rdf-schema#label Young adults--Fiction. .
etc.
Much more data than this. Full results in gist: https://gist.github.com/elrayle/db74b3a1baa50649ac337ab0f1ec9307
These all share the same subject URI, but based on authoritative label and schema label, the sample data shown above looks like it is for 3 different records.
My typo patch is deployed for all SHARE VDE sources, both work and instance (and a number of others no one has complained about…)
@eichmann When I do a search request from cornell_share_vde_work_batch, the first triple correctly surrounds URIs with <>. After that all other triples are missing <> around predicates and quotes around strings.
$ curl -L -D - -H 'Accept: application/n-triples' 'http://services.ld4l.org/ld4l_services/cornell_share_vde_work_batch.jsp?query=twain&maxRecords=2&lang=en'
HTTP/1.1 200
Date: Fri, 08 Mar 2019 15:15:25 GMT
Server: Apache/2.4.33 (Unix)
Content-Type: application/n-triples;charset=UTF-8
Set-Cookie: JSESSIONID=A6C94E1312B460270E2EE22EECBC55BB;path=/ld4l_services;HttpOnly
Transfer-Encoding: chunked
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> <http://vivoweb.org/ontology/core#rank> "1" .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#authoritativeLabel Interpersonal relations--Fiction. .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/Topic/8c1b435c-f54c-391e-92cd-8715f823e53d .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Topic .
etc.
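A quick way to spot this kind of output is to check each response line against the basic N-Triples shape. The sketch below is a loose approximation of the grammar (it does not handle `\u` escapes inside IRIs, for instance), and `malformed_lines` is a hypothetical helper name, not part of the service.

```python
import re

# Loose approximations of the N-Triples terms.
IRI = r'<[^\x00-\x20<>"{}|^`\\]*>'
BNODE = r'_:\S+'
LITERAL = (
    r'"(?:[^"\\\n\r]|\\.)*"'
    r'(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'
)
TRIPLE = re.compile(
    rf'^\s*(?:{IRI}|{BNODE})\s+{IRI}\s+(?:{IRI}|{BNODE}|{LITERAL})\s*\.\s*$'
)


def malformed_lines(text):
    """Return (line_number, line) pairs that do not look like N-Triples."""
    return [
        (n, line)
        for n, line in enumerate(text.splitlines(), 1)
        if line.strip() and not TRIPLE.match(line)
    ]
```

Run against the response body above, the first line would pass and every following line would be reported, since the subjects and predicates lack `<>` and the labels lack quotes.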
@eichmann When I try to fetch a single work from cornell_share_vde_work_lookup, I see the same problem where all triples, including the first, are missing <> around predicates and quotes around strings.
$ curl -L -D - -H 'Accept: application/n-triples' 'http://services.ld4l.org/ld4l_services/cornell_share_vde_work_lookup.jsp?uri=http://share-vde.org/sharevde/rdfBibframe/Work/1635833'
HTTP/1.1 200
Date: Fri, 15 Mar 2019 13:10:04 GMT
Server: Apache/2.4.33 (Unix)
Content-Type: application/n-triples;charset=UTF-8
Set-Cookie: JSESSIONID=A6884F8CFB1C66DC2A4919ECD4185C7E;path=/ld4l_services;HttpOnly
Transfer-Encoding: chunked
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#authoritativeLabel Interpersonal relations--Fiction. .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
http://share-vde.org/sharevde/rdfBibframe/Contribution/9218d195-db1f-4420-9f91-d684110cca06 http://id.loc.gov/ontologies/bibframe/agent http://share-vde.org/sharevde/rdfBibframe/Agent/1291372 .
http://share-vde.org/sharevde/rdfBibframe/Contribution/9218d195-db1f-4420-9f91-d684110cca06 http://id.loc.gov/ontologies/bibframe/role http://id.loc.gov/vocabulary/relators/ctb .
http://share-vde.org/sharevde/rdfBibframe/Contribution/9218d195-db1f-4420-9f91-d684110cca06 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Contribution .
etc.
Resolved. There's an attribute in my SPARQL query tag that allows turning on and off the syntactic sugaring around literals and URIs, so I can render them in tables as well as return parsable triples. The flag was just set the wrong way. Things should be working properly across the board for works and instances for all SHARE sources.
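To illustrate the kind of toggle described here (this is not the actual JSP tag, whose code is not shown in this thread), a rendering helper might look like the following sketch. The name `render_term` and the `sugar` parameter are hypothetical.

```python
def render_term(value, is_uri, sugar=True):
    """Render one RDF term, with or without N-Triples delimiters.

    sugar=False returns the bare value (handy for HTML tables);
    sugar=True wraps URIs in <> and quotes and escapes literals.
    """
    if not sugar:
        return value
    if is_uri:
        return f"<{value}>"
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'
```

With the flag set the wrong way, a services endpoint would emit bare values like the ones reported above, while the table view would show the delimiters.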
I am closing this issue as the initial implementation is deployed on production.
Remaining work:
Datasource: Share-VDE
Request: Consider caching Share-VDE data in services.ld4l.org.
Expected Workflow: