LD4P / qa_server

A Rails engine with the Questioning Authority gem installed to serve as an authority search server with normalized results.
Apache License 2.0

Request datasource Share-VDE be supported by QaServer #16

Closed elrayle closed 5 years ago

elrayle commented 6 years ago

Datasource: Share-VDE

Request: Consider caching Share-VDE data in services.ld4l.org.

Expected Workflow:

elrayle commented 6 years ago
jgreben commented 6 years ago

@eichmann See if you can download the files using these links. If you do not have the right permissions I will ask the Casalini famiglia if they can add you, as this may be the easiest way to get the files to you. I think you should only need D2 and D3, but D1 is Stanford's basic set of triples. Explore the data and see what you think:

"D1_RDF_SVDE_URIs" stanford_nt (our catalog converted to Bibframe 2.0, with links to the Share-VDE Cluster Knowledge Base): https://www.dropbox.com/sh/l3kh8fgblcymplm/AAAfKvaI7WYtKHkjhupsGeHQa?dl=0

"D2_RDF_KnowlegeBase" SVDE-Phase2-Deliverable2 (the Cluster Knowledge Base itself): https://www.dropbox.com/sh/cw5cgspkgm72xwf/AACTgRSpe7LXOkTik9qYD3kVa?dl=0

"D3_RDF_external_URIs" SVDE-Phase2-Deliverable3_Stanford (our catalog converted to Bibframe 2.0 quad store with external URIs included): https://www.dropbox.com/sh/zo388dunl4gqmb2/AAC00syLbjZUymPuEeOvzOxna?dl=0

eichmann commented 6 years ago

@jgreben Successfully downloaded and decompressed the files. I'm currently building a triplestore from deliverable 3...

eichmann commented 6 years ago

@jgreben No joy:

INFO Load: /Volumes/Pegasus3/LD4L/vde/stanford_3/stanford_06.nq -- 2018/09/19 15:37:55 CDT
...
ERROR [line: 18081849, col: 172] Illegal character in IRI (codepoint 0x7D, '}'): http://id.loc.gov/vocabulary/geographicAreas/n-cn---[}]...

This is similar to the issues I ran into with the LoC data from this summer. Apache Jena appears to be a stickler regarding the character specs.

jgreben commented 6 years ago

@eichmann looks like you split the file into smaller sets? If so, and if there are particular files that will not load because of character issues, perhaps you can just skip those for now. As long as we have a decent body of sample data to look up, that will probably work for us.

If most of the files are not loadable because of character issues, maybe split them up into smaller chunks and do some remediation with sed or something like that? Or let me know and I will communicate that to Tiziana and see if they can remediate and upload a new compressed file. It would also be helpful to have a summary of the bad character types to send them.
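One way to produce the summary of bad character types is sketched below. It is not the validation that was actually run for this issue. It assumes IRIs can be pulled out of each N-Quads line as the text between `<` and `>` (so a literal that happens to contain angle brackets would be miscounted), and it uses the characters excluded by the IRIREF rule of the RDF 1.1 N-Triples/N-Quads grammar. The names `summarize_bad_chars` and `illegal_iri_chars` and the `example.org` IRIs are hypothetical.

```python
import re
from collections import Counter

# Text between '<' and '>' on a line; a rough stand-in for a real N-Quads parser.
IRI_PATTERN = re.compile(r"<([^>\n]*)>")

# Characters the N-Triples/N-Quads IRIREF rule does not allow unescaped,
# in addition to anything at or below U+0020 (control characters and space).
ILLEGAL_IN_IRI = set('<>"{}|^`\\')


def illegal_iri_chars(iri):
    """Return the characters in one IRI that the grammar rejects."""
    return [ch for ch in iri if ch in ILLEGAL_IN_IRI or ord(ch) <= 0x20]


def summarize_bad_chars(lines):
    """Count illegal IRI characters across an iterable of N-Quads lines."""
    counts = Counter()
    for line in lines:
        for iri in IRI_PATTERN.findall(line):
            counts.update(illegal_iri_chars(iri))
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the Share-VDE deliverables.
    sample = [
        '<http://example.org/s> <http://example.org/p> <http://example.org/area/n-cn---}> <http://example.org/g> .',
        '<http://example.org/s> <http://example.org/p> "ok" .',
    ]
    for ch, n in summarize_bad_chars(sample).most_common():
        print(f"U+{ord(ch):04X} {ch!r}: {n}")
```

Passing an open file object instead of a list would let the same function stream through each .nq file, and the resulting counts per character could be sent along with the request for remediation.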

eichmann commented 6 years ago

@jgreben each of the deliverable files is a compressed tar file. #3 contains 43 individual nq files, for instance. I'm running a validation check on the #3 files right now.

eichmann commented 5 years ago

@jgreben I'm getting ready to redo the triplestore - does it make sense to generate a single one for all (valid) data from the three deliverables, or to generate 3 - one for each of the deliverables?

jgreben commented 5 years ago

@eichmann One triplestore would be fine, as long as the cluster links are also returned along with the QA query for an entity. If deliverable 3 also includes the cluster kb links, I'm not sure we need both 1 and 3.

eichmann commented 5 years ago

I've scrubbed & patched the ill-formed triples from deliverables 2 & 3 of the Stanford data and am building a combined triplestore now. I do have one build from 3 with files 6, 9 and 16_01 left out, but it sounds like, given @jgreben's comment, 3 by itself isn't of much use, and 1 appears to be an early version of 3. @jgreben can you confirm my hypothesis about 1 & 3?

eichmann commented 5 years ago

Standard batch interface is now available as stanford_share_vde_work_batch.jsp and stanford_share_vde_instance_batch.jsp. We'll need to discuss result contents with the group. Some hits yield 4 triples, some 14k triples...

jak473 commented 5 years ago

Waiting on data from Casalini

elrayle commented 5 years ago

From Dave in Slack...

UCSD SHARE data are up and running - two query interfaces: work and instance. This is basically a clone of the Stanford SHARE configuration.

elrayle commented 5 years ago

@eichmann Can you provide...

elrayle commented 5 years ago

Pending completion of Cornell data ingest. At that point, the QA config will be created.

eichmann commented 5 years ago

Connection points for partial content of Cornell works and instances are at

http://services.ld4l.org/ld4l_services/cornell_share_vde_work_batch.jsp
http://services.ld4l.org/ld4l_services/cornell_share_vde_instance_batch.jsp

with standard arguments. Individual term results available at

http://services.ld4l.org/ld4l_services/cornell_share_vde_work_lookup.jsp
http://services.ld4l.org/ld4l_services/cornell_share_vde_instance_lookup.jsp
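A small sketch of how a client might assemble request URLs for these connection points. The parameter names `query`, `maxRecords`, `lang` and `uri` are the ones used in the curl requests elsewhere in this thread; whether they are the complete set of "standard arguments" is an assumption. The helper names `batch_url` and `lookup_url` are hypothetical, and the sketch percent-encodes the `uri` value, on the assumption that the service decodes it in the usual way.

```python
from urllib.parse import urlencode

BASE_URL = "http://services.ld4l.org/ld4l_services"


def batch_url(source, kind, query, max_records=None, lang=None):
    """Build a search URL, e.g. source='cornell', kind='work' or 'instance'."""
    params = {"query": query}
    if max_records is not None:
        params["maxRecords"] = max_records
    if lang:
        params["lang"] = lang
    return f"{BASE_URL}/{source}_share_vde_{kind}_batch.jsp?{urlencode(params)}"


def lookup_url(source, kind, uri):
    """Build a single-term lookup URL; the term URI is percent-encoded."""
    return f"{BASE_URL}/{source}_share_vde_{kind}_lookup.jsp?{urlencode({'uri': uri})}"


if __name__ == "__main__":
    print(batch_url("cornell", "work", "twain", max_records=2))
    print(lookup_url("cornell", "work",
                     "http://share-vde.org/sharevde/rdfBibframe/Work/1635833"))
```

Because the Stanford and UCSD endpoints follow the same naming pattern, the same helpers would apply with `source="stanford"`, assuming the file names stay consistent across sources.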

elrayle commented 5 years ago

Query

curl 'http://services.ld4l.org/ld4l_services/cornell_share_vde_work_batch.jsp?query=twain&maxRecords=2'

Results

<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> <http://vivoweb.org/ontology/core#rank> "1" .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#authoritativeLabel Interpersonal relations--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/Topic/8c1b435c-f54c-391e-92cd-8715f823e53d .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Topic .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/2000/01/rdf-schema#label Interpersonal relations--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#authoritativeLabel Marriage--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/Topic/2e9f9323-3580-3c88-a416-786b61de35a7 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Topic .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/2000/01/rdf-schema#label Marriage--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#authoritativeLabel Young adults--Fiction. .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/Topic/1c4d2a95-e0a9-3492-a7cd-a96743cf3e81 .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Topic .
<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> http://www.w3.org/2000/01/rdf-schema#label Young adults--Fiction. .
etc.

Much more data than this. Full results in gist: https://gist.github.com/elrayle/db74b3a1baa50649ac337ab0f1ec9307

Observations

These all share the same subject URI, but based on authoritative label and schema label, the sample data shown above looks like it is for 3 different records.

eichmann commented 5 years ago

My typo patch is deployed for all SHARE VDE sources, both work and instance (and a number of others no one has complained about…)

elrayle commented 5 years ago

@eichmann When I do a search request from cornell_share_vde_work_batch, the first triple correctly surrounds URIs with <>. After that, all other triples are missing <> around URIs and quotes around string literals.

$ curl -L -D - -H 'Accept: application/n-triples' 'http://services.ld4l.org/ld4l_services/cornell_share_vde_work_batch.jsp?query=twain&maxRecords=2&lang=en'
HTTP/1.1 200 
Date: Fri, 08 Mar 2019 15:15:25 GMT
Server: Apache/2.4.33 (Unix)
Content-Type: application/n-triples;charset=UTF-8
Set-Cookie: JSESSIONID=A6C94E1312B460270E2EE22EECBC55BB;path=/ld4l_services;HttpOnly
Transfer-Encoding: chunked

<http://share-vde.org/sharevde/rdfBibframe/Work/1635833> <http://vivoweb.org/ontology/core#rank> "1" .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#authoritativeLabel Interpersonal relations--Fiction. .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/Topic/8c1b435c-f54c-391e-92cd-8715f823e53d .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Topic .
etc.

elrayle commented 5 years ago

@eichmann When I try to fetch a single work from cornell_share_vde_work_lookup, I see the same problem: all triples, including the first, are missing <> around URIs and quotes around string literals.

$ curl -L -D - -H 'Accept: application/n-triples' 'http://services.ld4l.org/ld4l_services/cornell_share_vde_work_lookup.jsp?uri=http://share-vde.org/sharevde/rdfBibframe/Work/1635833'
HTTP/1.1 200 
Date: Fri, 15 Mar 2019 13:10:04 GMT
Server: Apache/2.4.33 (Unix)
Content-Type: application/n-triples;charset=UTF-8
Set-Cookie: JSESSIONID=A6884F8CFB1C66DC2A4919ECD4185C7E;path=/ld4l_services;HttpOnly
Transfer-Encoding: chunked

http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#authoritativeLabel Interpersonal relations--Fiction. .
http://id.loc.gov/authorities/subjects/sh2008104850 http://www.loc.gov/mads/rdf/v1#elementValue http://share-vde.org/sharevde/rdfBibframe/GenreForm/c3f11f42-7a06-3004-ada7-bc1c37c5a3e1 .
http://share-vde.org/sharevde/rdfBibframe/Contribution/9218d195-db1f-4420-9f91-d684110cca06 http://id.loc.gov/ontologies/bibframe/agent http://share-vde.org/sharevde/rdfBibframe/Agent/1291372 .
http://share-vde.org/sharevde/rdfBibframe/Contribution/9218d195-db1f-4420-9f91-d684110cca06 http://id.loc.gov/ontologies/bibframe/role http://id.loc.gov/vocabulary/relators/ctb .
http://share-vde.org/sharevde/rdfBibframe/Contribution/9218d195-db1f-4420-9f91-d684110cca06 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://id.loc.gov/ontologies/bibframe/Contribution .

etc.

eichmann commented 5 years ago

Resolved. There's an attribute in my SPARQL query tag that allows turning on and off the syntactic sugaring around literals and URIs so I can render them in tables as well as return parsable triples. The flag was just set the wrong way. Things should be working properly across the board for works and instances for all SHARE sources.
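To illustrate the effect of such a flag (this is not the actual JSP tag, whose attribute name is not given here), the sketch below renders the same triple with and without delimiters and checks each against a simplified N-Triples line pattern. The pattern approximates the RDF 1.1 grammar and ignores `\u` escapes inside IRIs. `render_term`, `render_triple` and the `delimit` parameter are hypothetical names.

```python
import re

# Simplified N-Triples terms: IRIs in <>, blank nodes, quoted literals.
IRI = r'<[^\x00-\x20<>"{}|^`\\]*>'
BNODE = r'_:\S+'
LITERAL = r'"(?:[^"\\\n]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r')?'
NT_LINE = re.compile(
    r'^(?:' + IRI + r'|' + BNODE + r')\s+' + IRI +
    r'\s+(?:' + IRI + r'|' + BNODE + r'|' + LITERAL + r')\s*\.\s*$'
)


def render_term(value, is_literal, delimit):
    """Render one term; with delimit=False the bare value is returned."""
    if not delimit:
        return value  # suitable for an HTML table cell, not for a parser
    if is_literal:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"


def render_triple(s, p, o, o_is_literal, delimit=True):
    """Render a triple as one line, ending with the ' .' terminator."""
    terms = [
        render_term(s, False, delimit),
        render_term(p, False, delimit),
        render_term(o, o_is_literal, delimit),
    ]
    return " ".join(terms) + " ."


def is_parsable(line):
    """True if the line matches the simplified N-Triples pattern."""
    return NT_LINE.match(line) is not None
```

With `delimit=False` the output has the same shape as the malformed lines reported earlier in this thread and `is_parsable` rejects it; with `delimit=True` the line passes the check.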

elrayle commented 5 years ago

I am closing this issue as the initial implementation is deployed on production.

Remaining work: