ankitadhandha / ontomatica-linter

The Unlicense

SDL issues as of now #4

Open ankitadhandha opened 3 years ago

ankitadhandha commented 3 years ago

Hi Gregg; Ankita here!

I am writing to update you on implementing SDL using AWS services, and on proposed changes to SDL. These are my observations as of October 28. As I am sure you understand, the issues and observations change almost daily as we solve some and encounter more while customizing SDL.

0

Background

I am managing three SDL versions:

A: as-is [implemented on AWS/Lambda]

B: to-be [implemented on AWS/Lambda]

C: as-is [implemented on AWS/EC2]

A implements SDL as it is served by your site.

B implements a model that is conceptually illustrated here: https://afdsi.com/sdl/prototype/

C runs very large @graphs (that take hours to execute) on an AWS/EC2 server.

In order to implement B, we use SDL components to generate different views of structured data.

Version A is seen here: https://5hw2bg5kz3.execute-api.us-east-1.amazonaws.com/Prod/

Version B is seen here: https://t8oykz4bta.execute-api.us-east-1.amazonaws.com/Prod

Version C is currently offline (as it is expensive to run).

There are differences, and issues, as discussed below.

1

File size and graph complexity create problems when using AWS/Lambda.

The Google SDTT file size limit is 2.5 MB. However, SDTT will process significantly complex graphs for a file size < 2.5 MB (graphs that massively denormalize statements declared using @id and that are organized in third normal form, as illustrated in our examples below).

In contrast, there is no (known) limit when running SDL on a dedicated EC2 server (Version C). We have processed a large file with 900,000+ triples (statements).

But there are issues when implementing SDL on AWS/Lambda. I'll simplify the issues here, and we can review the details later.

Version A has 4 components. Version B has 8 components.

We can run moderately sized graphs on Version A. Here are two examples:

Data source: https://afdsi.com/sdl/language/data.json

Lint: https://5hw2bg5kz3.execute-api.us-east-1.amazonaws.com/Prod/?url=https:%2F%2Fafdsi.com%2Fsdl%2Flanguage%2Fdata.json

Data source: https://afdsi.com/sdl/organization/data.json

Lint: https://5hw2bg5kz3.execute-api.us-east-1.amazonaws.com/Prod/?url=https:%2F%2Fafdsi.com%2Fsdl%2Forganization%2Fdata.json

Neither will execute here: http://linter.structured-data.org/

Both are processed by SDTT here: https://search.google.com/structured-data/testing-tool/u/0/

https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fafdsi.com%2Fsdl%2Flanguage%2Fdata.json

https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fafdsi.com%2Fsdl%2Forganization%2Fdata.json

However, none of the above can be processed using Version B (8 components). This working example is much smaller: http://linter.structured-data.org/examples/schema.org/Book-CreativeWork-accessibilityFeature-accessibilityHazard-accessibilityControl-accessibilityAPI-226-jsonld.html

Anything much bigger will fail (Null result).

Bottom line: File size and graph complexity create problems when using AWS/Lambda

Issue: Running SDL on AWS/Lambda is low cost but dependent on the features planned for Version B. Running SDL Version B on the appropriate AWS/EC2 server is expensive but (nearly) limitless. How to prepare the market for a bifurcated solution?

2

We generate a visual graph from RDF and store the image in a /tmp folder. The intention is to display the visual in a grid cell (see below). AWS/Lambda will write to a /tmp folder. However, as far as we know, SDL does not support access to (construct the route to) a /tmp folder.

Issue: How to systematically access and display images from a /tmp folder?

We have a workaround, as you see in the example. But we're concerned that tmp files will be overwritten when multiple users run SDL at or near the same time. We're uncertain how to deal with semi-persistent data.

3

If the API fails to return a graph, AWS/Lambda returns a "Broken Pipe" error.

Issue: How to implement exception handling for API responses?

4

Build a dialog box for clicking a grid-cell as shown in https://afdsi.com/sdl/prototype/

Issue: Need to build a preview, fitting a grid cell, that opens full screen in a light-box.

5

How to create a hierarchy of linked tags for display in a dendrogram-like tree?

Issue: The hierarchy should be presented in an accordion.

6

SDTT enables debugging using two screens: source (left) and errors (right). Errors in source are highlighted. User can repair source errors and rerun SDTT.

Issue: Is a practical variation possible with SDL?

7

SDL footprint. We may achieve better AWS/Lambda operation and performance if we can reduce the size of SDL.

Issue: Are there configurations (flavors) of SDL that we can build for different services? For example, can we configure (and explain in documentation): SDL Lite, SDL Standard, SDL Full?

Those configurations might be associated with a free tier (AWS/Lambda); and then reasonable pricing for linting on larger servers.

8

Import new schema.org and other ontologies (e.g. SKOS)

We need to update SDL at the same frequency as schema.org is released. We also would like to integrate other ontologies for validation, such as SKOS. We think, but are not certain, that an ontology @import will produce better results than initializing @context with links to other vocabularies. Like schema.org, SKOS has a grammar that needs to be enforced in a linter/validator. If our assumption is correct, we need documentation about how to systematically import other ontologies. Such a feature would significantly differentiate SDL from SDTT, and would be a valuable service to the RDF community.

9

SDL integrates a reasoner. https://rubygems.org/gems/rdf-reasoner

It might be useful to display "reasoned recommendations". In other words, if reasoning returns a recommendation from a linting-session, how to present the recommendation to the user? A user may accept the reasoning (i.e. the addition of missing information based on related analysis) and add the missing information to the graph. Or not - ignore the advice as they also might ignore SDL error messages.

Issue: How to display reasoning? How to integrate reasoned recommendations with a non-persistent SDL file, and then re-lint?

10

We suspect that you may have a TODO list. And, like Jarno, other users may also have suggested improvements to SDL over time. We'd like to build one list to have a better long-term SDL vision and plan.

For discussion

Ankita Dhandha

gkellogg commented 3 years ago

1

File size and graph complexity create problems when using AWS/Lambda.

Bottom line: File size and graph complexity create problems when using AWS/Lambda

One thing to experiment with would be to use jsonld.js on the client to parse embedded JSON-LD and pass the N-Quads output to the server (using the toRdf method), which would offload quite a bit of the processing burden. This wouldn't help with RDFa or Microdata, but you could detect those formats on the client.

Issue: Running SDL on AWS/Lambda is low cost but dependent on the features planned for Version B. Running SDL Version B on the appropriate AWS/EC2 server is expensive but (nearly) limitless. How to prepare the market for a bifurcated solution?

Perhaps a model similar to travis-ci.org vs travis-ci.com, where the .com version is commercial typically requiring a paid license, and the .org version is free for non-commercial use, but has some limitations.

2

We generate a visual graph from RDF and store the image in a /tmp folder. The intention is to display the visual in a grid cell (see below). AWS/Lambda will write to a /tmp folder. However, as far as we know, SDL does not support access to (construct the route to) a /tmp folder.

Issue: How to systematically access and display images from a /tmp folder?

You might use a cookie with a UUID, which can be used for naming temporary files, although you'll need to garbage collect this after the session ends, or based on last-accessed-time if you think the graph might be shared outside of the linter.
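A minimal sketch of the UUID idea, written in Python purely for illustration (the linter itself is Ruby). The function names, the `graph-<uuid>.png` naming scheme, and the idle timeout are all assumptions, not part of SDL. It validates the cookie value as a UUID before using it in a file name, and garbage-collects images by their most recent access or modification time.

```python
import time
import uuid
from pathlib import Path
from typing import List, Optional


def new_session_id() -> str:
    """Generate the value that would be stored in the session cookie."""
    return str(uuid.uuid4())


def session_image_path(tmp_dir: Path, session_id: str) -> Path:
    """Map a session cookie value to a per-session image path (hypothetical naming)."""
    # Parsing as a UUID rejects values such as "../../etc/passwd",
    # so the cookie cannot be used to escape tmp_dir.
    canonical = str(uuid.UUID(session_id))
    return tmp_dir / f"graph-{canonical}.png"


def collect_garbage(
    tmp_dir: Path, max_idle_seconds: float, now: Optional[float] = None
) -> List[Path]:
    """Delete graph images that have been idle too long; return the removed paths."""
    now = time.time() if now is None else now
    removed = []
    for path in sorted(tmp_dir.glob("graph-*.png")):
        info = path.stat()
        # Some filesystems do not update access times on read,
        # so fall back to the modification time when it is newer.
        last_used = max(info.st_atime, info.st_mtime)
        if now - last_used > max_idle_seconds:
            path.unlink()
            removed.append(path)
    return removed
```

On AWS/Lambda, /tmp is local to one execution environment and is not guaranteed to survive between invocations, so this only addresses name collisions, not persistence.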

We have a workaround, as you see in the example. But we're concerned that tmp files will be overwritten when multiple users run SDL at or near the same time. We're uncertain how to deal with semi-persistent data.

Another idea would be to name the temporary file based on a hash of the input graph; RDF::Statement supports a #hash method that might work for this. You can use repo.statements.hash, which should do the trick, but we should change to use the rdf-ordered-repo, instead of RDF::Repository, which will give more predictable results. That would be a simple change to the linter.
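To illustrate content-based naming, here is a rough Python sketch. It is a stand-in for the Ruby `#hash` approach mentioned above, not that API: it assumes the graph is available as N-Quads text, and the function names and the 16-character truncation are hypothetical. Statements are de-duplicated and sorted before hashing so that statement order does not change the file name. Blank node labels are not canonicalized, so two equivalent graphs with different blank node labels would still get different names.

```python
import hashlib


def graph_digest(nquads: str) -> str:
    """Return a SHA-256 hex digest that is independent of statement order."""
    statements = {
        line.strip()
        for line in nquads.splitlines()
        # Skip blank lines and N-Quads comment lines.
        if line.strip() and not line.lstrip().startswith("#")
    }
    digest = hashlib.sha256()
    for statement in sorted(statements):
        digest.update(statement.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def image_name_for(nquads: str) -> str:
    """Derive a deterministic image file name from the graph content."""
    return f"graph-{graph_digest(nquads)[:16]}.png"
```

Because the name depends only on the input, two users linting the same graph share one image, and users linting different graphs do not overwrite each other.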

3

If the API fails to return a graph, AWS/Lambda returns a "Broken Pipe" error.

Issue: How to implement exception handling for API responses?

If the event loop is separate for accessing the graph, perhaps you could detect a failed promise and give the user an option to retry.
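The same pattern can be sketched on the server side. This is an illustrative Python sketch under stated assumptions: `fetch` is a hypothetical zero-argument callable that returns the graph or raises, and `GraphResult` is an invented result type. Instead of letting a broken pipe propagate, the wrapper retries a few times and then returns a structured failure that the UI could turn into a "retry" prompt.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GraphResult:
    ok: bool
    body: Optional[str] = None
    error: Optional[str] = None
    attempts: int = 0


def fetch_with_retry(
    fetch: Callable[[], str], max_attempts: int = 3, delay_seconds: float = 0.5
) -> GraphResult:
    """Call fetch(), retrying on I/O errors, and never raise those errors to the caller."""
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        try:
            return GraphResult(ok=True, body=fetch(), attempts=attempt)
        except OSError as exc:
            # BrokenPipeError, ConnectionError and TimeoutError are all OSError subclasses.
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < max_attempts:
                # Wait a little longer before each subsequent attempt.
                time.sleep(delay_seconds * attempt)
    return GraphResult(ok=False, error=last_error, attempts=max_attempts)
```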

4

Build a dialog box for clicking a grid-cell as shown in https://afdsi.com/sdl/prototype/

Issue: Need to build a preview, fitting a grid cell, that opens full screen in a light-box.

I think you should be able to do this entirely client-side.

5

How to create a hierarchy of linked tags for display in a dendrogram-like tree?

Issue: The hierarchy should be presented in an accordion.

👍
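One possible way to prepare the data for an accordion or dendrogram, sketched in Python. It assumes the tag links are available as (child, parent) pairs, for example derived from subclass or broader-term relations. The pair format, the function name, and the `label`/`children` keys are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple


def build_tree(pairs: Iterable[Tuple[str, str]]) -> List[dict]:
    """Turn (child, parent) pairs into nested {"label", "children"} dicts."""
    children: Dict[str, List[str]] = defaultdict(list)
    nodes = set()
    has_parent = set()
    for child, parent in pairs:
        children[parent].append(child)
        nodes.update((child, parent))
        has_parent.add(child)

    def subtree(label: str, ancestors: FrozenSet[str]) -> dict:
        # Stop descending if a label repeats on its own path (cycle guard).
        if label in ancestors:
            return {"label": label, "children": []}
        path = ancestors | {label}
        kids = sorted(children.get(label, []))
        return {"label": label, "children": [subtree(k, path) for k in kids]}

    # Roots are the nodes that never appear as a child.
    roots = sorted(nodes - has_parent)
    return [subtree(root, frozenset()) for root in roots]
```

Each nested dict maps naturally onto one accordion panel, with `children` rendered as the panel's nested content.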

6

SDTT enables debugging using two screens: source (left) and errors (right). Errors in source are highlighted. User can repair source errors and rerun SDTT.

Issue: Is a practical variation possible with SDL?

Highlighting errors in source is hard, particularly because the JSON-LD processor does not retain line numbers. Presumably a version could do this if a JSON parser retained the information somehow, but otherwise it is a difficult problem. I think you can get it from RDFa by using the processor graph, but not from Microdata.

However, the JSON-LD Playground is able to do this (for JSON-LD), which may be another argument for offloading the JSON-LD parsing to jsonld.js. You could adopt large parts of the JSON-LD Playground, for which there should be no problem in sharing. (See https://github.com/json-ld/json-ld.org/tree/master/playground).

7

SDL footprint. We may achieve better AWS/Lambda operation and performance if we can reduce the size of SDL.

Issue: Are there configurations (flavors) of SDL that we can build for different services? For example, can we configure (and explain in documentation): SDL Lite, SDL Standard, SDL Full?

Those configurations might be associated with a free tier (AWS/Lambda); and then reasonable pricing for linting on larger servers.

The SDL Gemfile pulls in a number of parsers, and you could simplify by using only those required. If you decide to offload JSON-LD parsing to the client and send N-Quads to the server, then you wouldn't need any other parsers, as the N-Quads parser is implemented in the RDF gem.

Also, the rdf-vocab gem not only includes a large number of vocabularies, but also fields and documentation that aren't too useful in the linter itself (although using some vocabulary information for display might be useful). With some updates, we could create a different, slimmer serialization of those vocabularies.

8

Import new schema.org and other ontologies (e.g. SKOS)

We need to update SDL at the same frequency as schema.org is released. We also would like to integrate other ontologies for validation, such as SKOS. We think, but are not certain, that an ontology @import will produce better results than initializing @context with links to other vocabularies. Like schema.org, SKOS has a grammar that needs to be enforced in a linter/validator. If our assumption is correct, we need documentation about how to systematically import other ontologies. Such a feature would significantly differentiate SDL from SDTT, and would be a valuable service to the RDF community.

The OWL/RDFS versions of the vocabularies are necessary for semantic reasoning, which is where the power of the SDL to display domain/range errors comes from, and to identify inferred triples. rdf-vocab has many vocabularies, and could certainly have more, but at an increasing runtime cost, and image size. Again, this could be simplified. This does include SKOS, but you may want some SKOS reasoning to be added to the rdf-reasoner gem.

Adding new ontologies/vocabularies to the rdf-vocab gem is straightforward: basically, add them to lib/rdf/vocab.rb and run the rake task. This could certainly be automated using a GitHub Action.

9

SDL integrates a reasoner. https://rubygems.org/gems/rdf-reasoner

It might be useful to display "reasoned recommendations". In other words, if reasoning returns a recommendation from a linting-session, how to present the recommendation to the user? A user may accept the reasoning (i.e. the addition of missing information based on related analysis) and add the missing information to the graph. Or not - ignore the advice as they also might ignore SDL error messages.

The reasoning information is currently returned as part of the JSON sent back to the SDL client; I don't recall off the top of my head if the syntax issues are returned the same way, but it should be easy to identify (or could be made to be) in the JSON.

Issue: How to display reasoning? How to integrate reasoned recommendations with a non-persistent SDL file, and then re-lint?

Triples implied by the reasoner are tagged with an implied property, which could be accessed while creating the grid view to trigger display classes.
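As a rough illustration of how the grid view could react to that flag, here is a Python sketch. The JSON shape is an assumption: the `subject`, `predicate`, `object` and `implied` keys and the CSS class names are hypothetical and are not taken from the actual SDL response.

```python
from typing import Iterable, List, Tuple


def css_classes(statement: dict) -> str:
    """Pick display classes for one statement (hypothetical class names)."""
    classes = ["triple"]
    if statement.get("implied"):
        # Implied triples get an extra class so they can be styled as recommendations.
        classes.append("triple--implied")
    return " ".join(classes)


def split_statements(statements: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Separate asserted statements from those added by the reasoner."""
    asserted, implied = [], []
    for statement in statements:
        (implied if statement.get("implied") else asserted).append(statement)
    return asserted, implied
```

The implied list is what a user could accept or ignore; accepted statements would be merged into the source before linting again.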

10

We suspect that you may have a TODO list. And, like Jarno, other users over time also may have suggested improvements to SDL. We'd like to build one list to have a better long term SDL vision and plan.

I would like to see the core logic remain in https://github.com/structured-data/linter and managed using its issue list; this has proven to be an effective mechanism for driving change.

One thing I've mentioned is to integrate web components, so that the linter parses the expanded DOM. This could be easier to do if done client-side, so exploring a client-side parser seems like a good solution to many problems.