LD4P / graph-explorer

Proof-of-concept for loading Resources from Sinopia and running SPARQL queries

Graph summary results #6

Open kallimathios opened 3 months ago

kallimathios commented 3 months ago

The graph summary feature of the graph explorer reports different totals for the number of triples. The numbers change when I rebuild the graph and repeat my actions without any changes to the group or environment, and they change again when I restart the environment with a hard refresh. Results also seem to vary when I navigate between groups within an environment. The examples below cover these scenarios.

The following example comes from running the summary in the Development environment for the "All" group. I received two different results without navigating to another group or restarting the environment:

Screenshot 2024-06-20 at 1 05 49 PM Screenshot 2024-06-20 at 1 07 16 PM

I then tried restarting the environment, and received another different set of results:

Screenshot 2024-06-20 at 1 12 29 PM

While I did not get a screenshot, at one point the system returned 643,169 triples for the "All" group in Development.

I restarted the environment with a hard refresh and generated a summary for the "All" group in the Stage environment, with the following triples returned:

Screenshot 2024-06-20 at 1 16 51 PM

I then navigated to the next group, California State University, and built the graph within the Stage environment. When I went back to the "All" group, I received these numbers:

Screenshot 2024-06-20 at 1 22 13 PM

I restarted the environment with a hard refresh and tried to generate a graph summary in the Production environment for the "All" group. I received the following two results without navigating to any other groups or restarting the environment.

Screenshot 2024-06-20 at 1 39 32 PM Screenshot 2024-06-20 at 1 40 48 PM

jermnelson commented 3 months ago

Thanks @kallimathios for the detailed ticket! When I investigated this issue late last week, I found that at least part of the cause of the different numbers when loading the same graph is the presence of rdf:List structures, which are used for ordering in the Resource Templates. As a short synopsis, an rdf:List is implemented as a series of blank nodes that, together with the rdf:first and rdf:rest predicates, form an ordered list.

These intermediary rdf:List blank nodes are not being skolemized correctly with deterministic URLs. Instead, each time the same RDF resource is loaded, the Python rdflib library generates new random blank-node identifiers, so the list nodes show up as new triples. You can replicate this by loading a single resource template URL in the Graph Explorer (this example uses the PCC template https://api.development.sinopia.io/resource/pcc:bf2:Serial:Work).

Screenshot 2024-06-24 at 1 11 58 PM

Doing an initial load in graph explorer results in the following statistics:

Screenshot 2024-06-24 at 1 12 32 PM

We can then run a couple of queries to see how many triples contain rdf:first and rdf:rest:

Screenshot 2024-06-24 at 1 14 35 PM Screenshot 2024-06-24 at 1 15 32 PM

Now, if we click the Build button again for the same resource, the number of triples increases from 422 to 477: Screenshot 2024-06-24 at 1 17 35 PM

Re-run the SPARQL queries to see how many triples contain rdf:first and rdf:rest: Screenshot 2024-06-24 at 1 18 10 PM

Screenshot 2024-06-24 at 1 20 54 PM

Taking a closer look at the list of subjects and objects for rdf:first, you can see that the actual blank nodes (e.g. https://api.development.sinopia.io/resource/pcc:bf2:Serial:Work#b43) show up as duplicate subjects for the same rdf:first predicates.

I think a short-term fix is to just create a new graph every time the Build is clicked instead of trying to load the resource into the same graph. However, we will still need to address this problem as part of ticket 2.

kallimathios commented 3 months ago

Got it - this is super helpful. I will rebuild the graph each time. Also a needed reminder about the functionality to load and investigate a single resource. Thanks so much, @jermnelson !