comunica / comunica-feature-link-traversal

📬 Comunica packages for link traversal-based query execution

Link queue wrapper to record occupancy information in a file #129

Closed constraintAutomaton closed 1 month ago

constraintAutomaton commented 3 months ago

Objective

The objective of this actor is to collect information about the link queue and serialize it to a file. The information collected is the following:

Custom link queue properties can also be added via IOptionalLinkQueueParameters, but the code has to be modified to parse them.

A label for each current reachability criterion has also been added. These labels can optionally be attached to each link coming out of the extract-links actors. This feature is not only meant to document the link queue occupancy; it can also be used in future work for engine optimizations such as link queue ordering.

The @comunica/actor-rdf-resolve-hypermedia-links-queue-wrapper-info-occupance:query-identifier property can also be used to identify the query.

Example output

Config of the wrapper actor

{
  "@context": [
    "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/runner/^3.0.0/components/context.jsonld",

    "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-rdf-resolve-hypermedia-links-queue-wrapper-info-occupancy/^0.0.0/components/context.jsonld"
  ],
  "@id": "urn:comunica:default:Runner",
  "@type": "Runner",
  "actors": [
    {
      "@id": "urn:comunica:default:rdf-resolve-hypermedia-links-queue/actors#wrapper-info-occupancy",
      "@type": "ActorRdfResolveHypermediaLinksQueueWrapperInfoOccupancy",
      "beforeActors": { "@id": "urn:comunica:default:rdf-resolve-hypermedia-links-queue/actors#fifo" },
      "mediatorRdfResolveHypermediaLinksQueue": { "@id": "urn:comunica:default:rdf-resolve-hypermedia-links-queue/mediators#main" },
      "filePath": "./link_queue_info.json"
    }
  ]
}

Config of the engine

{
    "@context": [
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/config-query-sparql/^2.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/config-query-sparql-link-traversal/^0.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-extract-links-predicates/^0.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-extract-links-solid-type-index/^0.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-extract-links-quad-pattern-query/^0.0.0/components/context.jsonld",
        "https://linkedsoftwaredependencies.org/bundles/npm/@comunica/runner/^3.0.0/components/context.jsonld"
    ],
    "import": [
        "ccqslt:config/config-solid-base.json",
        "ccqslt:config/rdf-resolve-hypermedia-links-queue/actors/wrapper-info-occupancy.json"
    ],
    "@graph": [
        {
            "@id": "urn:comunica:default:Runner",
            "@type": "Runner",
            "actors": [
                {
                    "@id": "urn:comunica:default:extract-links/actors#predicates-common",
                    "@type": "ActorExtractLinksPredicates",
                    "checkSubject": false,
                    "predicateRegexes": [
                        "http://www.w3.org/2000/01/rdf-schema#seeAlso",
                        "http://www.w3.org/2002/07/owl#sameAs",
                        "http://xmlns.com/foaf/0.1/isPrimaryTopicOf"
                    ],
                    "labelLinksWithReachability":true
                },
                {
                    "@id": "urn:comunica:default:extract-links/actors#predicates-ldp",
                    "@type": "ActorExtractLinksPredicates",
                    "checkSubject": true,
                    "predicateRegexes": [
                        "http://www.w3.org/ns/ldp#contains"
                    ],
                    "labelLinksWithReachability":true
                },
                {
                    "@id": "urn:comunica:default:extract-links/actors#predicates-solid",
                    "@type": "ActorExtractLinksPredicates",
                    "checkSubject": true,
                    "predicateRegexes": [
                        "http://www.w3.org/ns/pim/space#storage"
                    ],
                    "labelLinksWithReachability":true
                },
                {
                    "@id": "urn:comunica:default:extract-links/actors#quad-pattern-query",
                    "@type": "ActorExtractLinksQuadPatternQuery",
                    "labelLinksWithReachability":true
                },
                {
                    "@id": "urn:comunica:default:extract-links/actors#solid-type-index",
                    "@type": "ActorExtractLinksSolidTypeIndex",
                    "inference": false,
                    "mediatorDereferenceRdf": {
                        "@id": "urn:comunica:default:dereference-rdf/mediators#main"
                    },
                    "labelLinksWithReachability":true
                }
            ]
        }
    ]
}

Output

{
    "iris_popped": [
        {
            "url": "http://localhost:3000/pods/00000000000000000933/",
            "reachability_criteria": "cSolidStorage",
            "timestamp": 1711439007006
        },
        {
            "url": "http://localhost:3000/www.ldbc.eu/ldbc_socialnet/1.0/data/forum00000001099511627784",
            "reachability_criteria": "cCommon",
            "timestamp": 1711439008917
        }
    ],
    "iris_pushed": [
        {
            "url": "http://localhost:3000/pods/00000000000000000933/",
            "reachability_criteria": "cSolidStorage",
            "timestamp": 1711439007006
        },
        {
            "url": "http://localhost:3000/dbpedia.org/resource/Kelaniya",
            "reachability_criteria": "cMatch",
            "timestamp": 1711439007006
        },
        {
            "url": "http://localhost:3000/www.ldbc.eu/ldbc_socialnet/1.0/data/forum00000001099511627784",
            "reachability_criteria": "cCommon",
            "timestamp": 1711439008916
        }
    ],
    "started_empty": true,
    "query": {
        "type": "project",
        "input": {
            ...
         }
    }
}

If an identifier was given for the query, then the query field will contain this identifier instead.
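To illustrate how this output could be analyzed afterwards, here is a minimal sketch that replays the recorded events to reconstruct the queue occupancy over time. The `LinkEvent` shape mirrors the entries of `iris_pushed` and `iris_popped` shown above; the `occupancyTimeline` helper and the `OccupancyPoint` type are hypothetical names and are not part of this PR. The sketch assumes the queue started empty (`"started_empty": true`).

```typescript
// One entry of "iris_pushed" / "iris_popped", as seen in the example output.
interface LinkEvent {
  url: string;
  reachability_criteria: string | null;
  timestamp: number;
}

// Hypothetical result type: number of links in the queue after each event.
interface OccupancyPoint {
  timestamp: number;
  occupancy: number;
}

// Hypothetical helper: replay pushes (+1) and pops (-1) in timestamp order.
function occupancyTimeline(pushed: LinkEvent[], popped: LinkEvent[]): OccupancyPoint[] {
  const deltas = [
    ...pushed.map(event => ({ timestamp: event.timestamp, delta: 1 })),
    ...popped.map(event => ({ timestamp: event.timestamp, delta: -1 })),
  ];
  // Sort by time; on equal timestamps, apply pushes before pops
  // so that the running count never drops below zero.
  deltas.sort((a, b) => a.timestamp - b.timestamp || b.delta - a.delta);

  let occupancy = 0;
  return deltas.map(({ timestamp, delta }) => {
    occupancy += delta;
    return { timestamp, occupancy };
  });
}
```

In practice the two arrays would come from parsing the file configured via `filePath`, for example with `JSON.parse`. For the example output above, the replay yields the occupancies 1, 2, 1, 2, 1, so one link is still in the queue at the last recorded event.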

Future works

rubensworks commented 3 months ago

Before I review in detail, could you @RubenEschauzier first have a look at this as well? This feels somewhat related to your prioritization work.

constraintAutomaton commented 3 months ago

I don't know why the CI doesn't pass; on my end it does, even the SolidBench test, which is the one that is failing.

rubensworks commented 3 months ago

> I don't know why the CI doesn't pass; on my end it does, even the SolidBench test, which is the one that is failing.

The failure looks unrelated to your PR.

rubensworks commented 3 months ago

Another thought I have that would simplify this PR: Instead of writing to file, we could just write to the logger (e.g. at TRACE level). An external script could then be used to filter the log messages related to the link queue. This would make this PR work in the browser as well.

RubenEschauzier commented 3 months ago

Looks pretty good, it is a much more refined version of what I used for the link queue analysis paper. I'm unsure if it has much to do with prioritization, as (as far as I can tell) it is mainly for analyzing query execution in hindsight.

constraintAutomaton commented 3 months ago

> Looks pretty good, it is a much more refined version of what I used for the link queue analysis paper. I'm unsure if it has much to do with prioritization, as (as far as I can tell) it is mainly for analyzing query execution in hindsight.

I think it is the part about labeling the links with the reachability criteria. At least for my work, I plan to use those labels to prioritize some links.
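As a rough illustration of how such labels could drive prioritization, the sketch below orders links by a per-label rank. The `LabeledLink` shape, the `rankOf` table, its rank values, and the `prioritize` helper are all hypothetical; only the label names are taken from the example output, and this is not how the PR or Comunica orders the queue.

```typescript
// Hypothetical shape of a link carrying its reachability label.
interface LabeledLink {
  url: string;
  reachability_criteria: string | null;
}

// Hypothetical ranks: a lower rank means "handle earlier". The values are arbitrary.
const rankOf = new Map<string, number>([
  ["cSolidStorage", 0],
  ["cMatch", 1],
  ["cCommon", 2],
]);

// Return a new array with the links ordered by the rank of their label.
function prioritize(links: LabeledLink[]): LabeledLink[] {
  const rank = (link: LabeledLink): number =>
    link.reachability_criteria === null
      ? Number.MAX_SAFE_INTEGER // unlabeled links go last
      : rankOf.get(link.reachability_criteria) ?? Number.MAX_SAFE_INTEGER;
  // Copy before sorting so the caller's array is left untouched.
  return [...links].sort((a, b) => rank(a) - rank(b));
}
```

A static rank table keeps the example short; an actual prioritization strategy would likely depend on the query and on more than the label alone.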

constraintAutomaton commented 3 months ago

> Another thought I have that would simplify this PR: Instead of writing to file, we could just write to the logger (e.g. at TRACE level). An external script could then be used to filter the log messages related to the link queue. This would make this PR work in the browser as well.

Yes, I think that is much better! I will also be able to provide the information in a streaming manner (instead of dumping it all at every step, although I think I will still provide the option to do so). I will make another repo to parse this information into a file, because that makes my own investigation and debugging easier.
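A minimal sketch of the kind of external filtering script discussed above. It assumes, hypothetically, that the logger emits one JSON object per line and that link queue entries carry a `message` field equal to `"link queue"` with the event stored under `data`. Neither that marker nor the field names come from this PR or from Comunica's loggers; they would have to be adapted to the actual log format.

```typescript
// Hypothetical shape of a link queue event found in the log.
interface LinkQueueEvent {
  type: "pushed" | "popped";
  url: string;
  timestamp: number;
}

// Hypothetical marker identifying link queue log entries.
const LINK_QUEUE_MARKER = "link queue";

// Keep only the well-formed link queue events from newline-delimited JSON logs.
function extractLinkQueueEvents(logText: string): LinkQueueEvent[] {
  const events: LinkQueueEvent[] = [];
  for (const line of logText.split("\n")) {
    let entry: unknown;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // Skip lines that are not JSON, such as blank lines.
    }
    if (typeof entry !== "object" || entry === null) {
      continue;
    }
    const record = entry as { message?: unknown; data?: unknown };
    if (record.message !== LINK_QUEUE_MARKER || typeof record.data !== "object" || record.data === null) {
      continue;
    }
    const data = record.data as Partial<LinkQueueEvent>;
    if ((data.type === "pushed" || data.type === "popped") &&
        typeof data.url === "string" && typeof data.timestamp === "number") {
      events.push({ type: data.type, url: data.url, timestamp: data.timestamp });
    }
  }
  return events;
}
```

Because each line is handled independently, the same filter could also be applied to a stream of log lines rather than to a complete dump.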