RDF Dataset Fragmenter

This tool takes one or more datasets as input, and fragments them into multiple smaller datasets as output, based on the selected fragmentation strategy.

It reads and writes datasets in a streaming manner, so it can handle datasets that are larger than memory.

Installation

$ npm install -g rdf-dataset-fragmenter

or

$ yarn global add rdf-dataset-fragmenter

Usage

Invoke from the command line

This tool can be used on the command line as rdf-dataset-fragmenter, which takes a single parameter: the path to a config file.

$ rdf-dataset-fragmenter path/to/config.json

Config file

The config file that should be passed to the command line tool has the following JSON structure:

{
  "@context": "https://linkedsoftwaredependencies.org/bundles/npm/rdf-dataset-fragmenter/^2.0.0/components/context.jsonld",
  "@id": "urn:rdf-dataset-fragmenter:default",
  "@type": "Fragmenter",
  "quadSource": {
    "@type": "QuadSourceFile",
    "filePath": "path/to/dataset.ttl"
  },
  "fragmentationStrategy": {
    "@type": "FragmentationStrategySubject"
  },
  "quadSink": {
    "@type": "QuadSinkFile",
    "log": true,
    "outputFormat": "application/n-quads",
    "iriToPath": {
      "http://example.org/base/": "output/base/",
      "http://example.org/other/": "output/other/"
    }
  }
}

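The important parts in this config file are "quadSource" (where quads are read from), "fragmentationStrategy" (how quads are assigned to documents), and "quadSink" (where the fragmented quads are written).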

In this example, the config file will read from the "path/to/dataset.ttl" file, employ subject-based fragmentation, and will write into files in the "output/" directory. For example, the triple <http://example.org/base/ex1> a <ex:thing> will be saved into the file output/base/ex1, while the triple <http://example.org/other/ex2> a <ex:thing> will be saved into the file output/other/ex2.
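
The iriToPath mapping in the example above can be pictured as a prefix replacement. The sketch below is an illustration only, not the tool's actual implementation: iriToFilePath is a hypothetical helper, and it assumes that the first matching prefix wins and that no file extension is appended.

```typescript
// Hypothetical helper illustrating the iriToPath prefix mapping.
// Assumption: the first matching IRI prefix wins and nothing else is appended.
function iriToFilePath(iri: string, iriToPath: Record<string, string>): string | undefined {
  for (const [prefix, path] of Object.entries(iriToPath)) {
    if (iri.startsWith(prefix)) {
      // Swap the IRI prefix for the local directory, keep the remainder.
      return path + iri.slice(prefix.length);
    }
  }
  // No prefix matched: this sketch leaves the IRI unmapped.
  return undefined;
}

const exampleMapping: Record<string, string> = {
  "http://example.org/base/": "output/base/",
  "http://example.org/other/": "output/other/",
};

console.log(iriToFilePath("http://example.org/base/ex1", exampleMapping)); // output/base/ex1
console.log(iriToFilePath("http://example.org/other/ex2", exampleMapping)); // output/other/ex2
```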

The available configuration components will be explained in more detail hereafter.

Configure

Quad Sources

A quad source is able to provide a stream of quads as input to the fragmentation process.

File Quad Source

A file quad source takes as parameter the path to a local RDF file.

{
  "quadSource": {
    "@type": "QuadSourceFile",
    "filePath": "path/to/dataset.ttl"
  }
}

Composite Quad Source

A composite quad source allows you to read from multiple quad sources in parallel.

{
  "quadSource": {
    "@type": "QuadSourceComposite",
    "sources": [
      {
        "@type": "QuadSourceFile",
        "filePath": "path/to/dataset1.ttl"
      },
      {
        "@type": "QuadSourceFile",
        "filePath": "path/to/dataset2.ttl"
      },
      {
        "@type": "QuadSourceFile",
        "filePath": "path/to/dataset3.ttl"
      }
    ]
  }
}

Fragmentation Strategy

A fragmentation strategy fragments a stream of quads into different documents. Concretely, it takes quads from the source, and pipes them into a quad sink.

Subject Fragmentation Strategy

A fragmentation strategy that places quads into their subject's document.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategySubject"
  }
}

Optionally, the relativePath property can be used to define a relative IRI that should be applied to the subject IRI before determining its document. This will not change the quad, only the document IRI.
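
As a rough illustration of subject-based fragmentation, the sketch below derives a document IRI by dropping the fragment from the subject IRI. This is an assumption about the behavior rather than the tool's actual code: subjectToDocumentIri is a hypothetical helper, and relativePath is not modeled here.

```typescript
// Hypothetical helper: derive a document IRI from a quad's subject IRI.
// Assumption: the document is the subject IRI without its #fragment.
function subjectToDocumentIri(subjectIri: string): string {
  const hashIndex = subjectIri.indexOf("#");
  return hashIndex === -1 ? subjectIri : subjectIri.slice(0, hashIndex);
}

console.log(subjectToDocumentIri("http://example.org/base/ex1")); // http://example.org/base/ex1
console.log(subjectToDocumentIri("http://example.org/base/profile#me")); // http://example.org/base/profile
```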

Object Fragmentation Strategy

A fragmentation strategy that places quads into their object's document.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategyObject"
  }
}

Composite Fragmentation Strategy

A fragmentation strategy that combines multiple strategies. This means that all the given strategies will be executed in parallel.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategyComposite",
    "strategies": [
      { "@type": "FragmentationStrategySubject" },
      { "@type": "FragmentationStrategyObject" }
    ]
  }
}

Resource Object Fragmentation Strategy

A fragmentation strategy that groups triples by (subject) resources, and places quads into the document identified by the given predicate value.

Blank nodes are not supported.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategyResourceObject",
    "targetPredicateRegex": "vocabulary/hasMaliciousCreator$"
  }
}

Exception Fragmentation Strategy

A fragmentation strategy that delegates quads to a base strategy, but allows defining exceptions that should be delegated to other strategies. These exceptions are defined in terms of a matcher (e.g. match by quad predicate).

The following config uses the subject-based strategy for everything, except for predicate1 and predicate2 that will be delegated to the object-based strategy.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategyException",
    "strategy": {
      "@type": "FragmentationStrategySubject"
    },
    "exceptions": [
      {
        "@type": "FragmentationStrategyExceptionEntry",
        "matcher": {
          "@type": "QuadMatcherPredicate",
          "predicateRegex": "vocabulary/predicate1"
        },
        "strategy": {
          "@type": "FragmentationStrategyObject"
        }
      },
      {
        "@type": "FragmentationStrategyExceptionEntry",
        "matcher": {
          "@type": "QuadMatcherPredicate",
          "predicateRegex": "vocabulary/predicate2"
        },
        "strategy": {
          "@type": "FragmentationStrategyObject"
        }
      }
    ]
  }
}
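
The delegation in the config above can be sketched as follows. This is a simplified model and not the tool's code: all type and function names are hypothetical, a strategy is reduced to a function that returns a document IRI, and the sketch assumes that the first matching exception wins.

```typescript
// Simplified stand-ins for the tool's types; all names here are hypothetical.
interface SketchQuad {
  subject: string;
  predicate: string;
  object: string;
}

// A strategy is reduced to a function that picks a document IRI for a quad.
type SketchStrategy = (quad: SketchQuad) => string;

interface SketchExceptionEntry {
  predicateRegex: RegExp;
  strategy: SketchStrategy;
}

const bySubject: SketchStrategy = (quad) => quad.subject;
const byObject: SketchStrategy = (quad) => quad.object;

// Assumption: the first matching exception wins; otherwise the base strategy applies.
function selectDocument(
  quad: SketchQuad,
  base: SketchStrategy,
  exceptions: SketchExceptionEntry[],
): string {
  for (const entry of exceptions) {
    if (entry.predicateRegex.test(quad.predicate)) {
      return entry.strategy(quad);
    }
  }
  return base(quad);
}

const exceptionEntries: SketchExceptionEntry[] = [
  { predicateRegex: /vocabulary\/predicate1/, strategy: byObject },
  { predicateRegex: /vocabulary\/predicate2/, strategy: byObject },
];

const exceptionalQuad: SketchQuad = {
  subject: "http://example.org/s",
  predicate: "http://example.org/vocabulary/predicate1",
  object: "http://example.org/o",
};
const regularQuad: SketchQuad = {
  subject: "http://example.org/s",
  predicate: "http://example.org/vocabulary/name",
  object: "http://example.org/o",
};

console.log(selectDocument(exceptionalQuad, bySubject, exceptionEntries)); // http://example.org/o
console.log(selectDocument(regularQuad, bySubject, exceptionEntries)); // http://example.org/s
```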

Constant Fragmentation Strategy

A fragmentation strategy that directs all quads to a single path.

{
  "fragmentationStrategy": {
    "@type": "FragmentationConstant",
    "path": "http://localhost:3000/datadump"
  }
}

VoID Description Fragmentation Strategy

A fragmentation strategy that generates partial dataset descriptions using the standard VoID vocabulary. The dataset URIs are determined based on quad subject values using regular expressions.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategyDatasetSummaryVoID",
    "datasetPatterns": [
      "^(.*\\/pods\\/[0-9]+\\/)"
    ]
  }
}
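
To show how a pattern such as the one above could select a dataset URI, the sketch below applies it to a subject IRI. It assumes that the first capture group of the first matching pattern is taken as the dataset URI. That is an interpretation of the config, not confirmed behavior, and datasetIriFor and the sample IRI are hypothetical.

```typescript
// Hypothetical helper: find the dataset URI for a quad subject.
// Assumption: the first capture group of the first matching pattern is the dataset URI.
function datasetIriFor(subjectIri: string, datasetPatterns: string[]): string | undefined {
  for (const pattern of datasetPatterns) {
    const match = new RegExp(pattern).exec(subjectIri);
    if (match && match[1] !== undefined) {
      return match[1];
    }
  }
  return undefined;
}

// Same pattern as in the config above (JSON escaping and string escaping coincide here).
const voidPatterns = ["^(.*\\/pods\\/[0-9]+\\/)"];

console.log(datasetIriFor("http://localhost:3000/pods/0001/posts/2010-01-01#42", voidPatterns));
// http://localhost:3000/pods/0001/
```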

Bloom Filter Fragmentation Strategy

A fragmentation strategy that generates Bloom filters to capture co-occurrence of resources and properties, using the custom membership filter vocabulary. The filters are generated per dataset, where the dataset URI is determined based on quad subject values using regular expressions. After generation, the summaries can be re-mapped to a different document URI.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategyDatasetSummaryBloom",
    "hashBits": 256,
    "hashCount": 4,
    "datasetPatterns": [
      "^(.+\\/pods\\/[0-9]+\\/)"
    ],
    "locationPatterns": [
      "^(.+\\/pods\\/[0-9]+\\/)"
    ]
  }
}
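
For readers unfamiliar with Bloom filters, the sketch below shows the general technique with parameters named after hashBits and hashCount. It is a generic, minimal Bloom filter and not the tool's implementation: the class name is hypothetical, and the seeded FNV-1a style hash is an illustrative choice that may differ from the hash functions the tool uses.

```typescript
// FNV-1a style string hash with a seed; an illustrative choice, not the tool's hash function.
function seededHash(value: string, seed: number): number {
  let hash = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < value.length; i++) {
    hash ^= value.charCodeAt(i);
    hash = Math.imul(hash, 16777619) >>> 0;
  }
  return hash;
}

// Hypothetical minimal Bloom filter: hashBits bit positions, hashCount hashes per value.
class SketchBloomFilter {
  private readonly bits: Uint8Array;
  private readonly hashBits: number;
  private readonly hashCount: number;

  constructor(hashBits: number, hashCount: number) {
    this.hashBits = hashBits;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(hashBits / 8));
  }

  // Compute one bit position per hash function (here: per seed).
  private positions(value: string): number[] {
    const result: number[] = [];
    for (let seed = 0; seed < this.hashCount; seed++) {
      result.push(seededHash(value, seed) % this.hashBits);
    }
    return result;
  }

  add(value: string): void {
    for (const position of this.positions(value)) {
      this.bits[position >> 3] |= 1 << (position & 7);
    }
  }

  // False means "definitely absent"; true means "possibly present".
  mightContain(value: string): boolean {
    return this.positions(value).every(
      (position) => (this.bits[position >> 3] & (1 << (position & 7))) !== 0,
    );
  }
}

const podFilter = new SketchBloomFilter(256, 4);
podFilter.add("http://example.org/vocabulary/hasCreator");
console.log(podFilter.mightContain("http://example.org/vocabulary/hasCreator")); // true
```

A Bloom filter never reports an added value as absent, but it can report a value that was never added as possibly present, so larger hashBits values reduce false positives at the cost of bigger summaries.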

Quad Sinks

A quad sink is able to direct a stream of quads as output from the fragmentation process.

File Quad Sink

A quad sink that writes to files using an IRI to local file system path mapping.

{
  "quadSink": {
    "@type": "QuadSinkFile",
    "log": true,
    "outputFormat": "application/n-quads",
    "fileExtension": "$.nq",
    "iriToPath": {
      "http://example.org/base/": "output/base/",
      "http://example.org/other/": "output/other/"
    }
  }
}

Options:

HDT Quad Sink

A quad sink that writes to files using an IRI to local file system path mapping and then converts the files into an HDT document. The implementation uses the Docker image HDT-Docker of the hdt-cpp library. The sink itself performs the Docker operations needed to acquire the image and run the conversion to HDT.

WARNING: Can be very slow for many files

{
  "quadSink": {
    "@type": "QuadSinkHdt",
    "log": true,
    "outputFormat": "application/n-quads",
    "fileExtension": "$.nq",
    "iriToPath": {
      "http://example.org/base/": "output/base/",
      "http://example.org/other/": "output/other/"
    },
    "poolSize": 1,
    "deleteSourceFiles": false,
    "errorFileDockerRfdhdt": "./error_log_docker_rfdhdt.txt"
  }
}

Options:

Composite Quad Sink

A quad sink that combines multiple quad sinks.

{
  "quadSink": {
    "@type": "QuadSinkComposite",
    "sinks": [
      {
        "@type": "QuadSinkFile",
        "log": true,
        "outputFormat": "application/n-quads",
        "fileExtension": "$.nq",
        "iriToPath": {
          "http://example.org/base/": "output/base/",
          "http://example.org/other/": "output/other/"
        }
      },
      {
        "@type": "QuadSinkFile",
        "log": true,
        "outputFormat": "application/n-quads",
        "fileExtension": "$.nq2",
        "iriToPath": {
          "http://example.org/base/": "output-2/base/",
          "http://example.org/other/": "output-2/other/"
        }
      }
    ]
  }
}

Options:

Filtered Quad Sink

A quad sink that wraps another quad sink and only passes through the quads that match the given filter.

{
  "quadSink": {
    "@type": "QuadSinkFiltered",
    "filter": {
      "@type": "QuadMatcherResourceType",
      "typeRegex": "vocabulary/Person$",
      "matchFullResource": false
    },
    "sink": [
      {
        "@type": "QuadSinkFile",
        "log": true,
        "outputFormat": "application/n-quads",
        "fileExtension": "$.nq",
        "iriToPath": {
          "http://example.org/base/": "output/base/",
          "http://example.org/other/": "output/other/"
        }
      }
    ]
  }
}

Options:

CSV Quad Sink

A quad sink that writes quads to a CSV file.

{
  "quadSink": {
    "@type": "QuadSinkCsv",
    "file": "../rdf-dataset-fragmenter-out/output-solid/aux/parameters-comments.csv",
    "columns": [
      "subject"
    ]
  }
}

Options:

Quad Transformers

Quad transformers are optional. A quad transformer can transform a stream of quads into another stream of quads.

Distinct Quad Transformer

A quad transformer that wraps another quad transformer and removes duplicates. Only quads that are produced by the wrapped transformer (and differ from the incoming quad) are subject to this filtering.

{
  "transformers": [
    {
      "@type": "QuadTransformerDistinct",
      "transformer": {
        "@type": "QuadTransformerSetIriExtension",
        "extension": "nq",
        "iriPattern": "^http://dbpedia.org"
      }
    }
  ]
}

Options:

Set IRI Extension Quad Transformer

A quad transformer that enforces the configured extension on all named nodes.

{
  "transformers": [
    {
      "@type": "QuadTransformerSetIriExtension",
      "extension": "nq",
      "iriPattern": "^http://dbpedia.org"
    }
  ]
}

Options:

Replace IRI Quad Transformer

A quad transformer that replaces (parts of) IRIs.

{
  "transformers": [
    {
      "@type": "QuadTransformerReplaceIri",
      "searchRegex": "^http://www.ldbc.eu",
      "replacementString": "http://localhost:3000/www.ldbc.eu"
    }
  ]
}

This also supports group-based replacements, where a group can be identified via () in the search regex, and a reference to the group can be made via $ followed by the group number (e.g. $1).

{
  "transformers": [
    {
      "@type": "QuadTransformerReplaceIri",
      "searchRegex": "^http://www.ldbc.eu/data/pers([0-9]*)$",
      "replacementString": "http://www.ldbc.eu/pods/$1/profile/card#me"
    }
  ]
}
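
The group-based replacement above behaves like a regular-expression replace with group references. The sketch below uses JavaScript's built-in String.prototype.replace to illustrate this. Whether the tool applies the regex in exactly this way is an assumption, and replaceIriSketch and the sample person IRI are hypothetical.

```typescript
// Hypothetical helper mirroring searchRegex/replacementString with String.prototype.replace.
// In the replacement string, $1 refers to the first () group of the search regex.
function replaceIriSketch(iri: string, searchRegex: string, replacementString: string): string {
  return iri.replace(new RegExp(searchRegex), replacementString);
}

console.log(
  replaceIriSketch(
    "http://www.ldbc.eu/data/pers0495",
    "^http://www.ldbc.eu/data/pers([0-9]*)$",
    "http://www.ldbc.eu/pods/$1/profile/card#me",
  ),
); // http://www.ldbc.eu/pods/0495/profile/card#me
```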

Options:

Replace and Distribute IRI Quad Transformer

A quad transformer that replaces (parts of) IRIs, deterministically distributing the replacements over a list of multiple destination IRIs based on a matched number.

{
  "transformers": [
    {
      "@type": "QuadTransformerDistributeIri",
      "searchRegex": "^http://www.ldbc.eu/data/pers([0-9]*)$",
      "replacementStrings": [
        "https://a.example.com/users$1/profile/card#me",
        "https://b.example.com/users$1/profile/card#me",
        "https://c.example.com/users$1/profile/card#me",
        "https://d.example.com/users$1/profile/card#me"
      ]
    }
  ]
}

This requires at least one group-based replacement, of which the first group must match a number.

The matched number is used to choose one of the replacementStrings in a deterministic way: replacementStrings[number % replacementStrings.length]
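
The selection rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's code: distributeIriSketch is a hypothetical helper, and it assumes that IRIs without a numeric first group are left unchanged.

```typescript
// Hypothetical helper: pick a replacement string based on the number in the first group.
function distributeIriSketch(
  iri: string,
  searchRegex: string,
  replacementStrings: string[],
): string {
  const regex = new RegExp(searchRegex);
  const match = regex.exec(iri);
  if (!match || match[1] === undefined) {
    return iri; // Assumption: non-matching IRIs are left unchanged.
  }
  const matchedNumber = Number.parseInt(match[1], 10);
  if (Number.isNaN(matchedNumber)) {
    return iri; // The first group did not contain a number.
  }
  // replacementStrings[number % replacementStrings.length]
  const replacement = replacementStrings[matchedNumber % replacementStrings.length];
  if (replacement === undefined) {
    return iri; // Empty list of replacement strings.
  }
  return iri.replace(regex, replacement);
}

const destinations = [
  "https://a.example.com/users$1/profile/card#me",
  "https://b.example.com/users$1/profile/card#me",
  "https://c.example.com/users$1/profile/card#me",
  "https://d.example.com/users$1/profile/card#me",
];

// 6 % 4 = 2, so the third destination is chosen.
console.log(
  distributeIriSketch("http://www.ldbc.eu/data/pers6", "^http://www.ldbc.eu/data/pers([0-9]*)$", destinations),
); // https://c.example.com/users6/profile/card#me
```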

Options:

Replace BlankNode by NamedNode Transformer

A quad transformer that replaces BlankNodes by NamedNodes if the node value changes when performing the search/replacement.

{
  "transformers": [
    {
      "@type": "QuadTransformerBlankToNamed",
      "searchRegex": "^b0_tagclass",
      "replacementString": "http://localhost:3000/www.ldbc.eu/tag"
    }
  ]
}

This supports group-based replacements, just like QuadTransformerReplaceIri.

Options:

Remap Resource Identifier Transformer

A quad transformer that matches all resources of the given type, and rewrites their (subject) IRIs (across all triples) so that they become part of the targeted resource.

For example, a transformer matching on type Post for identifier predicate hasId and target predicate hasCreator will modify all post IRIs to become a hash-based IRI inside the object IRI of hasCreator. Concretely, <ex:post1> a <Post>. <ex:post1> <hasId> '1'. <ex:post1> <hasCreator> <urn:person1> will become <urn:person1#Post1> a <Post>. <urn:person1#Post1> <hasId> '1'. <urn:person1#Post1> <hasCreator> <urn:person1>.

WARNING: This transformer assumes that all the applicable resources have rdf:type occurring as the first triple with the resource IRI as subject.

{
  "transformers": [
    {
      "@type": "QuadTransformerRemapResourceIdentifier",
      "newIdentifierSeparator": "#Post",
      "typeRegex": "vocabulary/Post$",
      "identifierPredicateRegex": "vocabulary/id$",
      "targetPredicateRegex": "vocabulary/hasCreator$"
    }
  ]
}
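
The rewriting from the Post example can be sketched for a single resource as below. This is a simplified model rather than the tool's implementation: all names are hypothetical, predicates are compared by exact string instead of by regex, the new IRI is built by plain concatenation, and occurrences of the old IRI in other resources are not handled.

```typescript
// Simplified triple representation; names are hypothetical.
interface SketchTriple {
  subject: string;
  predicate: string;
  object: string;
}

// Rewrite the subject of all triples of one resource to
// <target object IRI> + separator + <identifier value>.
function remapResourceSubject(
  resourceTriples: SketchTriple[],
  newIdentifierSeparator: string,
  identifierPredicate: string,
  targetPredicate: string,
): SketchTriple[] {
  const identifier = resourceTriples.find((triple) => triple.predicate === identifierPredicate);
  const target = resourceTriples.find((triple) => triple.predicate === targetPredicate);
  if (!identifier || !target) {
    return resourceTriples; // Nothing to remap without both values.
  }
  const newSubject = target.object + newIdentifierSeparator + identifier.object;
  return resourceTriples.map((triple) => ({ ...triple, subject: newSubject }));
}

const postTriples: SketchTriple[] = [
  { subject: "ex:post1", predicate: "rdf:type", object: "Post" },
  { subject: "ex:post1", predicate: "hasId", object: "1" },
  { subject: "ex:post1", predicate: "hasCreator", object: "urn:person1" },
];

for (const triple of remapResourceSubject(postTriples, "#Post", "hasId", "hasCreator")) {
  console.log(triple.subject, triple.predicate, triple.object);
}
// Every subject is now urn:person1#Post1
```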

Optionally, the discovered identifier values can be modified using value modifiers:

{
  "transformers": [
    {
      "@type": "QuadTransformerRemapResourceIdentifier",
      "newIdentifierSeparator": "#Post",
      "typeRegex": "vocabulary/Post$",
      "identifierPredicateRegex": "vocabulary/id$",
      "targetPredicateRegex": "vocabulary/hasCreator$",
      "identifierValueModifier": {
        "@type": "ValueModifierRegexReplaceGroup",
        "regex": "^.*/([^/]*)$"
      }
    }
  ]
}

Options:

Append Quad Transformer

A quad transformer that appends a quad to matching quads (e.g. match by quad predicate).

The example below adds a reversed quad for every quad with the containerOf predicate.

{
  "transformers": [
    {
      "@type": "QuadTransformerAppendQuad",
      "matcher": {
        "@type": "QuadMatcherPredicate",
        "predicateRegex": "vocabulary/containerOf$"
      },
      "subject": {
        "@type": "TermTemplateQuadComponent",
        "component": "object"
      },
      "predicate": {
        "@type": "TermTemplateStaticNamedNode",
        "value": "http://example.org/vocabulary/containedIn"
      },
      "object": {
        "@type": "TermTemplateQuadComponent",
        "component": "subject"
      },
      "graph": {
        "@type": "TermTemplateQuadComponent",
        "component": "graph"
      }
    }
  ]
}
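
The effect of this config can be sketched as follows. The types and the appendReverseQuad helper are hypothetical simplifications, not the tool's API. The sketch assumes that the original quad is kept and that the appended quad swaps subject and object.

```typescript
// Simplified quad representation; names are hypothetical.
interface PlainQuad {
  subject: string;
  predicate: string;
  object: string;
  graph: string;
}

// For a matching quad, emit the original quad plus a quad built from the term templates:
// subject <- object, predicate <- static value, object <- subject, graph <- graph.
function appendReverseQuad(
  quad: PlainQuad,
  predicateRegex: RegExp,
  reversePredicate: string,
): PlainQuad[] {
  if (!predicateRegex.test(quad.predicate)) {
    return [quad];
  }
  return [
    quad,
    {
      subject: quad.object,
      predicate: reversePredicate,
      object: quad.subject,
      graph: quad.graph,
    },
  ];
}

const containerQuad: PlainQuad = {
  subject: "http://example.org/forum1",
  predicate: "http://example.org/vocabulary/containerOf",
  object: "http://example.org/post1",
  graph: "",
};

console.log(
  appendReverseQuad(containerQuad, /vocabulary\/containerOf$/, "http://example.org/vocabulary/containedIn"),
);
```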

Options:

More details on term templates can be found later in this README.

Append Quad Link Transformer

A quad transformer that appends a link to matching quads (e.g. match by quad predicate).

{
  "transformers": [
    {
      "@type": "QuadTransformerAppendQuadLink",
      "matcher": {
        "@type": "QuadMatcherPredicate",
        "predicateRegex": "vocabulary/hasCreator$"
      },
      "predicate": "http://example.org/postsIndex",
      "link": "/posts"
    }
  ]
}

Options:

Append Resource Link Transformer

A quad transformer that matches all resources of the given type, and appends a link.

{
  "transformers": [
    {
      "@type": "QuadTransformerAppendResourceLink",
      "typeRegex": "vocabulary/Person$",
      "predicate": "http://example.org/postsIndex",
      "link": "/posts"
    }
  ]
}

Options:

Append Resource SCL Transformer

A quad transformer that matches all resources of the given type, and appends an ACL policy using the scl:appliesTo and scl:scope predicates.

Example output:

<http://example.org/person> a <http://example.org/vocabulary/Person>.
<http://example.org/person#policy-posts> scl:appliesTo <http://example.org/person>;
                                         scl:scope "---MY POLICY---".

{
  "transformers": [
    {
      "@type": "QuadTransformerAppendResourceScl",
      "typeRegex": "vocabulary/Person$",
      "identifierSuffix": "#policy-posts",
      "sclPolicy": "FOLLOW ?posts { <> <http://www.w3.org/1999/02/22-rdf-syntax-ns#seeAlso> ?posts }"
    }
  ]
}

Options:

Append Resource Solid Type Index

A quad transformer that matches all resources of the given type, and adds an entry for it to the Solid type index. This also includes quads required for the creation of this type index, and its link to the user's profile.

If multiple entries of the same type can be matched, it is recommended to wrap this transformer using QuadTransformerDistinct, since duplicate quads can be produced.

Example output:

Profile:

<http://example.org/profile/card#me> solid:publicTypeIndex <http://example.org/settings/publicTypeIndex> .

Type index:

<http://example.org/settings/publicTypeIndex> a solid:TypeIndex .
<http://example.org/settings/publicTypeIndex> a solid:ListedDocument .
<http://example.org/settings/publicTypeIndex#comments> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> solid:TypeRegistration .
<http://example.org/settings/publicTypeIndex#comments> solid:forClass <http://example.org/vocabulary/Comment> .
<http://example.org/settings/publicTypeIndex#comments> solid:instanceContainer <http://example.org/comments/> .

{
  "transformers": [
    {
      "@type": "QuadTransformerAppendResourceSolidTypeIndex",
      "typeRegex": "vocabulary/Comment$",
      "profilePredicateRegex": "vocabulary/hasCreator$",
      "typeIndex": "../settings/publicTypeIndex",
      "entrySuffix": "#comments",
      "entryReference": "../comments/",
      "entryContainer": "true"
    }
  ]
}

Options:

Composite Sequential Transformer

Executes a collection of transformers in sequence.

This is mainly useful when you want to group transformers together within another composite transformer.

{
  "transformers": [
    {
      "@type": "QuadTransformerCompositeSequential",
      "transformers": [
        {
          "@type": "QuadTransformerSetIriExtension",
          "extension": "nq",
          "iriPattern": "^http://dbpedia.org"
        },
        {
          "@type": "QuadTransformerSetIriExtension",
          "extension": "ttl",
          "iriPattern": "^http://something.org"
        }
      ]
    }
  ]
}

Options:

Composite Varying Resource Transformer

A quad transformer that wraps other quad transformers, and varies between them based on the configured resource type.

Concretely, it will match all resources of the given type, and evenly distribute these resources to the different quad transformers. It will make sure that the different triples of a given resource remain coupled.

WARNING: This transformer assumes that all the applicable resources have rdf:type occurring as the first triple with the resource IRI as subject.

{
  "transformers": [
    {
      "@type": "QuadTransformerCompositeVaryingResource",
      "typeRegex": "vocabulary/Post$",
      "targetPredicateRegex": "vocabulary/hasCreator$",
      "transformers": [
        {
          "@type": "QuadTransformerRemapResourceIdentifier",
          "newIdentifierSeparator": "../posts/",
          "typeRegex": "vocabulary/Post$",
          "identifierPredicateRegex": "vocabulary/id$",
          "targetPredicateRegex": "vocabulary/hasCreator$"
        },
        {
          "@type": "QuadTransformerRemapResourceIdentifier",
          "newIdentifierSeparator": "../posts#",
          "typeRegex": "vocabulary/Post$",
          "identifierPredicateRegex": "vocabulary/id$",
          "targetPredicateRegex": "vocabulary/hasCreator$"
        }
      ]
    }
  ]
}

Options:

Quad Matchers

Different strategies for matching quads. These matchers can for example be used for QuadTransformerAppendQuadLink or FragmentationStrategyExceptionEntry.

Predicate Matcher

Matches a quad by the given predicate regex.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategyException",
    "strategy": {
      "@type": "FragmentationStrategySubject"
    },
    "exceptions": [
      {
        "@type": "FragmentationStrategyExceptionEntry",
        "matcher": {
          "@type": "QuadMatcherPredicate",
          "predicateRegex": "vocabulary/predicate1"
        },
        "strategy": {
          "@type": "FragmentationStrategyObject"
        }
      }
    ]
  }
}

Resource Type Matcher

A quad matcher that matches all resources of the given type.

Blank nodes are not supported.

WARNING: This matcher assumes that all the applicable resources have rdf:type occurring as the first triple with the resource IRI as subject.

{
  "fragmentationStrategy": {
    "@type": "FragmentationStrategyException",
    "strategy": {
      "@type": "FragmentationStrategySubject"
    },
    "exceptions": [
      {
        "@type": "FragmentationStrategyExceptionEntry",
        "matcher": {
          "@type": "QuadMatcherResourceType",
          "typeRegex": "vocabulary/Person$"
        },
        "strategy": {
          "@type": "FragmentationStrategyResourceObject",
          "targetPredicateRegex": "vocabulary/hasMaliciousCreator$"
        }
      }
    ]
  }
}

Options:

Value modifiers

Different strategies for modifying RDF term values. These modifiers could for example be used in QuadTransformerRemapResourceIdentifier.

Regex Replace Group Value Modifier

A value modifier that applies the given regex on the value and replaces it with the first group match.

{
  "@type": "ValueModifierRegexReplaceGroup",
  "regex": "^.*/([^/]*)$"
}
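
As an illustration, the sketch below applies the regex from the config above to a sample value. The applyFirstGroup helper and the sample IRI are hypothetical, and leaving non-matching values unchanged is an assumption.

```typescript
// Hypothetical helper: replace a value by the first group matched by the regex.
// Assumption: values that do not match are returned unchanged.
function applyFirstGroup(value: string, regex: string): string {
  const match = new RegExp(regex).exec(value);
  return match && match[1] !== undefined ? match[1] : value;
}

// "^.*/([^/]*)$" keeps only the part after the last slash.
console.log(applyFirstGroup("http://www.ldbc.eu/data/post123", "^.*/([^/]*)$")); // post123
```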

Term templates

Different templates for deriving a quad component from an incoming quad. These templates could for example be used in QuadTransformerAppendQuad.

Quad Component

A term template that returns a given quad's component.

The example below refers to the object of a quad.

{
  "@type": "TermTemplateQuadComponent",
  "component": "object"
}

Options:

Static Named Node

A term template that always returns a Named Node with the given value.

{
  "@type": "TermTemplateStaticNamedNode",
  "value": "http://localhost:3000/www.ldbc.eu/ldbc_socialnet/1.0/vocabulary/containedIn"
}

Options:

Extend

This tool has been created with extensibility in mind. After forking and cloning this repo and running npm install or yarn install, you can create new components inside the lib/ directory.

The following TypeScript interfaces are available for implementing new components:

If you want to use your newly created component, make sure to execute npm run build or yarn run build. After that, you can include your components in your config file by referring to them from @type with their class name.

License

This software is written by Ruben Taelman.

This code is released under the MIT license.