RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Establish clear provenance for all non-original test data. #1840

Open aucampia opened 2 years ago

aucampia commented 2 years ago

All non-original test data should have clear provenance so that we know we are testing the right things; this is in part to mitigate problems like this. The best way to establish provenance is to download test data programmatically, make it possible to re-download the test data as part of our test run, and then ensure that it has not changed.
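A minimal sketch of the "re-download and ensure it has not changed" check, using only the standard library. The helper names (`sha256_hex`, `fetch`, `has_changed`) are hypothetical and not part of rdflib; the `Accept` header mirrors what a content-negotiated vocabulary download would likely need.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()


def fetch(url: str, accept: str = "text/turtle") -> bytes:
    """Download url, asking the server for the given media type.

    Hypothetical helper: urllib follows redirects by default.
    """
    request = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(request) as response:
        return response.read()


def has_changed(local_path: Path, fresh: bytes) -> bool:
    """Report whether freshly downloaded bytes differ from the file on disk.

    A missing local file counts as changed.
    """
    if not local_path.exists():
        return True
    return sha256_hex(local_path.read_bytes()) != sha256_hex(fresh)
```

A test run could then call something like `has_changed(Path("rdfs.ttl"), fetch("http://www.w3.org/2000/01/rdf-schema#"))` and fail if it returns `True`. Whether such a check should run on every test run or only on demand is a design choice, since it depends on network access.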

It would be good to solve this before adding more test data.

aucampia commented 2 years ago

I started working on a Makefile for this, but I think doing this from Python may be more sensible: people working on this library likely know Python better than GNU Make, and Python is much more portable and less quirky than GNU Make.

# This file exists mainly to declaratively establish the provenance of test data.
# Running this file with `make -B all` should re-download all test data with established provenance and should result in no changes to the files on disk.

all:

all: rdfs.ttl
rdfs.ttl:
    curl -L --header "Accept: text/turtle" http://www.w3.org/2000/01/rdf-schema# > $(@)

all: defined_namespaces/qb.ttl
defined_namespaces/qb.ttl:
    curl -L --header "Accept: text/turtle" http://purl.org/linked-data/cube > $(@)

all: suites/w3c/turtle/README
suites/w3c/turtle/README:
    rm -vr $(dir $(@)) || true
    mkdir -vp $(dir $(@))
    curl https://www.w3.org/2013/TurtleTests/TESTS.tar.gz | tar -zxvf - --strip-components=1 -C $(dir $(@))

all: suites/w3c/nquads/README
suites/w3c/nquads/README:
    rm -vr $(dir $(@)) || true
    mkdir -vp $(dir $(@))
    curl https://www.w3.org/2013/N-QuadsTests/TESTS.tar.gz | tar -zxvf - --strip-components=1 -C $(dir $(@))

all: suites/w3c/ntriples/README
suites/w3c/ntriples/README:
    rm -vr $(dir $(@)) || true
    mkdir -vp $(dir $(@))
    curl https://www.w3.org/2013/N-TriplesTests/TESTS.tar.gz | tar -zxvf - --strip-components=1 -C $(dir $(@))

all: suites/w3c/trig/README
suites/w3c/trig/README:
    rm -vr $(dir $(@)) || true
    mkdir -vp $(dir $(@))
    curl https://www.w3.org/2013/TrigTests/TESTS.tar.gz | tar -zxvf - --strip-components=1 -C $(dir $(@))

# TODO FIXME: This directory contains additional files that should be removed:
# - Manifest.rdf
# - datatypes/test001.borked
all: suites/w3c/rdfxml/README
suites/w3c/rdfxml/README:
    rm -vr $(dir $(@)) || true
    mkdir -vp $(dir $(@))
    curl https://www.w3.org/2013/RDFXMLTests/TESTS.tar.gz | tar -zxvf - --strip-components=1 -C $(dir $(@))

# TODO FIXME: This directory contains differences from upstream, it seems to be from an older source.
all: suites/DAWG/data-sparql11/manifest-all.ttl
suites/DAWG/data-sparql11/manifest-all.ttl:
    rm -vr $(dir $(@)) || true
    mkdir -vp $(dir $(@))
    curl https://www.w3.org/2009/sparql/docs/tests/sparql11-test-suite-20121023.tar.gz \
        | tar -zxvf - --strip-components=1 -C $(dir $(@))
    find $(dir $(@)) -type f -print0 | xargs -0 chmod -v 644
    find $(dir $(@)) -type f -print0 | xargs -0 dos2unix
    find $(dir $(@)) -type d -print0 | xargs -0 chmod -v 755

aucampia commented 2 years ago

I'm working on this as part of https://github.com/RDFLib/rdflib/issues/1807 and https://github.com/RDFLib/rdflib/issues/1701, as I want to download N3 test data from https://github.com/w3c/N3/tree/master/tests. I will write it in Python; it may be slightly more verbose than a Makefile, but Makefiles have their own host of problems.
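As a rough idea of what the Python version could look like, here is a sketch that declares archive sources as data and extracts them with the equivalent of `tar --strip-components=1`. It is not the eventual implementation: `ArchiveResource`, `RESOURCES`, `stripped_name` and `extract_archive` are hypothetical names, and only the URLs and target directories come from the Makefile above.

```python
import io
import tarfile
from dataclasses import dataclass
from pathlib import Path, PurePosixPath
from typing import List, Optional


@dataclass(frozen=True)
class ArchiveResource:
    """Declares where a directory of test data comes from (hypothetical type)."""

    url: str
    local_dir: str
    strip_components: int = 1


# URLs and directories taken from the Makefile; the list is illustrative, not complete.
RESOURCES = [
    ArchiveResource("https://www.w3.org/2013/TurtleTests/TESTS.tar.gz", "suites/w3c/turtle"),
    ArchiveResource("https://www.w3.org/2013/N-QuadsTests/TESTS.tar.gz", "suites/w3c/nquads"),
    ArchiveResource("https://www.w3.org/2013/TrigTests/TESTS.tar.gz", "suites/w3c/trig"),
]


def stripped_name(name: str, strip: int) -> Optional[str]:
    """Drop the first `strip` path components; return None if nothing is left."""
    parts = PurePosixPath(name).parts
    if len(parts) <= strip:
        return None
    remaining = parts[strip:]
    if ".." in remaining:
        # Refuse names that could escape the destination directory.
        return None
    return "/".join(remaining)


def extract_archive(data: bytes, dest: Path, strip: int = 1) -> List[str]:
    """Extract the regular files of a tar archive into dest.

    Returns the sorted relative names of the files that were written.
    """
    written = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # directories are created on demand; links are skipped
            name = stripped_name(member.name, strip)
            if name is None:
                continue
            target = dest / name
            target.parent.mkdir(parents=True, exist_ok=True)
            source = archive.extractfile(member)
            if source is None:
                continue
            target.write_bytes(source.read())
            written.append(name)
    return sorted(written)
```

The archive bytes would come from a download step (for example `urllib.request.urlopen(resource.url).read()`), after which `extract_archive(data, Path(resource.local_dir), resource.strip_components)` repopulates the directory. Unlike the Makefile recipes, this sketch does not remove the existing directory first, and it does not reproduce the `chmod` and `dos2unix` normalisation used for the DAWG suite.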