Open docuracy opened 3 months ago
The current code for validation and database-insertion of JSON datasets spans 118 lines; by contrast, the code for delimited text datasets (`ds_insert_delim` and its helper functions) spans 475 lines and is commensurately more difficult to maintain and extend.
The `pandas` library (already in use) suggests a way to simplify not only the codebase but also the preparation of datasets, with its `json_normalize` method. Using it with recursion into nested structures, LPF JSON can be converted to delimited text reversibly and without any loss of data. The entire LPF JSON feature example given at https://github.com/LinkedPasts/linked-places-format could, for example, be transformed into the following 80 flattened columns:
type | @id | properties.title | properties.ccodes.0 | properties.fclasses.0 | when.timespans.0.start.in | when.timespans.0.end.in | when.periods.0.name | when.periods.0.@id | when.periods.1.name | when.periods.1.@id | when.label | when.duration | when.certainty | names.0.toponym | names.0.lang | names.0.citations.0.label | names.0.citations.0.year | names.0.citations.0.@id | names.0.when.timespans.0.start.in | names.1.toponym | names.1.lang | names.1.when.timespans.0.start.in | names.1.when.certainty | types.0.identifier | types.0.label | types.0.sourceLabels.0.label | types.0.sourceLabels.0.lang | types.0.when.timespans.0.start.in | geometry.type | geometry.geometries.0.type | geometry.geometries.0.coordinates.0 | geometry.geometries.0.coordinates.1 | geometry.geometries.0.when.timespans.0.start.in | geometry.geometries.0.when.timespans.0.end.in | geometry.geometries.0.citations.0.label | geometry.geometries.0.citations.0.@id | geometry.geometries.0.certainty | geometry.geometries.1.type | geometry.geometries.1.coordinates.0 | geometry.geometries.1.coordinates.1 | geometry.geometries.1.geowkt | geometry.geometries.1.when.timespans.0.start.in | geometry.geometries.1.certainty | links.0.type | links.0.identifier | links.1.type | links.1.identifier | links.2.type | links.2.identifier | links.3.type | links.3.identifier | links.4.type | links.4.identifier | links.5.type | links.5.identifier | relations.0.relationType | relations.0.relationTo | relations.0.label | relations.0.when.timespans.0.start.in | relations.0.when.timespans.0.end.in | relations.1.relationType | relations.1.relationTo | relations.1.label | relations.1.when.timespans.0.start.in | relations.2.relationType | relations.2.relationTo | relations.2.label | relations.2.when.timespans.0.start.in | relations.2.citations.0.label | relations.2.citations.0.year | relations.2.citations.0.@id | relations.2.certainty | descriptions.0.@id | descriptions.0.value | descriptions.0.lang | depictions.0.@id | depictions.0.title | depictions.0.license
The format is extensible for larger arrays (but unlike LPF JSON it does not accommodate metadata, which would need to be provided separately, as is already the case for LP-TSV). Most datasets would probably use considerably fewer columns, and any delimited text using this column-naming convention could be converted to LPF JSON. WHG users could submit flattened LPF (f-LPF?) as an alternative to LP-TSV. It could be useful to allow JSON objects in uploads to reduce the number of required columns; citations could be uploaded as delimited text in the form property | value (following a template).
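As a rough illustration of the flattening described above: `pd.json_normalize` flattens nested dicts but not arrays, so a small recursive helper is needed to produce the dotted-and-indexed column names. This is a minimal sketch, not WHG code, and the sample feature is a cut-down stand-in for a full LPF record:

```python
import pandas as pd

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts and lists into dotted keys
    like names.0.toponym (empty containers are dropped in this sketch)."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}{i}."))
    else:
        out[prefix[:-1]] = obj
    return out

feature = {
    "@id": "http://example.org/places/1",
    "properties": {"title": "Abingdon", "ccodes": ["GB"]},
    "names": [{"toponym": "Abingdon", "lang": "en"}],
}
df = pd.DataFrame([flatten(f) for f in [feature]])
print(list(df.columns))
# ['@id', 'properties.title', 'properties.ccodes.0', 'names.0.toponym', 'names.0.lang']
```

Because the index and key are both preserved in each column name, the transformation can be reversed without loss, which is what makes the round trip to f-LPF possible.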
The processing pipeline for an uploaded dataset could then be simplified, as below. Chunking and streaming must be used where indicated in order to accommodate the possibility of very large files which would compromise memory and performance of the server. Both validation and insertion ought to run as celery tasks, reporting progress back to the browser's spinner label both to avoid timeout errors and to provide reassurance to users.
- Map LP-TSV shorthand columns (e.g. `start` -> `when.timespans.0.start.in`) before continuing...
- `ds_insert_json`: wrap inserts in a `transaction.atomic` block with `bulk_create`: does this resolve the intermittent "duplicate id on insert" error, more evident on large datasets?

Such a pipeline would readily allow for any extension of the LPF standard, such as the requirement for `fclasses` in default of `types`, by a modification of the JSON schema (and perhaps of any related Django data models). The `fclasses` need a definition, and a "oneOf" enforcement in conjunction with `types`.
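The chunking and streaming mentioned above could lean on pandas' own lazy-reading support. A minimal sketch, with a tiny in-memory stand-in for a large uploaded file (column names and chunk size are illustrative):

```python
import io
import pandas as pd

# Stand-in for a large uploaded delimited file.
upload = io.StringIO(
    "properties.title|when.timespans.0.start.in\n"
    "Abingdon|1600\n"
    "Oxford|1800\n"
)

inserted = 0
# chunksize makes read_csv yield DataFrames lazily instead of
# loading the whole file into memory at once.
for chunk in pd.read_csv(upload, sep="|", chunksize=1):
    # In WHG this is roughly where per-chunk validation and a
    # bulk_create inside transaction.atomic would run, and where a
    # Celery task could report progress back to the browser's spinner.
    inserted += len(chunk)

print(inserted)  # 2
```

A realistic chunk size would be in the thousands of rows; the point is only that memory use stays bounded regardless of file size.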
- `attestation_year` property from LP-TSV

The current JSON schema (`datasets/static/validate/schema_lpf_v1.2.2.json`) could benefit from the use of definitions for common elements like `timespans`, `periods`, and `when`. Additionally (and based on a comparison with the example LPF JSON), ChatGPT identifies several errors that it could easily address (subject, of course, to checking):
Further improvements could be made in the validation pipeline with the use of `jsonschema` custom validators, to check, for example, that end dates are no earlier than start dates.
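A sketch of how such a check could work with `jsonschema`'s validator-extension mechanism. The keyword name `timespanOrder` is hypothetical, and comparing year strings lexicographically is a simplification (real LPF dates would need proper parsing):

```python
from jsonschema import Draft7Validator, validators
from jsonschema.exceptions import ValidationError

def check_timespan_order(validator, value, instance, schema):
    """Hypothetical 'timespanOrder' keyword: end must not precede start."""
    if not value or not isinstance(instance, dict):
        return
    start = instance.get("start", {}).get("in")
    end = instance.get("end", {}).get("in")
    # Simplification: lexicographic comparison of equal-length year strings.
    if start is not None and end is not None and str(end) < str(start):
        yield ValidationError(f"timespan end {end!r} precedes start {start!r}")

# Build a validator class that understands the extra keyword.
TimespanValidator = validators.extend(
    Draft7Validator, {"timespanOrder": check_timespan_order}
)

v = TimespanValidator({"timespanOrder": True})
bad = {"start": {"in": "1750"}, "end": {"in": "1600"}}
good = {"start": {"in": "1600"}, "end": {"in": "1750"}}
print(len(list(v.iter_errors(bad))), len(list(v.iter_errors(good))))  # 1 0
```

Because `validators.extend` produces an ordinary validator class, the extra keyword composes with all the standard draft-07 checks already in the schema.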
LPF could be improved by extending the schema to include an optional citation for the dataset itself in the form of CSL JSON. This would provide much of the metadata required for WHG, and allow downloaded datasets to be better cited.
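For illustration, a minimal CSL JSON item of the kind such an embedded dataset citation might contain (all field values invented; the field names follow the CSL JSON data model):

```json
{
  "id": "example-dataset-2023",
  "type": "dataset",
  "title": "An Example Historical Gazetteer",
  "author": [{ "family": "Example", "given": "A." }],
  "issued": { "date-parts": [[2023]] },
  "URL": "https://example.org/dataset"
}
```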
See also #350 re polygon validation/correction
Based on v1.2.2 and on example record here. Uses definitions where appropriate to avoid duplication and improve flexibility. Improvement of this schema is fundamental to securing the robustness of WHG data and of its upload and validation.
It should be given a https://whgazetteer.org-based `$id` and become part of the WHG codebase; in the latest development branch, v2.0 would be at https://whgazetteer.org/schema/lpf_v2.0.jsonld. The `@context` document (based on Karl's latest draft) would similarly move to https://whgazetteer.org/schema/lpo_v2.0.jsonld, and the schema declaration would be updated from `draft-07/schema#` to `draft/2020-12/schema`.

FWIW, there is a new (9 Aug) draft `lpo:` ontology, and also a corresponding (I think) context file for LPF. Not sure the context matters much, as I don't think anyone attempts to do reasoning against LPF as an RDF syntax. These were developed in the course of making a .ttl export of WHG place data as an experiment.
I'm incorporating some of the compaction functionality of `pyld` for preprocessing uploads, based on the context; we could easily offer downloads that employ the expansion functionality for anyone who wants it.
I think json-ld expansion to rdf/xml is very low priority.
Some context: The new draft ontology and context are part of an experimental .ttl serialization, the script for which will be a PR before long. There would be a one-shot off-hours export of all public Place records in some detail, imported to a GraphDB database on another server, where some experimental UI work would happen. The script could be modified in the future to output updates of newly public records for addition to the graph. But that is pending a finding that it is useful in some way. This graph experiment may expand to adding records from other sources.
The impact on WHG for the time being (and quite some time) will be the addition of a single management command that will be accessed only manually: for initial testing on a limited number of records, and then one large export.
The exported .ttl is not fully LP format-compatible.
Thanks - json-ld expansion is not even low priority.
It strikes me that because the staging server is running with an exact copy (made a few days ago using this script) of the main database, this kind of thing might best be done there?
- `{'label': 'human settlement', 'identifier': 'wd:Q486972'}`
- `datasets.insert.ds_insert_json`
- `datasets.insert.ds_insert_delim` (the `minmax` function fails if either `start` or `end` is missing from a timespan)
- `ds_insert_json`: `objs`, `data_mappings`, and `bulk_create_operations`
- `dbcount` is a redundant check
- `src_id=feat['@id'] if uribase in ['', None] else feat['@id'].replace(uribase, ''),` should remove only from the start of the string
- `requirements.txt` needs the addition of `odfpy==1.4.1` in order to handle upload of `.ods` data files
- `aat_types`: validation requires all to have `aat_types` even if they have `fclasses`
- Re-think of Dataset Upload & Validation
- `fclasses` and `timespans`
- `.lpsjon`
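The `uribase` bug noted above stems from `str.replace`, which removes every occurrence of the base, not just a leading one; `str.removeprefix` (Python 3.9+) or a `startswith` guard fixes it. A sketch with a contrived id in which the base string also appears mid-string:

```python
uribase = "ex:"
feat_id = "ex:place/ex:variant"  # contrived: uribase also occurs mid-string

# Buggy: replace() strips every occurrence of uribase.
assert feat_id.replace(uribase, "") == "place/variant"

# Fixed: strip only a leading match (str.removeprefix, Python 3.9+).
assert feat_id.removeprefix(uribase) == "place/ex:variant"
```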
Resources

- `/datasets/templates/datasets/dataset_create.html`
- `datasets.views.DatasetCreate`
- `datasets.forms.DatasetUploadForm`