WorldHistoricalGazetteer / whg3

Version 3 beta
BSD 3-Clause "New" or "Revised" License

Fix/Streamline Upload Form & Processing #294

Open · docuracy opened this issue 3 months ago

docuracy commented 3 months ago

Re-think of Dataset Upload & Validation

Resources

/datasets/templates/datasets/dataset_create.html
datasets.views.DatasetCreate
datasets.forms.DatasetUploadForm

docuracy commented 2 months ago

Robustness & Maintainability: Data Formats, Validation, & Insertion

The current code for validation and database-insertion of JSON datasets spans 118 lines; by contrast, that for delimited text datasets (ds_insert_delim and its helper functions) spans 475 lines and is commensurately more difficult to maintain and extend.

The pandas library (already in use) suggests a way to simplify not only the codebase but also the preparation of datasets, via its json_normalize method. Used with recursion into nested structures, LPF JSON can be converted to delimited text reversibly and without any loss of data. The entire LPF JSON Feature example given at https://github.com/LinkedPasts/linked-places-format could, for example, be transformed into the following 80 flattened columns:

type | @id | properties.title | properties.ccodes.0 | properties.fclasses.0 | when.timespans.0.start.in | when.timespans.0.end.in | when.periods.0.name | when.periods.0.@id | when.periods.1.name | when.periods.1.@id | when.label | when.duration | when.certainty | names.0.toponym | names.0.lang | names.0.citations.0.label | names.0.citations.0.year | names.0.citations.0.@id | names.0.when.timespans.0.start.in | names.1.toponym | names.1.lang | names.1.when.timespans.0.start.in | names.1.when.certainty | types.0.identifier | types.0.label | types.0.sourceLabels.0.label | types.0.sourceLabels.0.lang | types.0.when.timespans.0.start.in | geometry.type | geometry.geometries.0.type | geometry.geometries.0.coordinates.0 | geometry.geometries.0.coordinates.1 | geometry.geometries.0.when.timespans.0.start.in | geometry.geometries.0.when.timespans.0.end.in | geometry.geometries.0.citations.0.label | geometry.geometries.0.citations.0.@id | geometry.geometries.0.certainty | geometry.geometries.1.type | geometry.geometries.1.coordinates.0 | geometry.geometries.1.coordinates.1 | geometry.geometries.1.geowkt | geometry.geometries.1.when.timespans.0.start.in | geometry.geometries.1.certainty | links.0.type | links.0.identifier | links.1.type | links.1.identifier | links.2.type | links.2.identifier | links.3.type | links.3.identifier | links.4.type | links.4.identifier | links.5.type | links.5.identifier | relations.0.relationType | relations.0.relationTo | relations.0.label | relations.0.when.timespans.0.start.in | relations.0.when.timespans.0.end.in | relations.1.relationType | relations.1.relationTo | relations.1.label | relations.1.when.timespans.0.start.in | relations.2.relationType | relations.2.relationTo | relations.2.label | relations.2.when.timespans.0.start.in | relations.2.citations.0.label | relations.2.citations.0.year | relations.2.citations.0.@id | relations.2.certainty | descriptions.0.@id | descriptions.0.value | descriptions.0.lang | depictions.0.@id | depictions.0.title | depictions.0.license

The format is extensible for larger arrays (but, unlike LPF JSON, does not accommodate metadata, which would need to be provided separately, as is already the case for LP-TSV). Most datasets would probably use considerably fewer columns, and any delimited text which uses this column-naming convention could be converted to LPF JSON. WHG users could submit flattened LPF (f-LPF?) as an alternative to LP-TSV. It could be useful to allow JSON objects in uploads to reduce the number of required columns; citations, for example, could be uploaded as delimited text in the form property | value (following a template).
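
For illustration, the flattening could be done with a small recursive helper of this kind (a sketch only: pd.json_normalize flattens nested objects but not nested arrays, so a recursive walk is used here to produce the indexed column names shown above; the file names are placeholders):

```python
import json
import pandas as pd

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts and lists into dotted keys
    (e.g. names.0.toponym), matching the column naming shown above."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

with open("lpf_example.json", encoding="utf-8") as f:
    collection = json.load(f)

# One row per Feature; pandas takes the union of all flattened keys as columns,
# so sparse records simply leave unused columns empty.
df = pd.DataFrame([flatten(feature) for feature in collection["features"]])
df.to_csv("lpf_example.tsv", sep="\t", index=False)
```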

The processing pipeline for an uploaded dataset could then be simplified, as below. Chunking and streaming must be used where indicated in order to accommodate very large files, which would otherwise compromise the memory and performance of the server. Both validation and insertion ought to run as Celery tasks, reporting progress back to the browser's spinner label, both to avoid timeout errors and to reassure users (a rough sketch of steps 5 and 6 follows the list).

  1. Try loading a line (csv, tsv) or chunk (xlsx, ods) into a dataframe: is it delimited text? If not, go to 4.
  2. Is it LP-TSV? If so, rename columns to conform to f-LPF (e.g. start -> when.timespans.0.start.in) before continuing...
  3. Is it f-LPF? If so, either line-by-line or by chunking, convert and write to LPF (JSON) before continuing...
  4. Using chunking, try loading as JSON and at the same time check the encoding: is it UTF-8 JSON? If not, exit.
  5. Validate using streaming.
  6. Insert into database with ds_insert_json:
    • move dataset creation out of View and into the transaction.atomic block;
    • modify to employ streaming;
    • accumulate up to 100? objects from multiple features to optimise bulk_create: does this resolve the intermittent "duplicate id on insert error" more evident on large datasets?
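
A rough sketch of steps 5 and 6, assuming ijson for streaming (not currently a stated dependency), a hypothetical LpfFeature model, and placeholder batch sizes and progress intervals:

```python
import ijson
from celery import shared_task
from django.db import transaction
from jsonschema import Draft7Validator

from datasets.models import LpfFeature  # hypothetical model name

BATCH_SIZE = 100  # per the suggestion above; to be tuned


@shared_task(bind=True)
def validate_and_insert(self, filepath, dataset_id, schema):
    """Stream-validate an LPF file and insert Features in batches."""
    validator = Draft7Validator(schema)
    batch, processed = [], 0
    with open(filepath, "rb") as f, transaction.atomic():
        # Stream Features one at a time rather than loading the whole file.
        for feature in ijson.items(f, "features.item"):
            errors = sorted(validator.iter_errors(feature), key=str)
            if errors:
                raise ValueError(f"Feature {processed}: {errors[0].message}")
            batch.append(LpfFeature(dataset_id=dataset_id, jsonb=feature))
            if len(batch) >= BATCH_SIZE:
                LpfFeature.objects.bulk_create(batch)
                batch.clear()
            processed += 1
            if processed % 500 == 0:
                # Report progress to the browser's spinner label via the task backend.
                self.update_state(state="PROGRESS", meta={"features": processed})
        if batch:
            LpfFeature.objects.bulk_create(batch)
    return processed
```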

Such a pipeline would readily allow for extensions to the LPF standard, such as requiring fclasses where types are absent, by modifying the JSON schema (and perhaps any related Django data models). The fclasses property needs a definition, and a "oneOf" enforcement in conjunction with types.
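
A hypothetical fragment of such a constraint might look like this (anyOf is used here so that a record may carry both; oneOf would enforce exactly one of the two):

```python
# Hypothetical addition to the Feature schema, not the current
# schema_lpf_v1.2.2.json: every Feature must carry either a non-empty
# "types" array or a non-empty "properties.fclasses" array.
feature_classification_constraint = {
    "anyOf": [
        {"required": ["types"], "properties": {"types": {"minItems": 1}}},
        {
            "required": ["properties"],
            "properties": {
                "properties": {
                    "required": ["fclasses"],
                    "properties": {"fclasses": {"minItems": 1}},
                }
            },
        },
    ]
}
```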

The current JSON schema (datasets/static/validate/schema_lpf_v1.2.2.json) could benefit from the use of definitions for common elements like timespans, periods, and when. Additionally (and based on a comparison with the example LPF JSON), ChatGPT identifies several errors that it could easily address (subject, of course, to checking).
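
A minimal sketch of the kind of shared definitions suggested (illustrative property names only; shown with draft-07-style definitions and $ref, while later drafts use $defs):

```python
# Illustrative only: shared sub-schemas defined once and referenced wherever
# a "when" (with its timespans) can occur, instead of repeating them inline.
schema_fragment = {
    "definitions": {
        "yearSpec": {
            "type": "object",
            "properties": {
                "in": {"type": "string"},
                "earliest": {"type": "string"},
                "latest": {"type": "string"},
            },
        },
        "timespan": {
            "type": "object",
            "properties": {
                "start": {"$ref": "#/definitions/yearSpec"},
                "end": {"$ref": "#/definitions/yearSpec"},
            },
        },
        "when": {
            "type": "object",
            "properties": {
                "timespans": {
                    "type": "array",
                    "items": {"$ref": "#/definitions/timespan"},
                },
                "periods": {"type": "array"},
            },
        },
    }
}
```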

Further improvements could be made in the validation pipeline with use of jsonschema CustomValidators, to check for example that end dates are no earlier than start dates.
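
For example, a custom keyword along these lines could be registered with jsonschema (the keyword name is hypothetical, and the date comparison is deliberately simplified; real code would need proper ISO-8601 handling, including BCE years):

```python
from jsonschema import Draft7Validator, ValidationError, validators

def check_timespan_order(validator, value, instance, schema):
    """Custom keyword for timespan objects: when both start and end are
    given, the end must not be earlier than the start."""
    if not value or not isinstance(instance, dict):
        return
    start = (instance.get("start") or {}).get("in")
    end = (instance.get("end") or {}).get("in")
    # Simplified comparison; production code must parse ISO-8601 properly.
    if start and end and str(end) < str(start):
        yield ValidationError(f"Timespan end ({end}) precedes start ({start})")

LPFValidator = validators.extend(
    Draft7Validator, validators={"timespanOrder": check_timespan_order}
)
# The timespan definition in the schema would then declare "timespanOrder": true,
# and LPFValidator(schema).iter_errors(feature) would enforce it during validation.
```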

LPF could be improved by extending the schema to include an optional citation for the dataset itself in the form of CSL JSON. This would provide much of the metadata required for WHG, and allow downloaded datasets to be better cited.
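
For instance, a dataset-level citation in CSL JSON might look like the following (all values invented for illustration):

```python
# Illustrative CSL JSON item of type "dataset"; every value is a placeholder.
dataset_citation = {
    "type": "dataset",
    "title": "Example Historical Places Dataset",
    "author": [{"family": "Doe", "given": "Jane"}],
    "issued": {"date-parts": [[2024]]},
    "publisher": "World Historical Gazetteer",
    "URL": "https://whgazetteer.org/datasets/example/",
    "version": "1.0",
}
```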

docuracy commented 2 months ago

See also #350 re polygon validation/correction

docuracy commented 2 months ago

Draft LPF v2.0 JSON Schema

Based on v1.2.2 and on an example record here. Uses definitions where appropriate to avoid duplication and improve flexibility. Improvement of this schema is fundamental to securing the robustness of WHG data and of its upload and validation.

Questions:

Further improvement:

kgeographer commented 2 months ago

FWIW, there is a new (9 Aug) draft lpo: ontology. Also a corresponding (I think) context file for LPF

Not sure the context matters much, as I don't think anyone attempts to do reasoning against LPF as an RDF syntax. These were developed in the course of making a .ttl export of WHG place data as an experiment.

docuracy commented 2 months ago

I'm incorporating some of the compaction functionality of pyld for preprocessing uploads, based on the context; we could easily offer downloads that employ the expansion functionality for anyone who wants it.
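
A minimal sketch of what that preprocessing might look like with pyld (the context URL is a placeholder):

```python
from pyld import jsonld

LPF_CONTEXT = "https://example.org/linkedplaces-context.jsonld"  # placeholder URL

def compact_upload(doc):
    """Normalise an uploaded JSON-LD document against the LPF context,
    so that downstream validation sees consistent, compacted keys."""
    return jsonld.compact(doc, LPF_CONTEXT)

def expand_download(doc):
    """Optionally expand a compacted document to full IRIs for users who want it."""
    return jsonld.expand(doc)
```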

kgeographer commented 1 month ago

I think json-ld expansion to rdf/xml is very low priority.

Some context: The new draft ontology and context are part of an experimental .ttl serialization, the script for which will be a PR before long. There would be a one-shot off-hours export of all public Place records in some detail, imported to a GraphDB database on another server, where some experimental UI work would happen. The script could be modified in the future to output updates of newly public records for addition to the graph. But that is pending a finding that it is useful in some way. This graph experiment may expand to adding records from other sources.

The impact on WHG for the time being (and quite some time) will be the addition of a single management command that will be accessed only manually - for initial testing on a limited number of records, and then one large export.

The exported .ttl is not fully LP format-compatible

docuracy commented 1 month ago

Thanks - json-ld expansion is not even low priority.

It strikes me that because the staging server is running with an exact copy (made a few days ago using this script) of the main database, this kind of thing might best be done there?