WorldHistoricalGazetteer / whg3

Version 3 beta
BSD 3-Clause "New" or "Revised" License

Fix/Streamline Upload Form & Processing #294

Open · docuracy opened this issue 3 months ago

docuracy commented 3 months ago

Re-think of Dataset Upload & Validation

Resources

/datasets/templates/datasets/dataset_create.html
datasets.views.DatasetCreate
datasets.forms.DatasetUploadForm

docuracy commented 2 months ago

Robustness & Maintainability: Data Formats, Validation, & Insertion

The current code for validation and database-insertion of JSON datasets spans 118 lines; by contrast, that for delimited text datasets (ds_insert_delim and its helper functions) spans 475 lines and is commensurately more difficult to maintain and extend.

The pandas library (already in use) suggests a way to simplify not only the codebase but also the preparation of datasets, via its json_normalize method. Used with recursion into nested structures, LPF JSON can be converted to delimited text reversibly and without any loss of data. The entire LPF JSON Feature example given at https://github.com/LinkedPasts/linked-places-format could, for example, be transformed into the following 80 flattened columns:

type | @id | properties.title | properties.ccodes.0 | properties.fclasses.0 | when.timespans.0.start.in | when.timespans.0.end.in | when.periods.0.name | when.periods.0.@id | when.periods.1.name | when.periods.1.@id | when.label | when.duration | when.certainty | names.0.toponym | names.0.lang | names.0.citations.0.label | names.0.citations.0.year | names.0.citations.0.@id | names.0.when.timespans.0.start.in | names.1.toponym | names.1.lang | names.1.when.timespans.0.start.in | names.1.when.certainty | types.0.identifier | types.0.label | types.0.sourceLabels.0.label | types.0.sourceLabels.0.lang | types.0.when.timespans.0.start.in | geometry.type | geometry.geometries.0.type | geometry.geometries.0.coordinates.0 | geometry.geometries.0.coordinates.1 | geometry.geometries.0.when.timespans.0.start.in | geometry.geometries.0.when.timespans.0.end.in | geometry.geometries.0.citations.0.label | geometry.geometries.0.citations.0.@id | geometry.geometries.0.certainty | geometry.geometries.1.type | geometry.geometries.1.coordinates.0 | geometry.geometries.1.coordinates.1 | geometry.geometries.1.geowkt | geometry.geometries.1.when.timespans.0.start.in | geometry.geometries.1.certainty | links.0.type | links.0.identifier | links.1.type | links.1.identifier | links.2.type | links.2.identifier | links.3.type | links.3.identifier | links.4.type | links.4.identifier | links.5.type | links.5.identifier | relations.0.relationType | relations.0.relationTo | relations.0.label | relations.0.when.timespans.0.start.in | relations.0.when.timespans.0.end.in | relations.1.relationType | relations.1.relationTo | relations.1.label | relations.1.when.timespans.0.start.in | relations.2.relationType | relations.2.relationTo | relations.2.label | relations.2.when.timespans.0.start.in | relations.2.citations.0.label | relations.2.citations.0.year | relations.2.citations.0.@id | relations.2.certainty | descriptions.0.@id | descriptions.0.value | descriptions.0.lang | depictions.0.@id | depictions.0.title | depictions.0.license

The format is extensible for larger arrays (but, unlike LPF JSON, does not accommodate metadata, which would need to be provided separately, as is already the case for LP-TSV). Most datasets would probably use considerably fewer columns, and any delimited text which uses this column-naming convention could be converted to LPF JSON. WHG users could submit flattened LPF (f-LPF?) as an alternative to LP-TSV. It could be useful to allow JSON objects in uploads to reduce the number of required columns; citations, for example, could be uploaded as delimited text in the form property | value (following a template).
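
For illustration, the flattening could be done with a small recursive helper of this kind (a sketch only: pd.json_normalize flattens nested objects but not nested arrays, so a recursive walk is used here to produce the indexed column names shown above; the file names are placeholders):

```python
import json
import pandas as pd

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts and lists into dotted keys
    (e.g. names.0.toponym), matching the column naming shown above."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

with open("lpf_example.json", encoding="utf-8") as f:
    collection = json.load(f)

# One row per Feature; pandas takes the union of all flattened keys as columns,
# so sparse records simply leave unused columns empty.
df = pd.DataFrame([flatten(feature) for feature in collection["features"]])
df.to_csv("lpf_example.tsv", sep="\t", index=False)
```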

The processing pipeline for an uploaded dataset could then be simplified, as below. Chunking and streaming must be used where indicated in order to accommodate very large files, which would otherwise compromise the memory and performance of the server. Both validation and insertion ought to run as Celery tasks, reporting progress back to the browser's spinner label, both to avoid timeout errors and to reassure users (a rough sketch of steps 5 and 6 follows the list).

  1. Try loading a line (csv, tsv) or chunk (xlsx, ods) into a dataframe: is it delimited text? If not, go to 4.
  2. Is it LP-TSV? If so, rename columns to conform to f-LPF (e.g. start -> when.timespans.0.start.in) before continuing...
  3. Is it f-LPF? If so, either line-by-line or by chunking, convert and write to LPF (JSON) before continuing...
  4. Using chunking, try loading as JSON and at the same time check the encoding: is it UTF-8 JSON? If not, exit.
  5. Validate using streaming.
  6. Insert into database with ds_insert_json:
    • move dataset creation out of View and into the transaction.atomic block;
    • modify to employ streaming;
    • accumulate up to 100? objects from multiple features to optimise bulk_create: does this resolve the intermittent "duplicate id on insert error" more evident on large datasets?
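
A rough sketch of steps 5 and 6, assuming ijson for streaming (not currently a stated dependency), a hypothetical LpfFeature model, and placeholder batch sizes and progress intervals:

```python
import ijson
from celery import shared_task
from django.db import transaction
from jsonschema import Draft7Validator

from datasets.models import LpfFeature  # hypothetical model name

BATCH_SIZE = 100  # per the suggestion above; to be tuned


@shared_task(bind=True)
def validate_and_insert(self, filepath, dataset_id, schema):
    """Stream-validate an LPF file and insert Features in batches."""
    validator = Draft7Validator(schema)
    batch, processed = [], 0
    with open(filepath, "rb") as f, transaction.atomic():
        # Stream Features one at a time rather than loading the whole file.
        for feature in ijson.items(f, "features.item"):
            errors = sorted(validator.iter_errors(feature), key=str)
            if errors:
                raise ValueError(f"Feature {processed}: {errors[0].message}")
            batch.append(LpfFeature(dataset_id=dataset_id, jsonb=feature))
            if len(batch) >= BATCH_SIZE:
                LpfFeature.objects.bulk_create(batch)
                batch.clear()
            processed += 1
            if processed % 500 == 0:
                # Report progress to the browser's spinner label via the task backend.
                self.update_state(state="PROGRESS", meta={"features": processed})
        if batch:
            LpfFeature.objects.bulk_create(batch)
    return processed
```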

Such a pipeline would readily allow for extensions to the LPF standard, such as requiring fclasses where types are absent, by modifying the JSON schema (and perhaps any related Django data models). The fclasses property needs a definition, and a "oneOf" enforcement in conjunction with types.
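
A hypothetical fragment of such a constraint might look like this (anyOf is used here so that a record may carry both; oneOf would enforce exactly one of the two):

```python
# Hypothetical addition to the Feature schema, not the current
# schema_lpf_v1.2.2.json: every Feature must carry either a non-empty
# "types" array or a non-empty "properties.fclasses" array.
feature_classification_constraint = {
    "anyOf": [
        {"required": ["types"], "properties": {"types": {"minItems": 1}}},
        {
            "required": ["properties"],
            "properties": {
                "properties": {
                    "required": ["fclasses"],
                    "properties": {"fclasses": {"minItems": 1}},
                }
            },
        },
    ]
}
```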

The current JSON schema (datasets/static/validate/schema_lpf_v1.2.2.json) could benefit from the use of definitions for common elements like timespans, periods, and when. Additionally (and based on a comparison with the example LPF JSON), ChatGPT identifies several errors that it could easily address (subject, of course, to checking).
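
A minimal sketch of the kind of shared definitions suggested (illustrative property names only; shown with draft-07-style definitions and $ref, while later drafts use $defs):

```python
# Illustrative only: shared sub-schemas defined once and referenced wherever
# a "when" (with its timespans) can occur, instead of repeating them inline.
schema_fragment = {
    "definitions": {
        "yearSpec": {
            "type": "object",
            "properties": {
                "in": {"type": "string"},
                "earliest": {"type": "string"},
                "latest": {"type": "string"},
            },
        },
        "timespan": {
            "type": "object",
            "properties": {
                "start": {"$ref": "#/definitions/yearSpec"},
                "end": {"$ref": "#/definitions/yearSpec"},
            },
        },
        "when": {
            "type": "object",
            "properties": {
                "timespans": {
                    "type": "array",
                    "items": {"$ref": "#/definitions/timespan"},
                },
                "periods": {"type": "array"},
            },
        },
    }
}
```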

Further improvements could be made in the validation pipeline with use of jsonschema CustomValidators, to check for example that end dates are no earlier than start dates.
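
For example, a custom keyword along these lines could be registered with jsonschema (the keyword name is hypothetical, and the date comparison is deliberately simplified; real code would need proper ISO-8601 handling, including BCE years):

```python
from jsonschema import Draft7Validator, ValidationError, validators

def check_timespan_order(validator, value, instance, schema):
    """Custom keyword for timespan objects: when both start and end are
    given, the end must not be earlier than the start."""
    if not value or not isinstance(instance, dict):
        return
    start = (instance.get("start") or {}).get("in")
    end = (instance.get("end") or {}).get("in")
    # Simplified comparison; production code must parse ISO-8601 properly.
    if start and end and str(end) < str(start):
        yield ValidationError(f"Timespan end ({end}) precedes start ({start})")

LPFValidator = validators.extend(
    Draft7Validator, validators={"timespanOrder": check_timespan_order}
)
# The timespan definition in the schema would then declare "timespanOrder": true,
# and LPFValidator(schema).iter_errors(feature) would enforce it during validation.
```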

LPF could be improved by extending the schema to include an optional citation for the dataset itself in the form of CSL JSON. This would provide much of the metadata required for WHG, and allow downloaded datasets to be better cited.
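
For instance, a dataset-level citation in CSL JSON might look like the following (all values invented for illustration):

```python
# Illustrative CSL JSON item of type "dataset"; every value is a placeholder.
dataset_citation = {
    "type": "dataset",
    "title": "Example Historical Places Dataset",
    "author": [{"family": "Doe", "given": "Jane"}],
    "issued": {"date-parts": [[2024]]},
    "publisher": "World Historical Gazetteer",
    "URL": "https://whgazetteer.org/datasets/example/",
    "version": "1.0",
}
```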

docuracy commented 2 months ago

See also #350 re polygon validation/correction

docuracy commented 2 months ago

Draft LPF v2.0 JSON Schema

Based on v1.2.2 and on an example record here. Uses definitions where appropriate to avoid duplication and improve flexibility. Improvement of this schema is fundamental to securing the robustness of WHG data and of its upload and validation.

Questions:

Further improvement:

kgeographer commented 2 months ago

FWIW, there is a new (9 Aug) draft lpo: ontology. Also a corresponding (I think) context file for LPF

Not sure the context matters much, as I don't think anyone attempts to do reasoning against LPF as an RDF syntax. These were developed in the course of making a .ttl export of WHG place data as an experiment.

docuracy commented 2 months ago

I'm incorporating some of the compaction functionality of pyld for preprocessing uploads, based on the context; we could easily offer downloads that employ the expansion functionality for anyone who wants it.
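
A minimal sketch of what that preprocessing might look like with pyld (the context URL is a placeholder):

```python
from pyld import jsonld

LPF_CONTEXT = "https://example.org/linkedplaces-context.jsonld"  # placeholder URL

def compact_upload(doc):
    """Normalise an uploaded JSON-LD document against the LPF context,
    so that downstream validation sees consistent, compacted keys."""
    return jsonld.compact(doc, LPF_CONTEXT)

def expand_download(doc):
    """Optionally expand a compacted document to full IRIs for users who want it."""
    return jsonld.expand(doc)
```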

kgeographer commented 1 month ago

I think json-ld expansion to rdf/xml is very low priority.

Some context: The new draft ontology and context are part of an experimental .ttl serialization, the script for which will be a PR before long. There would be a one-shot off-hours export of all public Place records in some detail, imported to a GraphDB database on another server, where some experimental UI work would happen. The script could be modified in the future to output updates of newly public records for addition to the graph. But that is pending a finding that it is useful in some way. This graph experiment may expand to adding records from other sources.

The impact on WHG for the time being (and quite some time) will be the addition of a single management command that will be accessed only manually - for initial testing on a limited number of records, and then one large export.

The exported .ttl is not fully LP format-compatible

docuracy commented 1 month ago

Thanks - json-ld expansion is not even low priority.

It strikes me that because the staging server is running with an exact copy (made a few days ago using this script) of the main database, this kind of thing might best be done there?