archesproject / arches

Arches is a web platform for creating, managing, & visualizing geospatial data. Arches was inspired by the needs of the Cultural Heritage community, particularly the widespread need of organizations to build & manage cultural heritage inventories
GNU Affero General Public License v3.0
216 stars 144 forks source link

EDTF timespans are not validated on input #11351

Open johnatawnclementawn opened 2 months ago

johnatawnclementawn commented 2 months ago

This issue was originally identified in the context of the bulk data manager, but should be fixed in the EDTF datatype's validation layer.

The EDTF format allows the storage of a timespan by using the "/" delimiter like this: 2023-06/2024-01. If your input csv contains incorrectly formatted EDTF timespan (e.g. 2024-01/2023-06), the data will be loaded, but the resource will fail to index and you will be blocked from making any other changes to the resource in the resource editor until you have fixed the incorrectly formatted EDTF data.

edtf_help

Identified in Arches-EWAP, Arches 7.5.1

chiatt commented 2 months ago

This doesn't seem like it's an issue with the bulk data manager, but rather with the EDTF datatype's validation logic. While it might make sense for some bulk modules to do some validation, ultimately it's up to the datatype's validation to ensure that nothing is saved to a tile that cannot be indexed.

johnatawnclementawn commented 2 months ago

Point well taken, @chiatt. I'd like to point out that you are not able to save invalid EDTF timespans via the resource editor, but you can save invalid data if using the single-csv bulk data module, which is why I initially attributed this to the BDM. I suppose this could be a case of data validation layers existing both in the front-end and back-end?

chiatt commented 2 months ago

Point well taken, @chiatt. I'd like to point out that you are not able to save invalid EDTF timespans via the resource editor, but you can save invalid data if using the single-csv bulk data module, which is why I initially attributed this to the BDM. I suppose this could be a case of data validation layers existing both in the front-end and back-end?

There shouldn't be any need for front-end validation because the bulk importers call datatype.validate on all values before actually staging any data: https://github.com/archesproject/arches/blob/c6b043ae0357892f25f1c143daadde1162ff4e19/arches/app/etl_modules/base_import_module.py#L116

Perhaps this is getting bypassed somehow in single-csv? At a glance it looks okay, but there must be something wrong: https://github.com/archesproject/arches/blob/c6b043ae0357892f25f1c143daadde1162ff4e19/arches/app/etl_modules/import_single_csv.py#L371C58-L371C82