Problem: It is difficult to validate rights.csv and metadata.csv before ingest

goetzk commented 5 years ago

Please describe the problem you'd like to be solved.

At the moment it's easy to make mistakes in the metadata.csv and rights.csv files and hard to detect them. It would be great to have a way for the user to validate them before they are wrapped up by the transfer and fed into the micro services.

Describe the solution you'd like to see implemented.

Ideally the files would be checked in multiple places (see end of this section), but the focus of this request is a view within Archivematica where the user can upload these files for validation.

Thoughts so far on validation issues to check:

Spaces around commas (,)
Missing/missed files (this would not be feasible for a web upload but would be for CLI checks)
- See also #462
Correct number of columns (are the rows the same length)
Spelling of headings
Spelling of CSV data where the field is using a fixed taxonomy
Possibly comparing fields in a single column for similarity and flagging up outliers
Date formats
If it were sufficiently flexible, integrations (like AtoM or DSpace) might be able to add their own validation requirements.
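As a rough illustration of several of the checks listed above (stray spaces, row length, heading spelling, date format), here is a minimal Python sketch using only the standard library. The heading list, the `dc.date` date column and the `YYYY-MM-DD` format are assumptions made for the example, not Archivematica's actual rules, and `validate_csv_text` is a hypothetical name.

```python
import csv
import difflib
import io
from datetime import datetime

# Hypothetical expectations for illustration only; the real column sets for
# metadata.csv and rights.csv are defined by Archivematica, not by this sketch.
EXPECTED_HEADINGS = ["filename", "dc.title", "dc.date"]
DATE_COLUMNS = {"dc.date"}


def validate_csv_text(text, expected_headings=EXPECTED_HEADINGS):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    header = rows[0]
    for name in header:
        stripped = name.strip()
        if stripped != name:
            problems.append(f"row 1: heading {name!r} has spaces around it")
        if stripped not in expected_headings:
            # Suggest the closest known heading to catch spelling mistakes.
            guess = difflib.get_close_matches(stripped, expected_headings, n=1)
            hint = f" (did you mean {guess[0]!r}?)" if guess else ""
            problems.append(f"row 1: unknown heading {stripped!r}{hint}")

    names = [h.strip() for h in header]
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        for name, value in zip(names, row):
            if value != value.strip():
                problems.append(f"row {number}: {name} has spaces around the value")
            if name in DATE_COLUMNS and value.strip():
                try:
                    datetime.strptime(value.strip(), "%Y-%m-%d")
                except ValueError:
                    problems.append(f"row {number}: {name} is not YYYY-MM-DD")
    return problems
```

Returning a list of messages rather than raising on the first error means a user sees every problem in one pass, which seems closer to the intent of this request.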

Other places to validate the files (and provide meaningful errors) would be in the relevant micro services and ideally a CLI script which could be called by bulk uploading tools.
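A CLI check could also cover the missing-files case that a web upload cannot. The sketch below assumes the CSV has a `filename` column holding paths relative to the transfer directory; the function names and the exit-code convention are hypothetical, chosen so a bulk uploading tool could call it and test the return value.

```python
import argparse
import csv
import sys
from pathlib import Path


def find_missing_files(metadata_csv, transfer_dir, column="filename"):
    """Return paths named in the CSV column that do not exist under transfer_dir."""
    missing = []
    with open(metadata_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            relative = (row.get(column) or "").strip()
            if relative and not (Path(transfer_dir) / relative).exists():
                missing.append(relative)
    return missing


def main(argv=None):
    """Parse arguments, report missing files on stderr, return an exit code."""
    parser = argparse.ArgumentParser(
        description="Check that files named in metadata.csv exist in a transfer"
    )
    parser.add_argument("metadata_csv")
    parser.add_argument("transfer_dir")
    args = parser.parse_args(argv)

    missing = find_missing_files(args.metadata_csv, args.transfer_dir)
    for path in missing:
        print(f"missing: {path}", file=sys.stderr)
    # Non-zero exit status lets calling tools detect a failed check.
    return 1 if missing else 0
```

Adding the usual `if __name__ == "__main__": sys.exit(main())` guard at the bottom would turn this into a standalone script.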

For Artefactual use: Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

All PRs related to this issue are properly linked 👍
All PRs related to this issue have been merged 👍
Test plan for this issue has been implemented and passed 👍
Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍

helrond commented 5 years ago

Unintended errors in CSV files have definitely been a headache for us as well. I wonder, though, if an upload UI is the right approach? I'd suggest considering an API endpoint which accepts a CSV payload, runs validation of the file and then returns the results. That would allow other services and systems to use that endpoint to validate these files automatically, rather than relying on a human to upload and check in a UI.
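To make the idea concrete, here is a minimal sketch of such an endpoint as a plain WSGI application using only the Python standard library. It is not an existing Archivematica API: the function names, the JSON response shape and the single column-count check are all assumptions for illustration.

```python
import csv
import io
import json


def validate_rows(text):
    """Minimal stand-in check: every row has as many cells as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    width = len(rows[0])
    return [
        f"row {number}: expected {width} columns, found {len(row)}"
        for number, row in enumerate(rows[1:], start=2)
        if len(row) != width
    ]


def application(environ, start_response):
    """Hypothetical WSGI app: POST a CSV body, receive a JSON validation report."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed", [("Allow", "POST")])
        return [b""]

    # Read exactly the declared request body and treat it as UTF-8 CSV text.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    text = environ["wsgi.input"].read(length).decode("utf-8")

    errors = validate_rows(text)
    body = json.dumps({"valid": not errors, "errors": errors}).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

For a quick local trial it could be served with `wsgiref.simple_server.make_server("", 8000, application).serve_forever()`; any other service could then POST a CSV file and read back the `errors` list.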

FWIW we built out some really rudimentary validation of rights.csv using the Python CSVValidator library here: https://github.com/RockefellerArchiveCenter/fornax/blob/master/sip_assembly/library.py#L104

goetzk commented 5 years ago

@helrond while I'm not opposed to an API endpoint, I know that for my institution having people upload directly would be more helpful. We don't have extensive automated integration with Archivematica, and almost all data is entered in CSV files manually or semi-manually. If only an API endpoint is supplied, we would have to build out tools ourselves to provide the testing view to users.

I suspect there are many institutions in the same situation.

ross-spencer commented 5 years ago

@helrond @goetzk from as neutral a standpoint as I can be, I do like the idea of splitting the development approach like this: a) provide an API endpoint first, which offers other ways to supply this data and could open up the opportunity for others to develop UI-based tooling; and then b) that could lead to tooling being developed in Archivematica, either building on some of those community-driven examples that use the API, or through another approach using the API with an organisation looking to invest in that.

An API could be a good first step, though there may be plenty of other models that folks might want to look at, and those should be explored as well.

Actually, in writing this out, and considering your issue in a bit more detail, I am reminded of a CSV validation tool developed by my former colleagues at The National Archives, UK: https://blog.nationalarchives.gov.uk/blog/csv-validator-new-digital-preservation-tool/. If a schema were written, then validating your spreadsheet with something like this, or the Open Data Institute's https://csvlint.io/, could be an extra step before transfer. I have enjoyed writing schemas in the latter before, but didn't quite get it into an end-to-end workflow in my previous work.
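To give a feel for what a schema-driven check involves, here is a small sketch in plain Python. It does not use the schema syntax of either CSV Validator or CSVLint; the column names, allowed values and rule keys (`required`, `allowed`, `pattern`) are hypothetical and only show the general idea of describing the rules as data and applying them generically.

```python
import csv
import io
import re

# Hypothetical schema: column name -> rule. This is not an official
# Archivematica schema, nor the format used by CSV Validator or CSVLint.
RIGHTS_SCHEMA = {
    "file": {"required": True},
    "basis": {"required": True, "allowed": {"copyright", "license", "statute", "other"}},
    "start_date": {"pattern": r"\d{4}-\d{2}-\d{2}"},
}


def check_against_schema(text, schema):
    """Validate CSV text against a declarative schema; return a list of problems."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    problems = [f"missing column {name!r}" for name in schema if name not in fieldnames]

    for number, row in enumerate(reader, start=2):
        for name, rule in schema.items():
            if name not in row:
                continue  # Column absent from the header; already reported once.
            value = (row[name] or "").strip()
            if not value:
                if rule.get("required"):
                    problems.append(f"row {number}: {name} is required")
                continue
            if "allowed" in rule and value.lower() not in rule["allowed"]:
                problems.append(f"row {number}: {name} {value!r} is not an allowed value")
            if "pattern" in rule and not re.fullmatch(rule["pattern"], value):
                problems.append(f"row {number}: {name} {value!r} has an unexpected format")
    return problems
```

Keeping the rules in a separate structure like `RIGHTS_SCHEMA` is what would make them shareable between institutions, independently of whichever tool ends up applying them.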

Obviously something like a schema for a component of Archivematica could be of quite wide interest, so if you write something, please do consider sharing!

helrond commented 5 years ago

Thanks for this, @ross-spencer! CSVLint looks really useful, and more generally, having explicit schemas for the various CSV files seems like it would be helpful both for the Archivematica community as well as Artefactual.

Talking about this with @bonniegee it sounds like we are interested in building out some schemas as part of some other Archivematica-adjacent projects, and will think about ways to get the word out about this approach and/or any schemas we develop that are generalized/generalizable enough to be used by others.

I'm not super familiar with all the validation Archivematica is doing against various CSV files, but it seems to me that if there are existing tools which do this kind of validation then IMO pulling those tools into core Archivematica should not be a high priority.

ablwr commented 5 years ago

Hi all, as I mention in the above PR, I'd be interested in exploring the possibility of developing and integrating the above script into Archivematica as an API-like CSV validator feature, as described above. This script was written prior to this conversation, but I/we may have the capacity to use this project as a way to create a minimum-viable feature such as this.

I can work on an ADR (https://github.com/artefactual-labs/archivematica-architectural-decisions) if there is interest. @goetzk, it'd be different from your initial request, but like @ross-spencer says, we may be able to start with this and move into something like your initial idea at a later time, with dedicated time/funding.

goetzk commented 5 years ago

On 21/3/19 09:24, Ashley wrote:

I can work on an ADR (https://github.com/artefactual-labs/archivematica-architectural-decisions) if there is interest. @goetzk, it'd be different from your initial request, but like @ross-spencer says, we may be able to start with this and move into something like your initial idea at a later time, with dedicated time/funding.

Hi Ashley,

I'm certainly not opposed to an API endpoint and if that is the logical starting place then I don't mind that being what happens. My only concern with the current state of the issue is that my initial request may end up being ignored due to the changed focus.

Karl.

ablwr commented 5 years ago

Hi @goetzk, it is true that this would not be work on your initial request. I hope that solving my problem above will make your request easier to build out later, as @ross-spencer indicated, if your feature request is something that gets developed or funded for development.

sromkey commented 5 years ago

Something we're learning through an automated testing project is that features that are UI-based only are difficult to test: the tests are by necessity based on the UI and become brittle (e.g. they break and need to be fixed every time we make a UI change). IMO an API-first approach is best when feasible, since the feature can always be built out in the UI if time/funds become available.

Problem: It is difficult to validate rights.csv and metadata.csv before ingest #563