frictionlessdata / datapackage-pipelines

Framework for processing data packages in pipelines of modular components.
https://frictionlessdata.io/
MIT License

[dataflows] An ability to add goodtables checks to the validate processor #192

Closed cschloer closed 3 years ago

cschloer commented 4 years ago

https://github.com/datahq/dataflows/issues/142

adyork commented 3 years ago

We talked about having a test that checks values in a sci_name column for abbreviated taxon names: the first letter of the genus, then a period, then the species name, like G. morhua. We would suggest using the full genus name instead, like Gadus morhua.

We would want to flag any sci_names that match ^\w\.

Note that these are actually fine and we don't want to flag them: Gadus sp. and Gadus spp. See https://regex101.com/r/8eFGXw/1/
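For reference, here is a minimal sketch of that check in plain Python, using only the standard `re` module and the pattern quoted above. The function name `is_abbreviated` and the sample list are made up for illustration; this is not part of dataflows or goodtables. Note that `\w` also matches digits and underscores, so a stricter pattern might be wanted in practice.

```python
import re

# Pattern from the comment above: a single word character
# followed by a period at the start of the value.
ABBREVIATED_GENUS = re.compile(r"^\w\.")


def is_abbreviated(sci_name: str) -> bool:
    """Return True when the name starts with a one-letter genus abbreviation."""
    return ABBREVIATED_GENUS.match(sci_name) is not None


# Hypothetical sample values for a sci_name column.
names = ["G. morhua", "Gadus morhua", "Gadus sp.", "Gadus spp."]

# Only the abbreviated genus should be flagged.
flagged = [name for name in names if is_abbreviated(name)]
print(flagged)
```

Because the period must come directly after the first character, "Gadus sp." and "Gadus spp." are not matched, which is the behaviour we want.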

roll commented 3 years ago

Closing this for now, as we decided to separate the dataflows and goodtables logic.

adyork commented 3 years ago

I had started a Python notebook in Colab that installs the dataflows commit we wanted to test (from https://github.com/datahq/dataflows/pull/146) and loads some test data for this issue from our frictionless-usecases repo, which has "bad" names we want to check. I didn't get to testing validate_metadata, which is probably just as well because it isn't being developed further.

I'm linking it here in case we want to modify it to do goodtables testing, or whatever implementation we end up with. The link to the data and the basic flow are there: https://gist.github.com/adyork/9ae791ebee7b0b651be034ec1b033c18#file-test-field-name-validation-ipynb

load('https://github.com/BCODMO/frictionless-usecases/raw/master/usecases/818993_seabirdCTD/orig/head/FK190211_CTD004_01032019.csv', format='csv')