SFOE / OGD_qualitychecks

Automated pipeline to check OGD data quality using Frictionless
Streamlit App #13

Closed aresssera closed 8 months ago

aresssera commented 8 months ago

This Streamlit app lets users run quality checks on the CSV files of SFOE OGD publications using the Frictionless framework. It validates each CSV file against its schema (extracted from the corresponding datapackage.json found on uvek-gis), ensuring data integrity and adherence to the specified formats.

For now, I have deployed it using my personal account. The next step is to create a Streamlit Community Cloud account for one of the remaining members and deploy the app from there.

AFoletti commented 8 months ago

Minor issue: The app still breaks if I try to validate a CSV that uses ";" instead of "," as the separator. I think it would be good to catch this error and tell the user about it instead of letting them see a broken app.
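
A minimal sketch of such a check, assuming the uploaded file is available as text before validation. It uses `csv.Sniffer` from the Python standard library to guess the separator; `check_delimiter` and its messages are hypothetical and not taken from the app's code.

```python
import csv


def check_delimiter(sample: str, expected: str = ","):
    """Return None if the sample seems to use `expected`, else a user-facing message."""
    try:
        # Restrict the guess to common separators to reduce false positives.
        detected = csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        return "Could not determine the CSV separator; please check the file."
    if detected != expected:
        return (
            f"The file seems to use {detected!r} as separator, "
            f"but {expected!r} is expected."
        )
    return None


# Hypothetical sample: a semicolon-separated file checked against ",".
message = check_delimiter("a;b;c\n1;2;3\n4;5;6\n")
print(message)
```

In the Streamlit app, a non-empty message could be shown with `st.error(message)` and the validation skipped, so the user sees an explanation rather than a crash.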

Even more minor issue: You could use the "dialect" element in the datapackage.json in order to infer which separator to expect.

"dialect": {
    "delimiter": ",",
    "doubleQuote": true,
    "lineTerminator": "\r\n",
    "quoteChar": "\"",
    "skipInitialSpace": true,
    "header": true,
    "caseSensitiveHeader": false
},
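
Reading the expected separator from the "dialect" element could look roughly like this. The sketch assumes that "dialect" sits inside each entry of the "resources" list of datapackage.json (as the indentation of the fragment suggests); `expected_delimiter` and the example package are hypothetical.

```python
import json


def expected_delimiter(datapackage: dict, resource_index: int = 0, default: str = ",") -> str:
    """Look up the delimiter declared in a resource's "dialect" block.

    Falls back to `default` when the resource or the dialect is missing.
    """
    resources = datapackage.get("resources", [])
    if resource_index >= len(resources):
        return default
    dialect = resources[resource_index].get("dialect", {})
    return dialect.get("delimiter", default)


# Hypothetical, shortened datapackage.json content for illustration only.
example = json.loads("""
{
  "resources": [
    {"name": "example", "dialect": {"delimiter": ";", "header": true}}
  ]
}
""")
print(expected_delimiter(example))
```

The returned value can then be compared with the separator actually found in the uploaded CSV before validation starts.
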
AFoletti commented 8 months ago

...for some reason the OGD21 CSVs return validation errors, but they look OK to me.

aresssera commented 8 months ago

...for some reason the OGD21 CSVs return validation errors, but they look OK to me.

The behaviour is really strange. I ran some tests with other files that contain integer columns. For example, in OGD105 the column 'Month' has type int, but when you access the first element of that column, it is reported as float.

I assumed that changing all column types to float would make the file validate. To test this, I changed the type of a single column, expecting only that this column would no longer be mentioned in the error message. Surprisingly, that one change made the whole file valid.

AFoletti commented 8 months ago

Would it be possible to run a quick test with pure Frictionless? I mean: skip the pandas/numpy round trip and validate OGD21 with pure Frictionless and the full datapackage.json. I clearly remember the same OGD21 passing my early version of the tests with exactly the same datapackage. My guess is that reading the file and storing it temporarily introduces some sort of issue.

aresssera commented 8 months ago

I tested data_t1.csv using the code from main.py (I had to upload the file to the staging folder manually). It is valid.