biglocalnews / warn-transformer

Consolidate, enrich and republish the data gathered by warn-scraper
https://warn-transformer.readthedocs.io
Apache License 2.0

Automated QA checks #236

Open stucka opened 6 months ago

stucka commented 6 months ago

@Kirkman realized scrapers can begin failing and produce empty CSVs, but there's no process in place to flag those failures. See https://github.com/biglocalnews/warn-scraper/issues/598

As I understand it, warn-transformer's consolidate step pulls everything together from historical data and new scrapes and then eliminates duplicates. That's the only point at which we have all the new scrapes in hand, so it's where we can see whether new scrape files are empty or have fewer entries than the historical data that's available.

It might be easy enough to build in an error/alert here that doesn't stop the rest of the transform from working but does send a message through the internal BLN ETL alerts channel -- likely by mimicking what's in the GitHub workflow, though I think that requires a logger.error or something similar to actually get triggered.
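A non-fatal check inside consolidate might look something like this sketch -- the helper name and arguments are hypothetical, just to show the shape of log-and-continue:

```python
import logging

logger = logging.getLogger(__name__)


def check_scrape_counts(source, new_rows, historical_rows):
    """Flag a suspicious scrape without halting the transform.

    Hypothetical helper: `source` is a state slug, and the two
    arguments are the row lists seen during consolidation.
    """
    if len(new_rows) == 0:
        # logger.error shows up in the GitHub Actions log and could be
        # what the alerts workflow keys off of.
        logger.error("%s: new scrape is empty", source)
    elif len(new_rows) < len(historical_rows):
        # A shrinking count isn't always wrong (states do pull
        # notices), so warn rather than error here.
        logger.warning(
            "%s: new scrape has %d rows vs. %d historical",
            source,
            len(new_rows),
            len(historical_rows),
        )
```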

It's also possible there are cases in which counts should shrink -- e.g., a state decides a notice in the system isn't actually a WARN notice but a non-WARN layoff. I think we saw that in Maine early on, where a WARN notice disappeared. Once recorded by warn-transformer, those records aren't coming back, so the count of missing entries will grow.

There will also be cases where a state takes down a previous year's data, and the scraper will have less to work with.

So ... where to go? CSVs with only a header row are never going to be correct. That's a bare minimum for flagging, but building narrowly against that one case might make it harder to implement more in-depth QA work later.
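For that bare-minimum case, something like this could flag header-only CSVs (the directory layout here is assumed, not taken from the repo):

```python
import csv
from pathlib import Path


def is_header_only(path):
    """Return True if a CSV has a header row but no data rows."""
    with open(path, newline="", encoding="utf-8") as infile:
        reader = csv.reader(infile)
        next(reader, None)  # skip the header
        return next(reader, None) is None


# Hypothetical layout: one CSV per source in an exports directory.
for csv_path in sorted(Path("exports").glob("*.csv")):
    if is_header_only(csv_path):
        print(f"FLAG: {csv_path.name} has no data rows")
```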

stucka commented 6 months ago

I think this can be easily detected in consolidate. Will logger.warning make it into GitHub Actions logs? Should it be integrated with the existing alerts workflow?

https://github.com/biglocalnews/warn-transformer/blob/82c55f94947410245cd8649112f33da10fdb85fb/warn_transformer/consolidate.py#L41

stucka commented 6 months ago

It ... cannot be easily detected in consolidate. Logging has been improved, though.

chriszs commented 3 months ago

So one approach would be to throw an error, failing the transformation for that state and creating a failed status that could be reported. Granted, that might obscure an otherwise successful run.

Okay, so how to do reporting on data quality without halting or logging ineffectually? I'm looking a little bit at Great Expectations, which seems to have gotten very enterprise-y and been rebranded "GX OSS" to clear the way for a parallel SaaS business, but which might still be the right general tool for this sort of thing.

The Great Expectations way of doing this seems far more complicated than the nice one-line if statement test you've got there, but it does seem to have ways of building data docs with validation results and configurable alerting. I wonder what other things it could be used to test for.

As you say, testing for this one case may be a different thing than the general case of QA checks.

chriszs commented 3 months ago

Did some exploration using Great Expectations in #252, creating a check that looks at each raw file and verifies the row count is three or greater.
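For reference, the shape of that check with the classic pandas-backed Great Expectations API is roughly this (the file path is a placeholder; #252 has the actual implementation):

```python
import great_expectations as ge

# Wrap a raw scrape in a pandas-backed Great Expectations dataset.
df = ge.read_csv("exports/ia.csv")  # placeholder path

result = df.expect_table_row_count_to_be_between(min_value=3)
if not result.success:
    print("Row count check failed:", result.result)
```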

After this, I tend to agree with the thrust of the Reddit post I found headlined "Great Expectations is annoyingly cumbersome" (the Dickens novel doesn't appear to be well-loved either). Ah well, I had such high hopes, but maybe SaaS ruins everything. On the other hand, maybe it's not so bad once you learn the concepts and get past the initial setup.

This doesn't do exactly what the current if check does, because it looks at each raw data file before transformation; we could also look at the data after consolidation and build some sort of list of sources we expect to see in there, as sketched below.
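A post-consolidation version might look something like this, with the expected-source list, file path, and column name all assumed rather than taken from the repo:

```python
import csv

# Hypothetical list of source slugs we expect after consolidation.
EXPECTED_SOURCES = {"ia", "me", "ny"}

with open("processed/consolidated.csv", newline="", encoding="utf-8") as infile:
    # "source" is an assumed column name, not the real schema.
    seen = {row["source"].lower() for row in csv.DictReader(infile)}

missing = EXPECTED_SOURCES - seen
if missing:
    print("Sources missing from consolidated output:", sorted(missing))
```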

chriszs commented 3 months ago

One of the things this might help address: the case where a state's data or a scraper's output loses quality over time, or runs into issues on as-yet-unseen documents. It'd be nice to have some row- and/or column-level expectations set up that could flag that.
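A couple of column-level expectations of the kind that might catch a degrading scraper, again using the classic API and with column names as assumptions:

```python
import great_expectations as ge

df = ge.read_csv("exports/ia.csv")  # placeholder path

# Column names here are assumptions, not the real schema.
checks = [
    df.expect_column_values_to_not_be_null("company"),
    df.expect_column_values_to_match_strftime_format("date", "%Y-%m-%d"),
]

for check in checks:
    if not check.success:
        print("Failed:", check.expectation_config.expectation_type)
```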