ExposuresProvider / cam-pipeline

Data loading pipeline for CAM database
https://exposuresprovider.github.io/cam-pipeline/
MIT License

Add a post-data report generation step to cam-pipeline #151

Open gaurav opened 1 week ago

gaurav commented 1 week ago

This would run after kg.tsv has been generated and produce a report so we know the file was generated correctly. At its simplest, it could check that the number of rows is approximately 11,336,863 (the count from the last generation).
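As a rough illustration of the simplest check, here is a minimal sketch in Python (not the Scala/ZStream version proposed in this issue). The 5% tolerance and the function names are assumptions made up for the example, and the count includes a header line if kg.tsv has one.

```python
EXPECTED_ROWS = 11_336_863  # row count from the last generation, per this issue
TOLERANCE = 0.05            # hypothetical: accept a 5% drift in either direction


def count_rows(path):
    """Count the lines in a file by streaming it, so it is never fully loaded."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def row_count_ok(actual, expected=EXPECTED_ROWS, tolerance=TOLERANCE):
    """Return True when `actual` is within `tolerance` (a fraction) of `expected`."""
    return abs(actual - expected) <= expected * tolerance
```

A report step could then call `row_count_ok(count_rows("kg.tsv"))` and fail the build when it returns False.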

Some other stats that might be useful to track:

The main use of this report would be to make sure that we don't make a change that gets rid of a particular type of edge. Once we add qualifiers (#145), we could add a qualifier report as well to see how much detail we're adding.
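The "don't lose an edge type" check could look something like the sketch below, again in Python rather than Scala. It assumes kg.tsv has a header row with a column named `predicate`; that column name is a guess and not confirmed anywhere in this issue, and the function names are hypothetical.

```python
import csv
from collections import Counter


def count_by_column(path, column="predicate"):
    """Count rows per distinct value of `column` in a tab-separated file.

    Assumes the file starts with a header row; `predicate` is a hypothetical
    column name and should be replaced with the real one.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return Counter(row[column] for row in reader)


def missing_edge_types(previous, current):
    """Return the edge types present in `previous` but absent from `current`."""
    return sorted(set(previous) - set(current))
```

Comparing the counts from the previous run against the current run would flag any edge type that has disappeared; the per-type counts themselves could go straight into the report.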

We could implement this as a Scala script; it should be straightforward to write with ZStream.

balhoff commented 1 week ago

Some of these things might be most efficient to calculate in the souffle script.