azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark
Apache License 2.0

Fix total edit count reporting in ChangesetStatsForeachWriter #184

Closed by CloudNiner 4 years ago

CloudNiner commented 4 years ago

This should fix total_edits ending up null whenever either operand of the sum is null during a row write.

Notes

Both coalesces are required: a new changeset can have no interesting counts, and so can the previously summed changesets.
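
To illustrate why each side of the sum needs its own coalesce, here is a minimal sketch using an in-memory SQLite database rather than the Postgres database the writer targets. The table layout and the UPDATE statements are assumptions for illustration only and are not the SQL used in ChangesetStatsForeachWriter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; the real changesets table has more columns.
conn.execute("CREATE TABLE changesets (id INTEGER PRIMARY KEY, total_edits INTEGER)")
conn.executemany(
    "INSERT INTO changesets (id, total_edits) VALUES (?, ?)",
    [(1, None), (2, None), (3, 7)],
)

# Row 1: unguarded sum. NULL + 5 evaluates to NULL, so the total stays null.
conn.execute("UPDATE changesets SET total_edits = total_edits + ? WHERE id = 1", (5,))

# Row 2: the stored total is NULL, so the coalesce on the old value is needed.
conn.execute(
    "UPDATE changesets SET total_edits = coalesce(total_edits, 0) + coalesce(?, 0) WHERE id = 2",
    (5,),
)

# Row 3: the incoming count is NULL, so the coalesce on the new value is needed.
conn.execute(
    "UPDATE changesets SET total_edits = coalesce(total_edits, 0) + coalesce(?, 0) WHERE id = 3",
    (None,),
)

def total_for(changeset_id):
    """Return the stored total_edits for one changeset id."""
    row = conn.execute(
        "SELECT total_edits FROM changesets WHERE id = ?", (changeset_id,)
    ).fetchone()
    return row[0]

naive_total = total_for(1)
guarded_old_null = total_for(2)
guarded_new_null = total_for(3)
print(naive_total, guarded_old_null, guarded_new_null)
```

Dropping either coalesce would bring back the null result for one of the two guarded cases.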

I updated the batch-process.sh script to use a JSON configuration for its Spark conf. This should make it easier to adjust in the future. I also switched to static instance types and EBS configuration, since this job needs a particular combination of CPU/Mem/Disk resources to run successfully.
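
The JSON file itself is not shown in this thread. As a rough sketch, and assuming the script launches an EMR cluster and passes Spark settings in EMR's configuration-object format (a list of objects with `Classification` and `Properties` keys), a file of that shape could be generated as below. The helper name and the property values are placeholders, not the settings used by batch-process.sh.

```python
import json

def spark_defaults_configuration(properties):
    """Wrap Spark properties in an EMR-style 'spark-defaults' configuration object.

    Values are converted to strings, since EMR expects string property values.
    """
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {key: str(value) for key, value in properties.items()},
        }
    ]

# Placeholder values for illustration; the real job's settings are not in this PR text.
config = spark_defaults_configuration(
    {
        "spark.executor.memory": "8g",
        "spark.executor.cores": 4,
    }
)

config_json = json.dumps(config, indent=2)
print(config_json)
```

Keeping the settings in one JSON document means a tuning change touches a data file rather than the command line assembled inside the shell script.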

There remain count discrepancies between the production and staging databases. I opened a separate epic to continue investigating this (#186), as those incorrect values are outside the scope of this specific fix.

Testing

To verify this fix, I compared the total_edits in user_statistics with the counts currently on production. While they remain low compared to production, they are much closer than they were before. In addition, I ran a query to determine if there are any changesets that have values in the jsonb counts field but where total_edits remains null. There are none:

osmesa_stats_staging=> select count(*) from changesets where total_edits is null and counts is not null;
 count 
-------
     0
(1 row)
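
The same check could be wrapped in a small script so it can be rerun after future batch jobs. The sketch below is an assumption-laden stand-in: it uses in-memory SQLite with a TEXT column in place of the Postgres jsonb column, and the count keys are made up.

```python
import sqlite3

# Same predicate as the query above.
CHECK_QUERY = (
    "SELECT count(*) FROM changesets "
    "WHERE total_edits IS NULL AND counts IS NOT NULL"
)

def count_missing_totals(conn):
    """Number of changesets that have counts but no total_edits."""
    return conn.execute(CHECK_QUERY).fetchone()[0]

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema; counts is stored as JSON text here.
conn.execute(
    "CREATE TABLE changesets (id INTEGER PRIMARY KEY, counts TEXT, total_edits INTEGER)"
)
conn.executemany(
    "INSERT INTO changesets (id, counts, total_edits) VALUES (?, ?, ?)",
    [
        (1, '{"roads": 3}', 3),  # counts present, total present
        (2, None, None),         # nothing interesting, total legitimately null
    ],
)
clean = count_missing_totals(conn)

# Simulate the bug: counts are present but the total was written as null.
conn.execute(
    "INSERT INTO changesets (id, counts, total_edits) VALUES (?, ?, ?)",
    (3, '{"buildings": 2}', None),
)
regressed = count_missing_totals(conn)
print(clean, regressed)
```

A nonzero result would indicate that the null-propagation problem has reappeared.
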

CloudNiner commented 4 years ago

@jpolchlo I made some changes around whitespace and the env var handling for instance type. Just bumping in case you want to take one more look, even though it was approved earlier.