azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark

Address NULL user_id entries #208

Closed jpolchlo closed 3 years ago

jpolchlo commented 3 years ago

After a bulk ingest, we noticed changesets that lacked metadata: changesets with non-zero total edits had NULL user IDs, so the affected users' stats were not being counted accurately. Some of these mis-associations arose from an incomplete metadata source, but others were present in the metadata ORC and were still missing from the final result. This implies that some records were not being committed to the database correctly.

This PR solves the database commit problem by wrapping our DB writes in JDBC transactions, so that results can be checked for completeness and, in case of trouble, rolled back and reattempted. By all accounts, this was successful.

This PR does not fix the problem of missing metadata in the changeset ORC file.

I also adjusted the batch-process.sh script to allow multiple steps to be added. This was necessary because a phantom EMR bug appeared (and remains unsolved) that prevents the Spark job from starting properly when S3 credentials are not yet available. We might investigate adding a sleep step to give the credentials time to materialize. Relaunching the initial steps was generally sufficient, but it means that the cluster no longer auto-terminates.
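
As a rough illustration of the sleep step idea, here is a minimal sketch of a polling wait that could run as an initial step before the Spark job. It is not part of this PR. The `wait_until` helper and its arguments are hypothetical, and using `aws sts get-caller-identity` as the readiness check is an assumption: it only confirms that the AWS CLI can resolve credentials, which may not exactly match what the Spark job needs for S3 access.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in batch-process.sh): run a check command
# repeatedly until it succeeds or the attempt budget is exhausted.
#   $1     maximum number of attempts
#   $2     seconds to sleep between attempts
#   $3...  the check command and its arguments
wait_until() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  # Re-run the check, discarding its output, until it exits with 0.
  while ! "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Assumed usage as an early step (commented out so the sketch has no side effects):
# wait_until 30 10 aws sts get-caller-identity || exit 1
```

Polling with a bounded number of attempts, rather than a single fixed sleep, lets the step finish as soon as credentials appear and still fail loudly if they never do, which might help preserve auto-termination.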