danrademacher opened this issue 2 years ago
@kamicut @willemarcel - would be great to get your eyes and thoughts on this -- what would be a good open source tool, relatively easy to set up, that can read from the OHM planet files or the database, dump out some of these stats, and keep them updated?
cc @geohacker
@batpad We've already written a CLI tool for this based on the planet files on AWS! https://github.com/developmentseed/little-owl
You'd need to have the same setup to import the OHM planet files into ORC files https://medium.com/@r.ragupathy85/orc-files-in-openstreetmap-9c2a7954e4bc and then put them in AWS Athena for easy querying.
However if those are the only two queries you're interested in, you could probably grab them from the Postgres DB directly using an index, no?
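For concreteness, a minimal sketch of what "grab them from the Postgres DB directly" could look like, assuming the standard openstreetmap-website apidb schema (a `changesets` table with a `user_id` column); the connection string is a placeholder:

```python
# Minimal sketch, assuming the standard openstreetmap-website apidb schema
# (a `changesets` table with a `user_id` column). Connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=openstreetmap host=localhost")
with conn, conn.cursor() as cur:
    # Two example stats that an index on changesets(user_id) should answer cheaply.
    cur.execute("SELECT count(*) FROM changesets")
    (total_changesets,) = cur.fetchone()
    cur.execute("SELECT count(DISTINCT user_id) FROM changesets")
    (contributors,) = cur.fetchone()
print(f"changesets: {total_changesets}, contributors: {contributors}")
```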
Yea, I imagine that list of questions is going to grow though - even if we start by building something that just answers these two queries, we'd still need some scaffolding to display them at a URL, and the list of questions will inevitably grow. If there's something existing for OSM that we could use, it seems like a much better thing long-term.
If I remember correctly, this setup did not have a good way to "update" the ORC file with minutely changes. So we'd need to run a job that, at some cadence, converts our planet file to ORC and uploads it to S3, and then we'd be able to query it using little-owl?
I really like this idea - having OHM changeset history and the planet queryable via Athena opens up many use cases and makes both "simple" things like the questions above and more complex things queryable in a way that scales, does not put pressure on our Postgres DB, and will benefit from any other tools people create around OSM on Athena. I think my biggest concern is the infrastructure to process the planet file into ORC as a job, put it on S3, etc. Let's chat about this and see if we can put together a quick estimate of what this would involve.
Correct, there is no good way to update minutely without massive infrastructure. The Athena files for OSM are updated weekly.
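For reference, if we did go the ORC + Athena route, a scheduled job could run queries roughly like this; the `ohm` database, `planet` table, and results bucket below are assumptions, not existing infrastructure:

```python
# Sketch of querying an ORC-backed planet table in Athena. The `ohm` database,
# `planet` table, and results bucket are assumptions, not existing infrastructure.
import boto3

athena = boto3.client("athena", region_name="us-east-1")
resp = athena.start_query_execution(
    QueryString="SELECT count(*) AS node_count FROM planet WHERE type = 'node'",
    QueryExecutionContext={"Database": "ohm"},
    ResultConfiguration={"OutputLocation": "s3://ohm-athena-results/"},
)
# Poll get_query_execution() until the query finishes, then fetch rows with
# get_query_results(QueryExecutionId=resp["QueryExecutionId"]).
print(resp["QueryExecutionId"])
```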
@Rub21 @geohacker - do we know what code powers https://planet.openstreetmap.org/statistics/data_stats.html ? It seems like if we could get a page like that up for OHM, it should give us what we need for now (cc @jeffreyameyer ?) and I assume the code to generate that is somewhere?
Would this be enough for the moment? https://openhistoricalmap.org/stats.html
The basic stats come from running a script in the Rails code. I ran this manually for now by going inside the production website pod and running ./scripts/statistics > public/stats.html
I don't think we should consider a workflow that involves ORC, because there's no easy way to update it and it adds a lot of infrastructure we have to pay for. I think for OHM's needs, we should be able to run a bunch of basic osmosis queries and export the results on a regular basis.
@geohacker oh this sounds amazing! Somehow https://openhistoricalmap.org/stats.html is not working for me - are you sure that's the correct URL?
Ah, I see we have two web containers running these days - so the script ran on one and not the other - just running it manually on the other container now to see what this looks like.
This sounds great to me - we'll just need to figure out the best way to keep this updated - maybe run it as a cron in the container / or a separate container that runs it as a CronJob resource and uploads to s3 or so?
I think adding a cron would be fine?
Discussion today:
@batpad and @geohacker to sketch out tech approach
I think something basic to start:
The SQL should read from the DB schema, ideally the same one used by the Rails app. Each SQL file produces a CSV, and the script can push the results to S3.
@batpad @Rub21 thoughts?
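To make that concrete, a rough sketch of the kind of script being described (the sql/ directory, bucket name, and connection string are placeholders, not what we actually run):

```python
# Rough sketch of the proposal above: run each .sql file against the Rails DB,
# write one CSV per query, and push the CSVs to S3. Paths, bucket name, and
# connection string are placeholders.
import csv
import pathlib

import boto3
import psycopg2

BUCKET = "planet-staging.openhistoricalmap.org"  # placeholder bucket name
conn = psycopg2.connect("dbname=openstreetmap")
s3 = boto3.client("s3")

for sql_file in sorted(pathlib.Path("sql").glob("*.sql")):
    with conn.cursor() as cur:
        cur.execute(sql_file.read_text())
        header = [col.name for col in cur.description]
        rows = cur.fetchall()
    out = sql_file.with_suffix(".csv")
    with out.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    s3.upload_file(str(out), BUCKET, out.name)
```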
Scheduled for April 3 sprint. Ideas for the stats we want:
Over time:
Should take about a week to produce the data on S3 as JSON or CSV
@kamicut is getting started on this now https://github.com/developmentseed/osm-seed/issues/290
@danrademacher @batpad, I set up the cronjob for osm-simple-stats; it is generating the stats for staging: http://planet-staging.openhistoricalmap.org/
Currently it is storing the data as CSV in the bucket where we are sharing the other replication files.
This is super, thanks @Rub21 !
@kamicut @danrademacher how should we work out next steps here? Does it help for @kamicut to work on a simple Observable / Jupyter notebook that reads the CSV files and generates some basic visualizations, to help get a better sense of how to use these CSV files? Or should we schedule a time to chat to figure out next steps?
@kamicut can you check http://planet-staging.openhistoricalmap.org/ and see if the output files all seem okay to you and we are not missing anything?
@Rub21 -- we probably need to make sure we have CORS enabled on the s3 bucket - if there's an easy way to check, that would be great, otherwise we can deal with it if we run into issues.
An Observable or Jupyter notebook would be great! I think Jeff wanted that as a demo of how others can use this regardless of what we build, so that's a solid next step
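As a starting point for that notebook, something like the following should work against the staging bucket; the column names are a guess, since I haven't inspected the files:

```python
# Quick notebook-style sketch: read one of the staging CSVs and plot it.
# The column names ("date", "nodes", etc.) are guesses -- adjust to whatever
# headers osm-simple-stats actually writes.
import pandas as pd
import matplotlib.pyplot as plt

url = "http://planet-staging.openhistoricalmap.org.s3.amazonaws.com/cumulative_elements.csv"
df = pd.read_csv(url, parse_dates=["date"])
df.plot(x="date", y=["nodes", "ways", "relations"], figsize=(10, 5),
        title="OHM cumulative elements over time")
plt.show()
```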
I have just checked the CORS configuration on the buckets; requests should work fine, with no issues.
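For reference, the check (and, if needed, the fix) can be scripted; the CORS rule below is an assumption about what we would want, not a dump of the current configuration:

```python
# Sketch: inspect the bucket's CORS rules, and apply a permissive GET rule if
# none exist. The rule itself is an assumption about what we want, not a
# record of the current configuration.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "planet-staging.openhistoricalmap.org"

try:
    print(s3.get_bucket_cors(Bucket=bucket)["CORSRules"])
except ClientError as err:
    if err.response["Error"]["Code"] != "NoSuchCORSConfiguration":
        raise
    s3.put_bucket_cors(
        Bucket=bucket,
        CORSConfiguration={
            "CORSRules": [{"AllowedMethods": ["GET"], "AllowedOrigins": ["*"]}]
        },
    )
```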
@Rub21 now that you deployed these stats to production, what are the new URLs for the CSVs?
We have a URL like this on staging: http://planet-staging.openhistoricalmap.org.s3.amazonaws.com/cumulative_elements.csv
Since production Planet replication is at links like this: http://planet.openhistoricalmap.org.s3.amazonaws.com/planet/planet-200514_1030.osm.pbf
I assumed for stats we'd have: http://planet.openhistoricalmap.org.s3.amazonaws.com/cumulative_elements.csv
But that's an error page.
I see those links are working now, and I am pulling them into this Google Sheet.
I know @kamicut is working on an Observable notebook version, which is pending resolution of https://github.com/OpenHistoricalMap/issues/issues/536 to be able to pull in production data.
An interesting fact about our production data -- a single user, flha127, mapping roads in France, accounts for 20% of all changesets!
And the top 10 editors account for 43% of all changesets. And the top 50 users account for 73% of all changesets. Not surprising given where OHM is at in its evolution, but certainly notable.
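For anyone who wants to reproduce those percentages, it's just a cumulative share over users sorted by changeset count. A sketch, assuming a hypothetical per-user CSV with `user` and `changesets` columns (not one of the currently published files):

```python
# Reproduce the "top N users account for X% of changesets" figures from a
# hypothetical per-user CSV with `user` and `changesets` columns.
import pandas as pd

df = pd.read_csv("changesets_by_user.csv")  # hypothetical input file
df = df.sort_values("changesets", ascending=False).reset_index(drop=True)
share = df["changesets"].cumsum() / df["changesets"].sum()
for n in (1, 10, 50):
    print(f"top {n} users: {share.iloc[n - 1]:.0%} of all changesets")
```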
Now that HTTPS is working on planet.openhistoricalmap.org, does the Observable still need to be updated to pull from the real production data?
I am working on some reprioritization of our backlog, and curious if we should put this back on top of our list, or if @1ec5's nice additions to the Wiki Page, https://wiki.openstreetmap.org/wiki/OpenHistoricalMap#OHM_Web_Services, serve some of this need?
Or if we want to add to that with these granular stats, should we do it in a way that can be embedded into the Wiki?
The statistics on the wiki only include a single aggregate number based on this file, so it isn’t quite the same thing as having a leaderboard or something integrated into a user profile. As far as that single aggregate number, though, putting something on the wiki does create a historical record. It’s just a lot more manual than it should be. It would be nice if we could keep archives of these statistics as we do with older planet files.
Discussion at GreenInfo:
Key views:
Placing this wiki stats page here as a reference, and as mentioned by @1ec5 two comments above: https://wiki.openstreetmap.org/wiki/OpenHistoricalMap/Statistics
We just put together an initial content outline in this tab of our data sheet: https://docs.google.com/spreadsheets/d/1agos5gTR8PEo3pwiUhxFEbVWHPFDhXdLb8vUJar7LEI/edit#gid=438203943
Tim to work on a design this week.
We also discussed the possibility of using the vector tiles as we have them with an alternative style that uses the feature ID as a proxy for recency (higher integer ID = more recent). We think this is worth some experimentation since it would require no changes on the data side. If it does not work, we could also look into other options for map views of changeset boundaries, or just not have a map and let folks use History or OSMCha for that (though those are not designed to give an overview beyond the most recent changes).
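To sketch what that alternative style could look like, here is one possible layer definition, written as a Python dict that would be dumped into the style JSON; the source name, source-layer, and ID range are placeholders, not our real tile schema:

```python
# Sketch of an "age by ID" style layer for the existing vector tiles: color
# each feature by its numeric ID as a proxy for recency. The source name,
# source-layer, and ID range below are placeholders.
import json

recency_layer = {
    "id": "nodes-by-recency",
    "type": "circle",
    "source": "ohm",                # placeholder source name
    "source-layer": "points",       # placeholder source-layer
    "paint": {
        "circle-radius": 2,
        "circle-color": [
            "interpolate", ["linear"], ["id"],
            0, "#ffeda0",           # lowest (oldest) IDs: yellow
            2000000000, "#f03b20",  # highest (newest) IDs: red
        ],
    },
}
print(json.dumps(recency_layer, indent=2))
```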
We're finally getting to this!
@tsinn put together a mockup:
And then we also experimented a bit with using Kepler.gl to visualize all 90,000 nodes in OHM, using the item ID as a proxy for recency.
Here's what that looks like (red is more recent, yellow less recent):
Pretty cool! This achieved two things:
Note that for the purposes of testing, I think we're using only the nodes that we publish as nodes on their own (so not nodes that are part of ways).
If we wanted to make an interactive map that visualized every node in the database using the object ID as a way to show newer vs older, but also with some time filtering like shown in the mockup, we could do something like this:
Dump a CSV of every node with columns lat,lng,id (and possibly year, for the time filtering). Maybe that gets regenerated weekly or monthly or daily.
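A sketch of that periodic dump, assuming the standard apidb `current_nodes` table (where latitude and longitude are stored as integers scaled by 1e7); the connection string and output path are placeholders:

```python
# Sketch of the periodic dump described above: write every current node to a
# CSV of lat,lng,id. Assumes the standard apidb `current_nodes` table, where
# latitude/longitude are stored as integers scaled by 1e7.
import csv

import psycopg2

conn = psycopg2.connect("dbname=openstreetmap")  # placeholder connection
with conn, conn.cursor(name="nodes") as cur, open("nodes.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["lat", "lng", "id"])
    # Server-side cursor so we stream rows instead of loading them all at once.
    cur.execute(
        "SELECT latitude / 1e7, longitude / 1e7, id FROM current_nodes WHERE visible"
    )
    for row in cur:
        writer.writerow(row)
```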
Jeff was asking about the easiest way to get stats for:
We discussed OSMCha setup, see https://osmcha.org/ and https://github.com/developmentseed/scoreboard
Or just write tools to dump out stats?
Sanjay will poll internally at DevSeed to see if there are other options
See #306