danrademacher opened this issue 2 years ago
@kamicut @willemarcel - would be great to get your eyes and thoughts on this -- what would be a good open source tool, relatively easy to set up, that can read from the OHM planet files or the database, dump out some of these stats, and keep them updated?
cc @geohacker
@batpad We've already written a CLI tool for this based on the planet files on AWS! https://github.com/developmentseed/little-owl
You'd need to have the same setup to import the OHM planet files into ORC files https://medium.com/@r.ragupathy85/orc-files-in-openstreetmap-9c2a7954e4bc and then put them in AWS Athena for easy querying.
However if those are the only two queries you're interested in, you could probably grab them from the Postgres DB directly using an index, no?
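For concreteness, a minimal sketch of what "grab them from the Postgres DB directly" could look like, assuming the standard openstreetmap-website apidb schema (a `changesets` table with a `user_id` column); the connection string is a placeholder:

```python
# Minimal sketch, assuming the standard openstreetmap-website apidb schema
# (a `changesets` table with a `user_id` column). Connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=openstreetmap host=localhost")
with conn, conn.cursor() as cur:
    # Two example stats that an index on changesets(user_id) should answer cheaply.
    cur.execute("SELECT count(*) FROM changesets")
    (total_changesets,) = cur.fetchone()
    cur.execute("SELECT count(DISTINCT user_id) FROM changesets")
    (contributors,) = cur.fetchone()
print(f"changesets: {total_changesets}, contributors: {contributors}")
```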
Yea, I imagine that list of questions is going to grow though - even if we start by building something that just answers these two queries, we'd still need some scaffolding to display them at a URL, and the list of questions will inevitably grow. If there's something existing for OSM that we could use, it seems like a much better thing long-term.
If I remember correctly, this setup did not have a good way to "update" the ORC file with minutely changes. So we'd need to run a job that, at some cadence, converts our planet file to ORC and uploads it to S3, and then we'd be able to query it using little-owl?
I really like this idea - having OHM changeset history and the planet queryable via Athena opens up many use cases and makes both "simple" things like the questions above and more complex things queryable in a way that scales, does not put pressure on our Postgres DB, and will benefit from any other tools people create around OSM on Athena. I think my biggest concern is the infrastructure to process the planet file into ORC as a job, put it on S3, etc. Let's chat about this and see if we can put together a quick estimate of what this would involve.
Correct, there is no good way to update minutely without massive infrastructure. The Athena files for OSM are updated weekly.
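For reference, if we did go the ORC + Athena route, a scheduled job could run queries roughly like this; the `ohm` database, `planet` table, and results bucket below are assumptions, not existing infrastructure:

```python
# Sketch of querying an ORC-backed planet table in Athena. The `ohm` database,
# `planet` table, and results bucket are assumptions, not existing infrastructure.
import boto3

athena = boto3.client("athena", region_name="us-east-1")
resp = athena.start_query_execution(
    QueryString="SELECT count(*) AS node_count FROM planet WHERE type = 'node'",
    QueryExecutionContext={"Database": "ohm"},
    ResultConfiguration={"OutputLocation": "s3://ohm-athena-results/"},
)
# Poll get_query_execution() until the query finishes, then fetch rows with
# get_query_results(QueryExecutionId=resp["QueryExecutionId"]).
print(resp["QueryExecutionId"])
```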
@Rub21 @geohacker - do we know what code powers https://planet.openstreetmap.org/statistics/data_stats.html ? It seems like if we could get a page like that up for OHM, it should give us what we need for now (cc @jeffreyameyer ?) and I assume the code to generate that is somewhere?
Would this be enough for the moment? https://openhistoricalmap.org/stats.html
The basic stats come from running a script in the Rails code. I ran this manually for now by going inside the production website pod and running ./scripts/statistics > public/stats.html
I don't think we should consider a workflow that involves ORC, because there's no easy way to update it and it adds a lot of infrastructure we have to pay for. I think for OHM's needs, we should be able to run a bunch of basic osmosis queries and export the results on a regular basis.
@geohacker oh this sounds amazing! Somehow https://openhistoricalmap.org/stats.html is not working for me - are you sure that's the correct URL?
Ah, I see we have two web containers running these days - so the script ran on one and not the other - just running it manually on the other container now to see what this looks like.
This sounds great to me - we'll just need to figure out the best way to keep this updated - maybe run it as a cron in the container / or a separate container that runs it as a CronJob resource and uploads to s3 or so?
I think adding a cron would be fine?
Discussion today:
@batpad and @geohacker to sketch out tech approach
I think something basic to start:
The SQL should read from the DB schema, ideally the same one used by the Rails app. Each SQL file produces a CSV, and the script can push the results to S3.
@batpad @Rub21 thoughts?
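To make that concrete, a rough sketch of the kind of script being described (the sql/ directory, bucket name, and connection string are placeholders, not what we actually run):

```python
# Rough sketch of the proposal above: run each .sql file against the Rails DB,
# write one CSV per query, and push the CSVs to S3. Paths, bucket name, and
# connection string are placeholders.
import csv
import pathlib

import boto3
import psycopg2

BUCKET = "planet-staging.openhistoricalmap.org"  # placeholder bucket name
conn = psycopg2.connect("dbname=openstreetmap")
s3 = boto3.client("s3")

for sql_file in sorted(pathlib.Path("sql").glob("*.sql")):
    with conn.cursor() as cur:
        cur.execute(sql_file.read_text())
        header = [col.name for col in cur.description]
        rows = cur.fetchall()
    out = sql_file.with_suffix(".csv")
    with out.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    s3.upload_file(str(out), BUCKET, out.name)
```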
Scheduled for April 3 sprint. Ideas for the stats we want:
Over time:
Should take about a week to produce the data on S3 as JSON or CSV
@kamicut is getting started on this now https://github.com/developmentseed/osm-seed/issues/290
@danrademacher @batpad, I set up the cronjob for osm-simple-stats; it is generating the stats for staging: http://planet-staging.openhistoricalmap.org/
Currently it is storing the data as CSV in the bucket where we are sharing the other replication files.
This is super, thanks @Rub21 !
@kamicut @danrademacher how should we work out next steps here? Does it help for @kamicut to work on a simple Observable / Jupyter notebook that reads the CSV files and generates some basic visualizations, to help get a better sense of how to use these CSV files? Or should we schedule a time to chat to figure out next steps?
@kamicut can you check http://planet-staging.openhistoricalmap.org/ and see if the output files all seem okay to you and we are not missing anything?
@Rub21 -- we probably need to make sure we have CORS enabled on the s3 bucket - if there's an easy way to check, that would be great, otherwise we can deal with it if we run into issues.
An Observable or Jupyter notebook would be great! I think Jeff wanted that as a demo of how others can use this regardless of what we build, so that's a solid next step
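As a starting point for that notebook, something like the following should work against the staging bucket; the column names are a guess, since I haven't inspected the files:

```python
# Quick notebook-style sketch: read one of the staging CSVs and plot it.
# The column names ("date", "nodes", etc.) are guesses -- adjust to whatever
# headers osm-simple-stats actually writes.
import pandas as pd
import matplotlib.pyplot as plt

url = "http://planet-staging.openhistoricalmap.org.s3.amazonaws.com/cumulative_elements.csv"
df = pd.read_csv(url, parse_dates=["date"])
df.plot(x="date", y=["nodes", "ways", "relations"], figsize=(10, 5),
        title="OHM cumulative elements over time")
plt.show()
```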
I have just checked the CORS configuration on the buckets; requests should work fine, with no issues.
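For reference, the check (and, if needed, the fix) can be scripted; the CORS rule below is an assumption about what we would want, not a dump of the current configuration:

```python
# Sketch: inspect the bucket's CORS rules, and apply a permissive GET rule if
# none exist. The rule itself is an assumption about what we want, not a
# record of the current configuration.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "planet-staging.openhistoricalmap.org"

try:
    print(s3.get_bucket_cors(Bucket=bucket)["CORSRules"])
except ClientError as err:
    if err.response["Error"]["Code"] != "NoSuchCORSConfiguration":
        raise
    s3.put_bucket_cors(
        Bucket=bucket,
        CORSConfiguration={
            "CORSRules": [{"AllowedMethods": ["GET"], "AllowedOrigins": ["*"]}]
        },
    )
```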
@Rub21 now that you deployed these stats to production, what are the new URLs for the CSVs?
We have a URL like this on staging: http://planet-staging.openhistoricalmap.org.s3.amazonaws.com/cumulative_elements.csv
Since production Planet replication is at links like this: http://planet.openhistoricalmap.org.s3.amazonaws.com/planet/planet-200514_1030.osm.pbf
I assumed for stats we'd have: http://planet.openhistoricalmap.org.s3.amazonaws.com/cumulative_elements.csv
But that's an error page.
I see those links are working now, and I am pulling them into this Google Sheet.
I know @kamicut is working on an Observable notebook version, which is pending resolution of https://github.com/OpenHistoricalMap/issues/issues/536 to be able to pull in production data.
An interesting fact about our production data -- a single user, flha127, mapping roads in France, accounts for 20% of all changesets!
And the top 10 editors account for 43% of all changesets. And the top 50 users account for 73% of all changesets. Not surprising given where OHM is at in its evolution, but certainly notable.
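For anyone who wants to reproduce those percentages, it's just a cumulative share over users sorted by changeset count. A sketch, assuming a hypothetical per-user CSV with `user` and `changesets` columns (not one of the currently published files):

```python
# Reproduce the "top N users account for X% of changesets" figures from a
# hypothetical per-user CSV with `user` and `changesets` columns.
import pandas as pd

df = pd.read_csv("changesets_by_user.csv")  # hypothetical input file
df = df.sort_values("changesets", ascending=False).reset_index(drop=True)
share = df["changesets"].cumsum() / df["changesets"].sum()
for n in (1, 10, 50):
    print(f"top {n} users: {share.iloc[n - 1]:.0%} of all changesets")
```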
Now that HTTPS is working on planet.openhistoricalmap.org, does the Observable still need to be updated to pull from the real production data?
I am working on some reprioritization of our backlog, and curious if we should put this back on top of our list, or if @1ec5's nice additions to the Wiki Page, https://wiki.openstreetmap.org/wiki/OpenHistoricalMap#OHM_Web_Services, serve some of this need?
Or if we want to add to that with these granular stats, should we do it in a way that can be embedded into the Wiki?
The statistics on the wiki only include a single aggregate number based on this file, so it isn’t quite the same thing as having a leaderboard or something integrated into a user profile. As far as that single aggregate number, though, putting something on the wiki does create a historical record. It’s just a lot more manual than it should be. It would be nice if we could keep archives of these statistics as we do with older planet files.
Discussion at GreenInfo:
Key views:
Placing this wiki stats page here as a reference, and as mentioned by @1ec5 two comments above: https://wiki.openstreetmap.org/wiki/OpenHistoricalMap/Statistics
We just put together an initial content outline in this tab of our data sheet: https://docs.google.com/spreadsheets/d/1agos5gTR8PEo3pwiUhxFEbVWHPFDhXdLb8vUJar7LEI/edit#gid=438203943
Tim to work on a design this week.
We also discussed the possibility of using the vector tiles as we have them with an alternative style that uses the feature ID as a proxy for recency (higher integer ID = more recent). We think this is worth some experimentation since it would require no changes on the data side. If it does not work, we could also look into other options for map views of changeset boundaries, or just not have a map and let folks use History or OSMCha for that (though those are not designed to give an overview beyond the most recent changes).
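To sketch what that alternative style could look like, here is one possible layer definition, written as a Python dict that would be dumped into the style JSON; the source name, source-layer, and ID range are placeholders, not our real tile schema:

```python
# Sketch of an "age by ID" style layer for the existing vector tiles: color
# each feature by its numeric ID as a proxy for recency. The source name,
# source-layer, and ID range below are placeholders.
import json

recency_layer = {
    "id": "nodes-by-recency",
    "type": "circle",
    "source": "ohm",                # placeholder source name
    "source-layer": "points",       # placeholder source-layer
    "paint": {
        "circle-radius": 2,
        "circle-color": [
            "interpolate", ["linear"], ["id"],
            0, "#ffeda0",           # lowest (oldest) IDs: yellow
            2000000000, "#f03b20",  # highest (newest) IDs: red
        ],
    },
}
print(json.dumps(recency_layer, indent=2))
```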
We're finally getting to this!
@tsinn put together a mockup:
And then we also experimented a bit with using Kepler.gl to visualize all 90,000 nodes in OHM, using the item ID as a proxy for recency.
Here's what that looks like (red is more recent, yellow less recent):
Pretty cool! This achieved two things:
Note that for the purposes of testing, I think we're using only the nodes that we publish as nodes on their own (so not nodes that are part of ways).
If we wanted to make an interactive map that visualized every node in the database using the object ID as a way to show newer vs older, but also with some time filtering like shown in the mockup, we could do something like this:
Dump a CSV of every node with columns lat,lng,id (and possibly year, for the time filtering). Maybe that gets regenerated weekly or monthly or daily.
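A sketch of that periodic dump, assuming the standard apidb `current_nodes` table (where latitude and longitude are stored as integers scaled by 1e7); the connection string and output path are placeholders:

```python
# Sketch of the periodic dump described above: write every current node to a
# CSV of lat,lng,id. Assumes the standard apidb `current_nodes` table, where
# latitude/longitude are stored as integers scaled by 1e7.
import csv

import psycopg2

conn = psycopg2.connect("dbname=openstreetmap")  # placeholder connection
with conn, conn.cursor(name="nodes") as cur, open("nodes.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["lat", "lng", "id"])
    # Server-side cursor so we stream rows instead of loading them all at once.
    cur.execute(
        "SELECT latitude / 1e7, longitude / 1e7, id FROM current_nodes WHERE visible"
    )
    for row in cur:
        writer.writerow(row)
```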
Jeff was asking about the easiest way to get stats for:
We discussed OSMCha setup, see https://osmcha.org/ and https://github.com/developmentseed/scoreboard
Or just write tools to dump out stats?
Sanjay will poll internally at DevSeed to see if there are other options
See #306