drolbr / Overpass-API

A database engine to query the OpenStreetMap data.
http://overpass-api.de
GNU Affero General Public License v3.0

Outputting metadata only #626

Open stefanct opened 3 years ago

stefanct commented 3 years ago

I am trying to do some statistics on the history of route relations. I am not interested in the relation members at all, just the history of the metadata, e.g., which users edited which version at which point in time. I am interested in reducing the server load - to be able to gather data from some relations

Some of the relations have a long history with many members (think of (inter)national routes). In "simple" queries one has to use out meta to fetch the necessary information. The standard API's /history returns basically the same as Overpass in that case: the complete history including the complete data of all members. It is thus no alternative either.

An example query I am using is depicted below.

[timeout:10][out:json];
timeline(relation,8816024);
foreach(
  retro(u(t["created"]))(
    (relation(8816024););
    out meta;
  );
);

One way to trim that down significantly is to select exactly what is output by using make stat, e.g.,

[timeout:10][out:json];
timeline(relation,8816024);
for (t["created"]){
  retro(_.val)
  {
    rel(8816024);
    make stat version=u(version()),
        timestamp=u(timestamp()),
        user=u(user()),
        changeset=u(changeset());
    out;
  }
}

This helps a lot on the client side (if the library would otherwise deserialize the whole dataset into objects); however, AFAICT it does not reduce the load on the server at all. Would it make sense to add another output modifier that returns the bare minimum of metadata so that the server can do less work, or can the query be optimized somehow?
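A minimal Python sketch of how a client might build the stat query above and consume its output. It assumes that the JSON output represents each `make stat` result as an element with `"type": "stat"` and its values as strings under `"tags"`; the helper names `build_stat_query` and `extract_versions` and the sample values are hypothetical and only illustrate the shape.

```python
from typing import Dict, List


def build_stat_query(relation_id: int, timeout: int = 10) -> str:
    """Build the stat query shown above for a given relation id."""
    return "\n".join([
        f"[timeout:{timeout}][out:json];",
        f"timeline(relation,{relation_id});",
        'for (t["created"]){',
        "  retro(_.val)",
        "  {",
        f"    rel({relation_id});",
        "    make stat version=u(version()),",
        "        timestamp=u(timestamp()),",
        "        user=u(user()),",
        "        changeset=u(changeset());",
        "    out;",
        "  }",
        "}",
    ])


def extract_versions(response: Dict) -> List[Dict]:
    """Collect one row per 'stat' element, sorted by version number."""
    rows = []
    for element in response.get("elements", []):
        # Assumption: derived elements carry the name given to `make` as type.
        if element.get("type") != "stat":
            continue
        tags = element.get("tags", {})
        rows.append({
            "version": int(tags["version"]),
            "timestamp": tags["timestamp"],
            "user": tags["user"],
            "changeset": int(tags["changeset"]),
        })
    return sorted(rows, key=lambda row: row["version"])


# Made-up response, only to show the assumed layout (not real OSM data).
sample = {
    "elements": [
        {"type": "stat", "id": 2, "tags": {
            "version": "2", "timestamp": "2020-01-02T00:00:00Z",
            "user": "example_user_b", "changeset": "222"}},
        {"type": "stat", "id": 1, "tags": {
            "version": "1", "timestamp": "2020-01-01T00:00:00Z",
            "user": "example_user_a", "changeset": "111"}},
    ]
}

for row in extract_versions(sample):
    print(row["version"], row["timestamp"], row["user"], row["changeset"])
```

Only four small values per version reach the client this way, but the server still reconstructs the relation for every version, so this addresses transfer and parsing cost only.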

For reference, this is what I did so far: https://github.com/stefanct/osm_refhistorymeta/blob/main/ref_contributors.py

PS: I found the documentation concerning retro a bit lacking. For example, I have no idea why the examples sometimes use foreach vs. for (like the two above). Using the full name of u and t would also have made things a bit easier to grasp for me.

PPS: In the official documentation there is a hint to out noids and this does indeed remove the IDs of the relation and its members but that's only a fraction of what is returned (e.g., all tags are still there) and probably does not reduce the server's load either.

PPPS: This is somewhat related to #189 but only very loosely.

mmd-osm commented 3 years ago

First of all, retro and timeline are only providing the minimum viable product (or baseline) functionality, see https://dev.overpass-api.de/blog/sliced_time_and_space.html#timeline - iterating over several hundred versions triggers a reconstruction of the relation at each point in time, which is rather expensive.

I'm posting the results of your script from a local test run for your reference. https://gist.github.com/mmd-osm/5327e534807b8c45ed015c7b2956cac9 - it only took a few minutes to process.

In theory, the contents of a file called relations_meta_attic.bin would be sufficient for your use case. It is only 259 MB in size and is available from https://dev.overpass-api.de/clone/. Reading it requires some custom C++ code, though.

stefanct commented 3 years ago

I am not entirely sure how to interpret your reply and think maybe there is a misunderstanding. I don't necessarily need help for this particular case. I worked around the biggest problems (timeouts and request rate limiting) and as you could see it works fairly ok. It takes only a few minutes because it skips over the really big relations. I don't necessarily need them here but each of them alone takes several minutes - and some don't even finish within 10 minutes (each!).

The intent of my report was rather to spark a discussion of how such use cases could be improved in general within Overpass. If I understand correctly, then all the metadata needed is contained in independent files (+ the indices I guess). That means that one could actually interact with them locally in a custom application without the need for all others (specifically w/o the non-attic version) - but that also means that the OP server has access to these data without the need to merge a lot of different "tables", right?

mmd-osm commented 3 years ago

timeline(relation,8816024);
out;

would have already printed all relevant details, except for the user id and the changeset. Adding both fields is a two-line change; the data is available at that point anyway.
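A rough Python sketch of reading such a timeline-only result on the client, assuming `[out:json]` is used. The layout is an assumption based on the blog post linked earlier in the thread: one element of type `timeline` per version, with string tags `refversion`, `created` and, for superseded versions, `expired`. The function names and sample values are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an OSM-style UTC timestamp such as 2020-01-01T00:00:00Z."""
    if not value:
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_timeline(response: Dict) -> List[Dict]:
    """Return one row per timeline element, sorted by version number."""
    versions = []
    for element in response.get("elements", []):
        # Assumption: timeline results are emitted with type "timeline".
        if element.get("type") != "timeline":
            continue
        tags = element.get("tags", {})
        versions.append({
            "version": int(tags["refversion"]),
            "created": parse_timestamp(tags.get("created")),
            # Assumption: the current version has no "expired" tag.
            "expired": parse_timestamp(tags.get("expired")),
        })
    return sorted(versions, key=lambda row: row["version"])


# Made-up response, only to show the assumed layout (not real OSM data).
sample_timeline = {
    "elements": [
        {"type": "timeline", "id": 1, "tags": {
            "reftype": "relation", "ref": "8816024", "refversion": "1",
            "created": "2020-01-01T00:00:00Z",
            "expired": "2020-01-02T00:00:00Z"}},
        {"type": "timeline", "id": 2, "tags": {
            "reftype": "relation", "ref": "8816024", "refversion": "2",
            "created": "2020-01-02T00:00:00Z"}},
    ]
}

for row in parse_timeline(sample_timeline):
    print(row["version"], row["created"], row["expired"])
```

User and changeset are absent from this output, as noted above, so this variant only covers version numbers and timestamps.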

stefanct commented 3 years ago

Just to make it clear... I am satisfied and the script will only be executed maybe another dozen times over the next weeks or so. It is kind of a one-time hack. The purpose of the issue was really just to show an actual use case for this kind of query in case you get bored. ;) But it's great to know and have it publicly documented that there is a much faster alternative in any case. Thanks, bye.