Enhancing Data Extraction Endpoints

FabiKo117 commented 4 years ago

Through the recent discussions in https://github.com/GIScience/ohsome-api/pull/18, we've identified the need for a more structured way to request modified data. We see the need for the additional endpoints listed below. We've chosen to add these features to a new /contributions/ endpoint, as this fits well with the direction of upcoming features.

[x] /contributions/geometry/ for the modifications as GeoJSON
[x] /contributions/latest/geometry/ for the "latest" changes only, as GeoJSON

Here, geometry could be extended to also allow centroid or bbox.

Perhaps in the future the following endpoints could be added as well (specifics are to be defined still):

[x] /contributions/count for modification counts as JSON -> #121
[ ] /contributions/count/groupBy/contributionType for modification counts as JSON grouped by contribution type (creation/modification/deletion)
[ ] /contributions/ for the modifications as annotated JSON (like OsmChange XML – open question: what does the format look like exactly?)
[x] /contributions/latest/ for the "latest" modifications as annotated JSON (like OsmChange XML)

FabiKo117 commented 4 years ago

I've started to implement the /contributions endpoint and a few questions came up along the way.

So what is the difference between the response of /fullHistory vs. the response here of /contributions? As I see it, it's only in the properties of each response feature, because you would get the same features in both endpoints for the same parameters. Currently the /fullHistory endpoint also gives one feature for each modification, it just does not add the contribution type and the contribution time to it.

So I thought it could make sense here to integrate now /fullHistory under the /contributions name and think about what kind of properties each feature must have (and could have depending on what the user wants to know). Open for your opinions @tyrasd @rtroilo

tyrasd commented 4 years ago

So what is the difference between the response of /fullHistory vs. the response here of /contributions?

To me the main difference is that /elementsFullHistory/ not only returns the changes in the data, but also the unchanged data: i.e. if you request only a short time frame, then you will get the state as it was at the start of the time frame, plus all changes needed to end up with the state at the end of the requested time frame. On the other hand, the /contributions endpoint does not return any elements that were not touched in the time frame.
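A toy Python sketch of this difference, under stated assumptions: the `Version` record and both functions are hypothetical names for illustration, not the ohsome API code, and deletions are ignored to keep the model small.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Version:
    osm_id: int
    timestamp: str  # ISO 8601 strings of equal length compare chronologically
    tags: str


def contributions(history: List[Version], start: str, end: str) -> List[Version]:
    """Only the edits whose timestamp falls inside [start, end]."""
    return [v for v in history if start <= v.timestamp <= end]


def full_history(history: List[Version], start: str, end: str) -> List[Version]:
    """The state at `start` for every element, plus every edit inside [start, end]."""
    result = []
    for osm_id in sorted({v.osm_id for v in history}):
        versions = sorted(
            (v for v in history if v.osm_id == osm_id), key=lambda v: v.timestamp
        )
        before = [v for v in versions if v.timestamp < start]
        if before:
            result.append(before[-1])  # snapshot: last version before the time frame
        result.extend(v for v in versions if start <= v.timestamp <= end)
    return result


# Made-up example data: element 1 is edited inside the time frame, element 2 is not.
HISTORY = [
    Version(1, "2015-01-01T00:00:00", "building=yes"),
    Version(1, "2016-03-01T00:00:00", "building=house"),
    Version(2, "2014-05-01T00:00:00", "building=yes"),
]
START, END = "2016-01-01T00:00:00", "2016-06-01T00:00:00"

edits_only = contributions(HISTORY, START, END)
with_snapshot = full_history(HISTORY, START, END)
```

In this model `edits_only` contains just the one edit to element 1, while `with_snapshot` additionally carries the untouched element 2 and the pre-edit state of element 1.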

So I thought it could make sense here to integrate now /fullHistory under the /contributions

I've thought about this as well, but I'm not sure this would "add" something overall: while it reduces the number of resources, it makes the remaining ones harder to use (because there are more options to consider, or because achieving the result of /elementsFullHistory/ would require two API calls instead of one). So, at the moment I'd rather keep the additional endpoint, if there is nothing else I've overlooked. Perhaps the name of the elementsFullHistory resources could be improved, though :thinking:

FabiKo117 commented 4 years ago

Ah true, I forgot about the snapshot of the data that you get at the start timestamp... Hm yeah then it's probably better to leave them separated for now.

Another discussion point would be what to include in the response. Had some discussion now with rafael as well and he added quite a lot of attributes in an analysis that he did. Here's the example of one feature:

{ 
  "@osmId":100,
  "@osmType":"WAY",
  "@version":2,
  "@minorVersion":1, // minor edit
  "@uid":123,
  "@changesetId":1230,
  "@contribUid":444, // ContributionUser
  "@contribChangesetId":4440, // ContributionChangeset
  "@validFrom":"2018-06-20T09:48:36",
  "@validTo":"2018-07-01T06:44:41",
  "@creation":true, // ContributionType
  "@filterPass":"geom",
  "@geometry_change":true, // ContributionType
  "@deletion":true, // ContributionType of nextVersion
  "@deletionUid":666, // ContributionUser of nextVersion
  "@deletionChangesetId":6660, // ContributionChangeset of nextVersion
  "@filterFail":"geom",
  "building":"yes"
}

This looks like quite a lot of attributes that we could add as well, e.g. splitting the contributionType already in the properties (as one contribution could include a geometry and a tag change) or adding something like "@filterFail":"geom" to state that the feature has just been moved out of the requested area and has not really been deleted.

tyrasd commented 4 years ago

This looks like quite a lot of attributes that we could add as well

I think the splitting of the contribution type into several booleans is quite elegant and intuitive. :+1:
I guess it could be avoided to special-case the @deletions "contribution type of the next version" if we model them as null geometries (or instead as pre-deletion geometries – whatever makes more sense).
@filterPass and @filterFail – what were these used for?
@minorVersion see above, maybe replace with @contributionTimestamp?
@uid might be problematic :roll_eyes:

rtroilo commented 4 years ago

what you see here is a feature:

validFrom: timestamp of the minor edit contribution. It was a geometry change which "moved" this feature into my area of interest -> ContributionType "creation"/"geometry_change". The reason for the creation was "filterPass": "geom", because the feature is now within my filter constraints; it existed before, but did not satisfy them.
validTo: timestamp of the next contribution. The next contribution was also a geometry_change, which moved the feature out of my area of interest again -> ContributionType "deletion", "filterFail":"geom".

There is also a "filter(Pass/Fail)":"tag", for example if someone changes or removes the tag "building:yes". There will never be a filter(Pass/Fail) for tag and geom at the same time, i.e. in the same contribution.
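The pass/fail logic described above could be sketched in Python roughly as follows. This is an illustration only: `State` and `classify` are hypothetical names, and checking the geometry before the tags is an assumption made here so that only one reason is ever reported.

```python
from typing import NamedTuple, Optional, Tuple


class State(NamedTuple):
    in_area: bool    # geometry intersects the requested area
    tag_match: bool  # tags satisfy the requested filter

    @property
    def passes(self) -> bool:
        return self.in_area and self.tag_match


def classify(before: State, after: State) -> Optional[Tuple[str, str]]:
    """Return ("filterPass" | "filterFail", "geom" | "tag"), or None if nothing flipped."""
    if before.passes == after.passes:
        return None
    # Assumption: report "geom" whenever the area membership changed, else "tag".
    reason = "geom" if before.in_area != after.in_area else "tag"
    return ("filterPass" if after.passes else "filterFail", reason)


# The feature from the example: moved into the area, later moved out again.
moved_in = classify(State(in_area=False, tag_match=True), State(True, True))
moved_out = classify(State(True, True), State(in_area=False, tag_match=True))
```

A "filterPass" would then be presented as a creation and a "filterFail" as a deletion from the point of view of the request, even though the OSM object itself existed before and afterwards.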

@minorVersion is just an increased number of minor edits for the same feature version within my time frame.

@uid yes it was for internal use only :see_no_evil:

I see the confusion that the contribution of the deletion sneaked into the contribution just before it. But deleted features never have tags or geometry, and the only information they carry is the deletion timestamp (@validTo) and the changesetId (and, internally, the user ID). So in my use case it made sense to move this information to the previous version, because the validTo was already there.

FabiKo117 commented 4 years ago

So I guess we should decide on a set of properties that we want to give in the public response.

{ 
  "@osmId":100,
  "@osmType":"WAY",
  "@version":2,
  "@changesetId":1230,
  "@creation":true, // ContributionType
  "@timestamp":"2018-06-20T09:48:36", // instead of validFrom
  "@geometryChange":true, // ContributionType
  "building":"yes"
}

This would be my proposition of attributes for one contribution. Under an endpoint called /contributions I would expect one feature representing one specific contribution. Therefore I would not like to add information about other contributions on this feature to the properties. Also, is not the combination of "(at)creation" and "@geometryChange" sufficient to know that this feature has just been moved into my bbox and is therefore included now and not been created newly? Would it be possible then to add a contribution also for a minor edit? There I would add then additionally "@minorVersion:1" and keep the same "(at)version". Did I forget or misunderstand something?
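As a rough Python sketch of the proposed property set: the function names are hypothetical, and leaving out flags that are false is an assumption of this sketch, not something decided in this thread.

```python
from typing import Any, Dict, Optional


def contribution_properties(
    osm_id: int,
    osm_type: str,
    version: int,
    changeset_id: int,
    timestamp: str,
    tags: Dict[str, str],
    *,
    creation: bool = False,
    geometry_change: bool = False,
    minor_version: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the properties of one feature that represents one contribution."""
    props: Dict[str, Any] = {
        "@osmId": osm_id,
        "@osmType": osm_type,
        "@version": version,
        "@changesetId": changeset_id,
        "@timestamp": timestamp,
    }
    # Assumption: a flag is only written when it is true.
    if creation:
        props["@creation"] = True
    if geometry_change:
        props["@geometryChange"] = True
    # Only present for edits that do not increase the version number.
    if minor_version is not None:
        props["@minorVersion"] = minor_version
    props.update(tags)  # tag keys are copied unchanged
    return props


def moved_into_area(props: Dict[str, Any]) -> bool:
    """Creation plus geometry change: the feature was moved into the bbox."""
    return bool(props.get("@creation")) and bool(props.get("@geometryChange"))


example = contribution_properties(
    100, "WAY", 2, 1230, "2018-06-20T09:48:36", {"building": "yes"},
    creation=True, geometry_change=True,
)
```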

rtroilo commented 4 years ago

Also, is not the combination of "(at)creation" and "@geometryChange" sufficient to know that this feature has just been moved into my bbox.

That's true, I guess this combination should be enough.

What is the @changesetId? Is it the changeset of the contribution? And @contributionTimestamp: would it not be better to call it just @timestamp?

Would it be possible then to add a contribution also for a minor edit?

Actually your example was/is most likely a minor edit. So what do you mean with add a contribution also for minor edits?

joker234 commented 4 years ago

Another question: How do you handle keys starting with @? At least some of them are in the keytables database. I see two possibilities here:

1. escape these rare keys with another @, e.g. @diehummel → @@diehummel
2. put all tags in another JSON object, which is allowed in GeoJSON, but may not be recognized by parsers that do not fully follow the RFC, e.g.:
```
{ 
"@osmId":100,
…
"@geometryChange":true, // ContributionType
"tags": {
"building":"yes",
"@diehummel":"something"
}
}
```

tyrasd commented 4 years ago

@joker234 I would prefer option 1, because these keys are so rare and not normally used in OSM tags. Taginfo even considers these problematic, and if I'm not mistaken there are currently none to be found in the OSM database. Option 2 would be technically possible, but has the major downside that almost no software will support it.

rtroilo commented 4 years ago

I also prefer option 1. I think tools like QGIS couldn't handle nested properties.

if I'm not mistaken than currently there are none to be found in the OSM database

At least in our keyvalue table you will find tags with @osmId and other @... keys. I guess these are from the early days of OSM.

rtroilo commented 4 years ago

What is the use case for this endpoint or how should the user use it. For example if you want to visualize the changes over time, it is not so easy with the current design because of the missing @validTo . With the fullHistory endpoint and the timemanager plugin I used a @validFrom < timestamp < @validTo filter. Even described in our first blogpost. ohsome-part1

But @validTo is actually the timestamp of the following contribution!
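To illustrate both points in Python: a validity interval can be derived by pairing each contribution timestamp with the following one, and a time filter can then be applied on top. This is only a sketch; the function names are hypothetical, the open-ended `None` for the last version and the half-open comparison are choices made here for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[str, Optional[str]]  # (validFrom, validTo)


def with_validity(timestamps: List[str]) -> List[Interval]:
    """Pair each contribution timestamp with the timestamp of the following one."""
    ordered = sorted(timestamps)
    return [
        (t, ordered[i + 1] if i + 1 < len(ordered) else None)
        for i, t in enumerate(ordered)
    ]


def valid_at(intervals: List[Interval], t: str) -> List[Interval]:
    """Keep intervals with validFrom <= t < validTo (open-ended if validTo is None)."""
    return [
        (start, end)
        for start, end in intervals
        if start <= t and (end is None or t < end)
    ]


intervals = with_validity(["2018-07-01T06:44:41", "2018-06-20T09:48:36"])
visible = valid_at(intervals, "2018-06-25T00:00:00")
```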

FabiKo117 commented 4 years ago

What is the @changesetId ? Is it the changeset of the contribution?

Yes from that particular contribution.

And @contributionTimestamp: would it not be better to call it just @timestamp?

True, I can just call it "timestamp" as well.

Actually your example was/is most likely a minor edit. So what do you mean with add a contribution also for minor edits?

Not necessarily :) But yeah I would then add the property "(at)minorVersion" if it's a contribution that does not result in a version number increase.

About the issue with "@": I guess we only have an issue if a tag uses a key that we already use ourselves. Then we could apply the escaping with the double @@, but otherwise it's not needed. We would still be altering the OSM data in that case and would have to at least document that in our docs (I guess it's not possible in the response as the data is streamed).
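A minimal Python sketch of escaping only on collision. The `RESERVED` set is an assumption assembled from the property names mentioned in this thread, and `escape_colliding_keys` is a hypothetical helper, not existing ohsome API code.

```python
from typing import Dict, FrozenSet

# Assumption: the metadata keys proposed in this thread.
RESERVED: FrozenSet[str] = frozenset({
    "@osmId", "@osmType", "@version", "@changesetId",
    "@timestamp", "@creation", "@geometryChange", "@minorVersion",
})


def escape_colliding_keys(
    tags: Dict[str, str], reserved: FrozenSet[str] = RESERVED
) -> Dict[str, str]:
    """Prefix a tag key with an extra "@" only if it clashes with a reserved key."""
    return {("@" + key if key in reserved else key): value for key, value in tags.items()}


escaped = escape_colliding_keys(
    {"building": "yes", "@osmId": "5", "@diehummel": "something"}
)
```

One trade-off of this variant: it cannot be reversed unambiguously if a tag that is literally named "@@osmId" also exists. Escaping every key that starts with "@" avoids that, at the cost of altering more keys.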

FabiKo117 commented 4 years ago

What is the use case for this endpoint or how should the user use it.

use case: project of @Zia- if someone is just interested in the changes that happen in a specific region and then especially only in the last change. The data extraction here is just the first step, following steps are to make aggregations as well, so having a /count with different groupings, like /groupBy/contributionType.
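A grouping like /groupBy/contributionType could, for example, be computed from the extracted features like this. It is a client-side Python sketch: "@creation", "@geometryChange" and "@deletion" appear earlier in this thread, while "@tagChange" is a hypothetical flag name added for illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable

# "@tagChange" is a hypothetical name; the other flags appear in this thread.
FLAGS = ("@creation", "@geometryChange", "@tagChange", "@deletion")


def count_by_contribution_type(features: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count how many contributions carry each contribution-type flag."""
    counts: Counter = Counter()
    for props in features:
        for flag in FLAGS:
            if props.get(flag):
                counts[flag.lstrip("@")] += 1
    return dict(counts)


counts = count_by_contribution_type([
    {"@osmId": 1, "@creation": True, "@geometryChange": True},
    {"@osmId": 1, "@geometryChange": True},
    {"@osmId": 2, "@deletion": True},
])
```

Because one contribution can carry several flags, the counts may add up to more than the number of contributions.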

rtroilo commented 4 years ago

Sure you are right, we don't need a validTo for the

/contributions/latest/geometry/

endpoint :-) But for the other it would make sense.

following steps are to make aggregations as well, so having a /count with different groupings, like /groupBy/contributionType

You are talking about a new aggregation endpoint and not a data extraction one, right?

tyrasd commented 4 years ago

For example if you want to visualize the changes over time

I agree that this use case benefits from a @validTo timestamp; that's why we include it in the /elementsFullHistory/ endpoint. But the /contributions/ endpoint is meant to be used for use cases which only look at the edit events of the data. I think there is no need to replicate what we already have implemented nicely, is there?

FabiKo117 commented 4 years ago

Yeah exactly, this can be achieved better through using the /elementsFullHistory endpoint and we don't need to re-implement what's already possible with another endpoint.

Then I will go now for the proposed properties and implement that.

Zia- commented 4 years ago

What is the use case for this endpoint or how should the user use it.

use case: project of @Zia- if someone is just interested in the changes that happen in a specific region and then especially only in the last change. The data extraction here is just the first step, following steps are to make aggregations as well, so having a /count with different groupings, like /groupBy/contributionType.

Just to make it clear, I am not intending to go beyond Ohsome's data extraction as I'm already doing subsequent aggregation on my own. However, if you guys are planning different grouping anyway, fair enough +1

FabiKo117 commented 4 years ago

Sure, the possibility to perform data aggregation on contributions in the future was just one of the reasons why we've decided to go for a new endpoint called /contributions instead of adding that through a parameter to the existing /elementsFullHistory endpoint.

FabiKo117 commented 4 years ago

I've implemented now a working solution for /contributions and /contributions/latest. The URL that I've used to test my code was the following: http://localhost:8080/contributions/geometry?bboxes=8.687337,49.415067,8.687493,49.415172&filter=building=*&time=2010-01-01,2016-06-01&showMetadata=yes&properties=metadata,tags,unclipped

And with this example you can also use the /contributions/latest endpoint and change the timerange to see if you always only get the latest change.
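For anyone who wants to script this check, here is a small Python sketch. It only builds the request URLs from the parameters above and does not send anything. `build_url` and `latest_per_element` are hypothetical helpers, and the latter assumes that each feature carries the "@osmType", "@osmId" and "@timestamp" properties proposed earlier.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode

BASE = "http://localhost:8080"
PARAMS = {
    "bboxes": "8.687337,49.415067,8.687493,49.415172",
    "filter": "building=*",
    "time": "2010-01-01,2016-06-01",
    "showMetadata": "yes",
    "properties": "metadata,tags,unclipped",
}


def build_url(resource: str, params: Dict[str, str] = PARAMS, base: str = BASE) -> str:
    """Build a request URL, keeping ',', '=' and '*' unescaped as in the URL above."""
    return f"{base}{resource}?{urlencode(params, safe=',=*')}"


def latest_per_element(features: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reduce /contributions properties to the latest contribution per element."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for props in features:
        key = (props["@osmType"], props["@osmId"])
        if key not in latest or props["@timestamp"] > latest[key]["@timestamp"]:
            latest[key] = props
    return list(latest.values())


all_url = build_url("/contributions/geometry")
latest_url = build_url("/contributions/latest/geometry")
```

Applying `latest_per_element` to the properties returned for `all_url` should then give the same contributions as the response for `latest_url`, if the endpoint behaves as described.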

The code can be found in this branch. I will still do some refactoring as it looks quite nested and complicated right now. Would be nice if you could test it a bit to further check if it's working as it should and delivers what's expected.

SlowMo24 commented 3 years ago

So what is the difference between the response of /fullHistory vs. the response here of /contributions?

To me the main difference is that /elementsFullHistory/ not only returns the changes in the data, but also the unchanged data: i.e. if you request only a short time frame, then you will get the state as it was at the start of the time frame, plus all changes needed to end up with the state at the end of the requested time frame. On the other hand, the /contributions endpoint does not return any elements that were not touched in the time frame.

So I thought it could make sense here to integrate now /fullHistory under the /contributions

I've thought about this as well, but I'm not sure this would "add" something overall: while it reduces the number of resources, it makes the remaining ones harder to use (because there are more options to consider, or because achieving the result of /elementsFullHistory/ would require two API calls instead of one). So, at the moment I'd rather keep the additional endpoint, if there is nothing else I've overlooked. Perhaps the name of the elementsFullHistory resources could be improved, though :thinking:

Currently this information is missing from the documentation. I suggest extending the documentation with this information (and with the points from the comments following it)! Users will be confused by these slight differences if they are not mentioned.

GIScience / ohsome-api

Enhancing Data Extraction Endpoints #23