sorting column for paginated WFS requests

bcgov / bcdata

An R package for searching & retrieving data from the B.C. Data Catalogue

https://bcgov.github.io/bcdata

Apache License 2.0

sorting column for paginated WFS requests #76

Closed: smnorris closed this issue 5 years ago

smnorris commented 5 years ago

I've taken your logic for getting a sortby column and used it in Python bcdata - just take the first column:

sorting_col <- x$obj[["details"]][["column_name"]][1]
query_list <- c(query_list, sortby = sorting_col)

But for paginated requests, I'm getting duplicated (and presumably missed) features for the table WHSE_FOREST_TENURE.FTEN_ROAD_SEGMENT_LINES_SVW. The first column, FOREST_FILE_ID, is not unique. Sorting by OBJECTID instead fixes the problem. I believe this is an ESRI-generated field; I haven't checked, but it is likely to be present in all WFS layers.

smnorris commented 5 years ago

Ok, looks like first column returned in R is id... I should be filing this issue in my package not here!

ateucher commented 5 years ago

Oh, this is probably something we should check to make sure it's not happening to us! Thanks, I'm going to re-open just to make sure. @boshek you did the pagination stuff - what do you think?

smnorris commented 5 years ago

It is tricky because feature counts are correct. I only noticed because I was getting duplicate culverts when intersecting roads with streams.
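The failure mode described above can be reproduced without touching the service. The following is only a minimal sketch, assuming a server that gives no guarantee about the order of rows whose sort values tie; `fetch_page`, `paginate`, the `flip_ties` switch and the six-row table are all hypothetical stand-ins, not part of bcdata or the WFS.

```python
def fetch_page(rows, sort_col, start, count, flip_ties):
    """Simulate one paged request against a server that sorts only by sort_col.

    Rows with equal sort values may come back in any order; flip_ties models
    that by reversing the order of tied rows on some requests.
    """
    by_id = sorted(rows, key=lambda r: r["OBJECTID"], reverse=flip_ties)
    # Python's sort is stable, so tied rows keep the order they had in by_id.
    ordered = sorted(by_id, key=lambda r: r[sort_col])
    return ordered[start:start + count]


def paginate(rows, sort_col, page_size):
    """Collect OBJECTIDs page by page, as a paginating client would."""
    seen = []
    for page_no, start in enumerate(range(0, len(rows), page_size)):
        # Every second request resolves ties in the opposite order.
        page = fetch_page(rows, sort_col, start, page_size,
                          flip_ties=(page_no % 2 == 1))
        seen.extend(r["OBJECTID"] for r in page)
    return seen


# Hypothetical data: four rows share one FOREST_FILE_ID, two share another.
rows = [{"OBJECTID": i, "FOREST_FILE_ID": "A" if i <= 4 else "B"}
        for i in range(1, 7)]

print(paginate(rows, "FOREST_FILE_ID", 2))  # [1, 2, 2, 1, 5, 6]
print(paginate(rows, "OBJECTID", 2))        # [1, 2, 3, 4, 5, 6]
```

With the non-unique column, six features still come back, but two are duplicated and two are never returned, which is why a feature count alone does not reveal the problem. Sorting on a unique column leaves no ties for the server to reorder.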

boshek commented 5 years ago

Indeed I just noticed that too. I was never particularly happy with this solution:

sorting_col <- x$obj[["details"]][["column_name"]][1]

@webgismd can you comment on a way to generalize the sorting column? Ideally we'd be able to use the primary key. Is OBJECTID present in every WFS layer, and would it be a good column to use as a primary key in this case?

webgismd commented 5 years ago

I would use OBJECTID, yes. I think the last BCGW upgrade added OBJECTID columns to all spatial feature classes that did not have them. I could confirm this with some of the DAs, but I think OBJECTID is a reasonable sort column for pagination.

ateucher commented 5 years ago

Fantastic, thanks for being so responsive @webgismd - if you don't mind confirming that would be great. Otherwise we could use OBJECTID by default and have a fallback (of the first column?) just in case it doesn't exist?

smnorris commented 5 years ago

I'm just running this to check:

import bcdata
from owslib.wfs import WebFeatureService
import click

# Connect once and reuse the service for every table
wfs = WebFeatureService(url=bcdata.OWS_URL, version="2.0.0")
tables = bcdata.list_tables()
with click.progressbar(tables) as bar:
    for table in bar:
        # Column names come from the layer's schema
        columns = list(wfs.get_schema("pub:" + table)["properties"].keys())
        if "OBJECTID" not in columns:
            print("{table} does not include OBJECTID column".format(table=table))

webgismd commented 5 years ago

Just checked, and all but about 30 objects have OBJECTID. Except for two of these, they all start with GSR in the table name and use SEQUENCE_ID for the same function as OBJECTID. (The other two would fit the idea above of taking the first field.) But it is a standard of the DAs for net new objects to include OBJECTID as a unique index.
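The selection rule that falls out of this could look something like the sketch below. It is an illustration only: `choose_sort_column` is a hypothetical helper, not a function in either bcdata package, and the example column lists are made up.

```python
def choose_sort_column(columns):
    """Pick a sort column for paginated requests from a layer's column names.

    Prefers OBJECTID, then SEQUENCE_ID (used by the GSR tables), and falls
    back to the first column when neither is present.
    """
    if not columns:
        raise ValueError("layer has no columns to sort by")
    for candidate in ("OBJECTID", "SEQUENCE_ID"):
        if candidate in columns:
            return candidate
    # Fallback: the first column may not be unique, so pages can still overlap.
    return columns[0]


# Hypothetical column lists
print(choose_sort_column(["FOREST_FILE_ID", "OBJECTID", "GEOMETRY"]))  # OBJECTID
print(choose_sort_column(["SEQUENCE_ID", "NAME"]))                     # SEQUENCE_ID
print(choose_sort_column(["NAME", "GEOMETRY"]))                        # NAME
```

The first-column fallback keeps requests working for the couple of tables with neither field, at the cost of the original uniqueness caveat.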

ateucher commented 5 years ago

Brilliant, great to have that clarity - thanks @webgismd!

smnorris commented 5 years ago

Yes thanks very much @webgismd. Hopefully the multi-threaded, multi-page requests aren't too taxing for the service.

webgismd commented 5 years ago

It would be good to have a heads up. How many threads? Is the output format always GeoJSON? I'm curious whether the upgrade to the Distribution Service (which will have an API) would be more efficient once it is in production. (It is currently in test, if your team is interested in testing it.)

smnorris commented 5 years ago

I default to 5 threads for multi-page requests, but that is over here: https://github.com/smnorris/bcdata.
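For a sense of what that load looks like, here is a rough sketch of a multi-page request spread over five worker threads. It is not the code from the linked repository: `page_params` and `fetch_all` are hypothetical names, the `outputFormat` value is an assumption about the server, and the `fetch` callable is left to the caller so nothing here touches the network.

```python
from concurrent.futures import ThreadPoolExecutor


def page_params(table, sort_col, n_features, page_size):
    """Build one WFS 2.0.0 GetFeature parameter set per page."""
    return [
        {
            "service": "WFS",
            "version": "2.0.0",
            "request": "GetFeature",
            "typeNames": "pub:" + table,
            "sortBy": sort_col,        # should be unique, e.g. OBJECTID
            "startIndex": start,
            "count": page_size,
            "outputFormat": "json",    # assumed format name; server-specific
        }
        for start in range(0, n_features, page_size)
    ]


def fetch_all(params_list, fetch, max_workers=5):
    """Run fetch(params) for every page on a small thread pool.

    pool.map returns results in the order of params_list, so pages stay in
    sequence even if requests finish out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, params_list))


# Stand-in for a real HTTP call: just echo which page was requested.
params = page_params("WHSE_FOREST_TENURE.FTEN_ROAD_SEGMENT_LINES_SVW",
                     "OBJECTID", n_features=25, page_size=10)
print(fetch_all(params, lambda p: p["startIndex"]))  # [0, 10, 20]
```

With `max_workers=5`, at most five page requests are in flight at once, so the thread count is the number to adjust if the service needs a lighter load.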