OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org
Other
4.89k stars 2.54k forks source link

OGR LVBAG Driver: data input fields extracted and formatted incorrectly #3217

Closed justb4 closed 3 years ago

justb4 commented 3 years ago

BAG Extract is a public dataset freely distributed by Dutch Kadaster. It contains all Buildings and Addresses for The Netherlands in a custom XML format with GML namespaces for geometries. The OGR-Driver LVBAG is meant to read from BAG Extract.

I tried to convert sample BAG data to various formats like PostGIS and GeoJSON but got the problems on each BAG Object type with fields not, or incorrectly appearing in output. Geometries are ok AFAICS.

As example taking Building ("Pand") object type but the problem is for all object types as there is a set of common fields. See further below.

Expected behaviour and actual behaviour.

Expect to convert Building data to e.g. GeoJSON. Relevant excerpts below:

Input sample XML element:

        <Objecten:Pand>
          <Objecten:identificatie domein="NL.IMBAG.Pand">0221100000317614</Objecten:identificatie>
          <Objecten:geometrie>
            <gml:Polygon srsName="urn:ogc:def:crs:EPSG::28992" srsDimension="3" gml:id="PND_0221100000317614_4_1432717947">
              <gml:exterior>
                <gml:LinearRing>
                  <gml:posList count="5">... (left out)</gml:posList>
                </gml:LinearRing>
              </gml:exterior>
            </gml:Polygon>
          </Objecten:geometrie>
          <Objecten:oorspronkelijkBouwjaar>2011</Objecten:oorspronkelijkBouwjaar>
          <Objecten:status>Pand in gebruik</Objecten:status>
          <Objecten:geconstateerd>N</Objecten:geconstateerd>
          <Objecten:documentdatum>2015-05-24</Objecten:documentdatum>
          <Objecten:documentnummer>VMR2015BAG001</Objecten:documentnummer>
          <Objecten:voorkomen>
            <Historie:Voorkomen>
              <Historie:voorkomenidentificatie>4</Historie:voorkomenidentificatie>
              <Historie:beginGeldigheid>2015-05-24</Historie:beginGeldigheid>
              <Historie:eindGeldigheid>2015-06-23</Historie:eindGeldigheid>
              <Historie:tijdstipRegistratie>2015-05-24T17:00:37.000</Historie:tijdstipRegistratie>
              <Historie:eindRegistratie>2015-06-24T17:25:04.000</Historie:eindRegistratie>
              <Historie:BeschikbaarLV>
                <Historie:tijdstipRegistratieLV>2015-05-27T11:12:27.890</Historie:tijdstipRegistratieLV>
                <Historie:tijdstipEindRegistratieLV>2015-06-24T17:30:48.545</Historie:tijdstipEindRegistratieLV>
              </Historie:BeschikbaarLV>
            </Historie:Voorkomen>
          </Objecten:voorkomen>
        </Objecten:Pand>

Expected Properties:

      "properties": {
        "identificatie": "0221100000317614",
        "oorspronkelijkBouwjaar": "2011",
        "status": "Pand in gebruik",
        "geconstateerd": false,
        "documentDatum": "2015-05-24",
        "documentNummer": "VMR2015BAG001",
        "voorkomenIdentificatie": 4,
        "beginGeldigheid": "2015-05-24",
        "eindGeldigheid": "2015-06-23",
        "tijdstipRegistratie": "2015-05-24T17:00:37",
        "eindRegistratie": "2015-06-24T17:25:04",
        "tijdstipRegistratieLV": "2015-05-27T11:12:27.890",
        "tijdstipEindRegistratieLV": "2015-06-24T17:30:48.545"
      },

But got GeoJSON Properties (or PostGIS columns):

      "properties": {
        "oorspronkelijkBouwjaar": "2011-01-01", <== note: should be "2011"!
        "lvID": ".", <== always '.' 
        "status": "Pand in gebruik",
        "geconstateerd": false,
        "documentDatum": "2015-05-24",
        "documentNummer": "VMR2015BAG001",
        "voorkomenIdentificatie": 4,
        "beginGeldigheid": "2015-05-24",
        "eindGeldigheid": "2015-06-23",
        "tijdstipRegistratie": "2015-05-24T17:00:37",
        "eindRegistratie": "2015-06-24T17:25:04",
        "tijdstipRegistratieLV": "2015-05-27T11:12:27.890",
        "tijdstipEindRegistratieLV": "2015-06-24T17:30:48.545"
      },

So problems are:

In Postgres and ogrinfo I see columns always Null for namespace, lokaalid and versie.

Steps to reproduce the problem.

Must use GDAL 3.2.0.

Observe above problems.

It looks like the LVBAG Driver and its test data is outdated? At least the BAG XSDs were recently adapted by Kadaster. See https://github.com/lvbag/BAG-2.0-Extract/tree/master/Vernieuwde%20XSD's . For example in the downloaded sample-data the XML header shows:

<sl-bag-extract:bagStand xmlns:DatatypenNEN3610="www.kadaster.nl/schemas/lvbag/imbag/datatypennen3610/v20200601" 
xmlns:Objecten="www.kadaster.nl/schemas/lvbag/imbag/objecten/v20200601" 
xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:Historie="www.kadaster.nl/schemas/lvbag/imbag/historie/v20200601" 
xmlns:Objecten-ref="www.kadaster.nl/schemas/lvbag/imbag/objecten-ref/v20200601" 
xmlns:nen5825="www.kadaster.nl/schemas/lvbag/imbag/nen5825/v20200601" 
xmlns:KenmerkInOnderzoek="www.kadaster.nl/schemas/lvbag/imbag/kenmerkinonderzoek/v20200601" xmlns:selecties-
extract="http://www.kadaster.nl/schemas/lvbag/extract-selecties/v20200601" xmlns:sl-bag-
extract="http://www.kadaster.nl/schemas/lvbag/extract-deelbestand-lvc/v20200601" 
xmlns:sl="http://www.kadaster.nl/schemas/standlevering-generiek/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-
instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" 
xsi:schemaLocation="http://www.kadaster.nl/schemas/lvbag/extract-deelbestand-lvc/v20200601 
http://www.kadaster.nl/schemas/bag-verstrekkingen/extract-deelbestand-
lvc/v20200601/BagvsExtractDeelbestandExtractLvc-2.1.0.xsd">

Or are there fixes in the LVBAG Driver not in GDAL 3.2.0? Could not relate closed issues to Milestones...

Operating system

Mac OSX 10.14.6 (Mojave)

GDAL version and provenance

GDAL 3.2.0, released 2020/10/26 from standard Homebrew installed on nov 17 2020.

rouault commented 3 years ago

CC @yorickdewid

yorickdewid commented 3 years ago

The initial draft for the BAG 2.0 came with a different header

<Objecten:identificatie>
    <DatatypenNEN3610:namespace>NL.IMBAG.Pand</DatatypenNEN3610:namespace>
    <DatatypenNEN3610:lokaalID>0106100000000000</DatatypenNEN3610:lokaalID>
</Objecten:identificatie>

Both testcases and parser expect this structure. I'll submit a PR reflecting this change.

The new BAG 2.0 should use the lvID identifier which includes the namespace (NL.IMBAG.Pand) and the local identifier. Due to the recent change in the XSD this now is just an empty field. We can rename the field from lvID to identificatie if that's preferred, but this will include the object namespace.

Regarding the build year; the XML data source only contains a year as integer. However the 'date' datatype proves to be more convenient when transforming to a format with date range support. @rouault Any suggestions? I'd propose an open option to not convert year to date.

rouault commented 3 years ago

@rouault Any suggestions?

If the field is a year, I would have just exposed it as a integer I guess. But up to you.

yorickdewid commented 3 years ago

There's another change in the XSD thats affecting the current LVBAG driver. Reference to another dataset is no longer using the xhref style. I'll account for this modification in the same PR.

justb4 commented 3 years ago

@rouault @yorickdewid thanks for your quick replies! This driver will be valuable for many within NL. Especially for the NLExtract Project, which has been working on BAG (v1) conversions/enrichments for over 10 years. We were just about to support BAG v2 when by chance discovered the LVBAG Driver. Now we can use a Stetl Processing Chain (which uses OGR from Python) like for the other national datasets ("Basisregistraties") like BGT, BRK etc. I'll be happy to further support LVBAG development, as there are quite some "nasty design" (issue upcoming) and data quality issues (like '9999' when building-year unknown) within the BAG we found.

Enough talking, my answers below:

yorickdewid commented 3 years ago
  • oorspronkelijkBouwjaar for Pand, prefer int type like in BAG v1 (was numeric(4) in PG BAG v1)

Done

  • identification: my pref is to use fieldname identificatie "BAG id" as 16 char as in BAG v1 without the namespace

One of the major changes in BAG 2.0 design is the namespace for international use. There are similar data sources from other organizations/states which can collide with the BAG 1.0 identifier. Having the namespace prefix would be preferable for new projects. That said, I understand it can be useful for backwards compatibility. I would suggest to add an open option, something like LEGACY_ID=YES, which will then drop the namespace. This makes it more sense to me then to regress to the old BAG 1.0 identifier style.

The current LVBAG driver will already set all building years above 2100 to NULL and displays a warning in the console. There are other issues, for example there is a large collection of invalid geometries. LVBAG will try and repair those polygons, but only if GDAL was build with GEOS support. All BAG identifiers which are not 16bytes are also repaired (an older issue).

justb4 commented 3 years ago
yorickdewid commented 3 years ago
  • identifier: LEGACY_ID=YES would be a good option. Note that the identifier is 16 chars, except for Woonplaats (4-char city code)

There is a check for WPL identifiers.

  • building year: from the XSD and the BAG management chain, ("keten") this field is required, hence civil servants fill in 9999 as a convention when year unknown (just as with missing postal codes). In NLExtract we try to follow the XSD just keep 9999. But no strong opinion for this one.

I'll have a better look at the XSD, but I think we'll keep the NULL value indicating an unknown value rather than an invalid value. This makes date/timestamp conversion much easier. A date can be NULL, but not 9999.

yorickdewid commented 3 years ago

Btw, there are other issues we're discussing. Buildings in Amsterdam can have their build year set to 1005 meaning 'unknown'. We can further improve the repair of polygons. There is a number of valid polygons which are incorrect (too large, surface does not match). For now we've decided this is not something the GDAL driver should fix.

justb4 commented 3 years ago

Yes, like said the BAG is a nasty beast :-(. Heard about the A'dam buildings.

@yorickdewid You are probably aware that the BAG contains its complete mutation history? Nothing is ever removed from the BAG. There are several status fields, within voorkomen structure now called, that determine the validity (actueel) and for several object types their existence, bestaand. e.g. a building (Pand) may be torn down, but is still in the BAG with that status. Some very large building polygons covering half of our country were there in the beginning, so still in the history. So are many invalid geoms.

In NLExtract we filter out the current + existing + valid objects via Postgres VIEWs. See the VIEW definitions here: e.g. for Buildings below.

BAG v2 has added additional status attributes to be added to this puzzle. We could think of several LVBAG Driver config options to do this filtering. This would suppress many invalid geometries as well. At least two config params I am thinking of: for Actueel and Bestaand. Best in separate issue IMO.

Again: great that this work is done!

DROP VIEW IF EXISTS pandactueel;
CREATE VIEW pandactueel AS
    SELECT  pand.gid,
            pand.identificatie,
            pand.aanduidingrecordinactief,
            pand.aanduidingrecordcorrectie,
            pand.officieel,
            pand.inonderzoek,
            pand.documentnummer,
            pand.documentdatum,
            pand.pandstatus,
            pand.bouwjaar,
            pand.begindatumtijdvakgeldigheid,
            pand.einddatumtijdvakgeldigheid,
            pand.geovlak
    FROM pand
    WHERE
      pand.begindatumtijdvakgeldigheid <= now()
      AND (pand.einddatumtijdvakgeldigheid is NULL OR pand.einddatumtijdvakgeldigheid >= now())
      AND pand.aanduidingrecordinactief = FALSE
      AND pand.geom_valid = TRUE;

DROP VIEW IF EXISTS pandactueelbestaand;
CREATE VIEW pandactueelbestaand AS
    SELECT  pand.gid,
            pand.identificatie,
            pand.aanduidingrecordinactief,
            pand.aanduidingrecordcorrectie,
            pand.officieel,
            pand.inonderzoek,
            pand.documentnummer,
            pand.documentdatum,
            pand.pandstatus,
            pand.bouwjaar,
            pand.begindatumtijdvakgeldigheid,
            pand.einddatumtijdvakgeldigheid,
            pand.geovlak
  FROM pand
   WHERE
     pand.begindatumtijdvakgeldigheid <= now()
     AND (pand.einddatumtijdvakgeldigheid is NULL OR pand.einddatumtijdvakgeldigheid >= now())
     AND pand.aanduidingrecordinactief = FALSE
     AND pand.geom_valid = TRUE
     AND (pand.pandstatus <> 'Niet gerealiseerd pand' 
       AND pand.pandstatus <> 'Pand gesloopt'
       AND pand.pandstatus <> 'Bouwvergunning verleend');
akolk commented 3 years ago

Like to add that correcting data, even when it is "incorrect" dates or invalid geometries is probably not the way to go. The owner of the data is the local municipality ("gemeente") if they deliver incorrect data to the LVBAG, that is what should be used. It is up to them to correct it. Fixing problems during a load will hide these issues and may not give the correct/official view of the data, but the corrected view according to the GDAL/LVBAG driver (which is not the same and official :-)).

justb4 commented 3 years ago

Agree, purpose of the driver is to render the BAG as-is. For geometries maybe an geom_valid flag field may help. And like I stated above, the incorrect data will be mainly in the historic records. Some users may even be keen on this historic data like for quality improvement history. Also errors should be submitted here: https://www.kadaster.nl/zakelijk/registraties/basisregistraties/bag/fout-in-bag-melden .

yorickdewid commented 3 years ago

GDAL is not only used in the context of an ETL process to load LVBAG into postgres but also as a geometric processor in spatial computation (see OSGeo sister projects) in which the BAG is just one of the data sources. We've discussed the potential for BAG 2.0 GIS products with the Kadaster back in jan. 2019. For a while there was the idea to ship BAG products as a GeoPackage (which most spatial processors can read without abstraction layer). The idea was eventually dropped. The LVBAG driver fills the gap between the BAG data source and scientific computation libraries such as PDAL/CGAL. GDAL has the option to select, query, prune and validate data before its used in other libraries.

To clarify:

justb4 commented 3 years ago

Ok, clear. To make LVBAG driver useful (see above on history) for direct processing, before any geometry-validation, the current ("actueel") and existing ("bestaand") objects need to be filtered out as a config option, IMHO. I expect this gets rid of the majority of geometry-errors. Famous is the building-polygon covering half of The Netherlands. NLExtract can also produce GeoPackage, as it is based on OGR. Surprises that Kadaster has no interest in this driver, but that can change. Many of its suppliers like FME and E..I use GDAL/OGR internally. Many, many hours have been spent there to get the BAG ETL to PostGIS and INSPIRE under control to drive web-services and APIs via PDOK.

fsteggink commented 3 years ago

@yorickdewid Great job 👍

fsteggink commented 3 years ago

Regarding invalid data in the BAG, like building years 1005 and 9999 and geometries which are obviously wrong, I'd like to stress that also the NLExtract project is leaving this data intact. With the "actueel" and "bestaand" views mentioned by Just, this data is simply hidden (the geometries are at least). Most users aren't interested in historical data, but only current data.

I agree with Just that it would be useful to add filters for current data only, but that is out of the scope of this issue. As well as adding support for mutation files 🙏

yorickdewid commented 3 years ago

Support for mutations was one of the features planned for the next GDAL major. However, its unclear what this will look like. OGR only adds an abstraction layer, but does not add any action to it. When converting from one format to another we could check if the destination supports the update feature. Another solution would be to load the mutations as auxiliary files, then do a 3-way compare with the extract and present the last version of each feature to the library client. The latter will also work in situations where GDAL is used a library, for example QGIS.

justb4 commented 3 years ago

@yorickdewid I think there is a slight confusion here: both me and @fsteggink are not hinting on BAG mutation support. That is a totally different function plus a different input source data XML structure ("mutatie-bestanden"). What we were referring to is that the regular BAG LV XML source data contains complete history of all mutations where each occurrence of a BAG object in time is completely present. "endtime" ("*eindtijd") is set to denote that an object is historic (though "endtime" may be in future as well...). In practice direct parsing and usage will now be experienced as multiple objects/records with the same BAG id. Also the order is not significant. But it is quite straightforward to filter these objects on both existence ("actueel") and presence ("state, like is the building not torn down) plus some other flags ("record inactief"), all within the "voorkomen" structure. That is what NLExtract does in Postgres VIEWs (see above) but that logic can be the same within the LVBAG Driver.

I would even say: for the LVBAG Driver to be useful for any direct purpose like processing, analysis or say loading in QGIS, it needs to filter out the "actueel-bestaande" objects to be useful. Otherwise any direct application will experience "duplicate records" and all faults from the entire history...

justb4 commented 3 years ago

@yorickdewid Looks like changes for identificatie and bouwjaar are in the Laixer gdal branch like this commit solving already part if not all, of this issue.

But the branch contains also many other commits, also of use.

One comment there: LVBAG Driver still refers to outdated XSD: https://github.com/Laixer/gdal/blob/master/gdal/ogr/ogrsf_frmts/lvbag/ogrlvbagdriver.cpp#L61 Latest XSD AFAIK is v20200928 which is I think in line with the above field-changes.

There is off course some merging with the GDAL master branch first required. Is a PR planned? If needed I can test any builds. Easiest for me is via a GDAL Docker image.

yorickdewid commented 3 years ago

All fixes are already made, just need to update the tests. Probably tomorrow.