Closed justb4 closed 3 years ago
CC @yorickdewid
The initial draft for the BAG 2.0 came with a different header
<Objecten:identificatie>
<DatatypenNEN3610:namespace>NL.IMBAG.Pand</DatatypenNEN3610:namespace>
<DatatypenNEN3610:lokaalID>0106100000000000</DatatypenNEN3610:lokaalID>
</Objecten:identificatie>
Both testcases and parser expect this structure. I'll submit a PR reflecting this change.
The new BAG 2.0 should use the lvID
identifier which includes the namespace (NL.IMBAG.Pand
) and the local identifier. Due to the recent change in the XSD this now is just an empty field. We can rename the field from lvID
to identificatie
if that's preferred, but this will include the object namespace.
Regarding the build year; the XML data source only contains a year as integer. However the 'date' datatype proves to be more convenient when transforming to a format with date range support. @rouault Any suggestions? I'd propose an open option to not convert year to date.
@rouault Any suggestions?
If the field is a year, I would have just exposed it as a integer I guess. But up to you.
There's another change in the XSD thats affecting the current LVBAG driver. Reference to another dataset is no longer using the xhref style. I'll account for this modification in the same PR.
@rouault @yorickdewid thanks for your quick replies! This driver will be valuable for many within NL. Especially for the NLExtract Project, which has been working on BAG (v1) conversions/enrichments for over 10 years. We were just about to support BAG v2 when by chance discovered the LVBAG Driver. Now we can use a Stetl Processing Chain (which uses OGR from Python) like for the other national datasets ("Basisregistraties") like BGT, BRK etc. I'll be happy to further support LVBAG development, as there are quite some "nasty design" (issue upcoming) and data quality issues (like '9999' when building-year unknown) within the BAG we found.
Enough talking, my answers below:
oorspronkelijkBouwjaar
for Pand, prefer int
type like in BAG v1 (was numeric(4)
in PG BAG v1)identificatie
"BAG id" as 16 char as in BAG v1 without the namespacenamespace
still to be in, make it a separate field but it always has the same value...and is implicit
oorspronkelijkBouwjaar
for Pand, preferint
type like in BAG v1 (wasnumeric(4)
in PG BAG v1)
Done
- identification: my pref is to use fieldname
identificatie
"BAG id" as 16 char as in BAG v1 without the namespace
One of the major changes in BAG 2.0 design is the namespace for international use. There are similar data sources from other organizations/states which can collide with the BAG 1.0 identifier. Having the namespace prefix would be preferable for new projects. That said, I understand it can be useful for backwards compatibility. I would suggest to add an open option, something like LEGACY_ID=YES
, which will then drop the namespace. This makes it more sense to me then to regress to the old BAG 1.0 identifier style.
The current LVBAG driver will already set all building years above 2100 to NULL and displays a warning in the console. There are other issues, for example there is a large collection of invalid geometries. LVBAG will try and repair those polygons, but only if GDAL was build with GEOS support. All BAG identifiers which are not 16bytes are also repaired (an older issue).
LEGACY_ID=YES
would be a good option. Note that the identifier is 16 chars, except for Woonplaats
(4-char city code)
- identifier:
LEGACY_ID=YES
would be a good option. Note that the identifier is 16 chars, except forWoonplaats
(4-char city code)
There is a check for WPL identifiers.
- building year: from the XSD and the BAG management chain, ("keten") this field is required, hence civil servants fill in 9999 as a convention when year unknown (just as with missing postal codes). In NLExtract we try to follow the XSD just keep 9999. But no strong opinion for this one.
I'll have a better look at the XSD, but I think we'll keep the NULL value indicating an unknown value rather than an invalid value. This makes date/timestamp conversion much easier. A date can be NULL, but not 9999.
Btw, there are other issues we're discussing. Buildings in Amsterdam can have their build year set to 1005 meaning 'unknown'. We can further improve the repair of polygons. There is a number of valid polygons which are incorrect (too large, surface does not match). For now we've decided this is not something the GDAL driver should fix.
Yes, like said the BAG is a nasty beast :-(. Heard about the A'dam buildings.
geom_valid
to each object type with a geometry. @yorickdewid You are probably aware that the BAG contains its complete mutation history? Nothing is ever removed from the BAG. There are several status fields, within voorkomen
structure now called, that determine the validity (actueel
) and for several object types their existence, bestaand
. e.g. a building (Pand) may be torn down, but is still in the BAG with that status. Some very large building polygons covering half of our country were there in the beginning, so still in the history. So are many invalid geoms.
In NLExtract we filter out the current + existing + valid objects via Postgres VIEWs. See the VIEW definitions here: e.g. for Buildings below.
BAG v2 has added additional status attributes to be added to this puzzle. We could think of several LVBAG Driver config options to do this filtering. This would suppress many invalid geometries as well. At least two config params I am thinking of: for Actueel and Bestaand. Best in separate issue IMO.
Again: great that this work is done!
DROP VIEW IF EXISTS pandactueel;
CREATE VIEW pandactueel AS
SELECT pand.gid,
pand.identificatie,
pand.aanduidingrecordinactief,
pand.aanduidingrecordcorrectie,
pand.officieel,
pand.inonderzoek,
pand.documentnummer,
pand.documentdatum,
pand.pandstatus,
pand.bouwjaar,
pand.begindatumtijdvakgeldigheid,
pand.einddatumtijdvakgeldigheid,
pand.geovlak
FROM pand
WHERE
pand.begindatumtijdvakgeldigheid <= now()
AND (pand.einddatumtijdvakgeldigheid is NULL OR pand.einddatumtijdvakgeldigheid >= now())
AND pand.aanduidingrecordinactief = FALSE
AND pand.geom_valid = TRUE;
DROP VIEW IF EXISTS pandactueelbestaand;
CREATE VIEW pandactueelbestaand AS
SELECT pand.gid,
pand.identificatie,
pand.aanduidingrecordinactief,
pand.aanduidingrecordcorrectie,
pand.officieel,
pand.inonderzoek,
pand.documentnummer,
pand.documentdatum,
pand.pandstatus,
pand.bouwjaar,
pand.begindatumtijdvakgeldigheid,
pand.einddatumtijdvakgeldigheid,
pand.geovlak
FROM pand
WHERE
pand.begindatumtijdvakgeldigheid <= now()
AND (pand.einddatumtijdvakgeldigheid is NULL OR pand.einddatumtijdvakgeldigheid >= now())
AND pand.aanduidingrecordinactief = FALSE
AND pand.geom_valid = TRUE
AND (pand.pandstatus <> 'Niet gerealiseerd pand'
AND pand.pandstatus <> 'Pand gesloopt'
AND pand.pandstatus <> 'Bouwvergunning verleend');
Like to add that correcting data, even when it is "incorrect" dates or invalid geometries is probably not the way to go. The owner of the data is the local municipality ("gemeente") if they deliver incorrect data to the LVBAG, that is what should be used. It is up to them to correct it. Fixing problems during a load will hide these issues and may not give the correct/official view of the data, but the corrected view according to the GDAL/LVBAG driver (which is not the same and official :-)).
Agree, purpose of the driver is to render the BAG as-is. For geometries maybe an geom_valid
flag field may help. And like I stated above, the incorrect data will be mainly in the historic records. Some users may even be keen on this historic data like for quality improvement history. Also errors should be submitted here: https://www.kadaster.nl/zakelijk/registraties/basisregistraties/bag/fout-in-bag-melden .
GDAL is not only used in the context of an ETL process to load LVBAG into postgres but also as a geometric processor in spatial computation (see OSGeo sister projects) in which the BAG is just one of the data sources. We've discussed the potential for BAG 2.0 GIS products with the Kadaster back in jan. 2019. For a while there was the idea to ship BAG products as a GeoPackage (which most spatial processors can read without abstraction layer). The idea was eventually dropped. The LVBAG driver fills the gap between the BAG data source and scientific computation libraries such as PDAL/CGAL. GDAL has the option to select, query, prune and validate data before its used in other libraries.
To clarify:
AUTOCORRECT_INVALID_DATA=NO
to disable this functionality)Ok, clear. To make LVBAG driver useful (see above on history) for direct processing, before any geometry-validation, the current ("actueel") and existing ("bestaand") objects need to be filtered out as a config option, IMHO. I expect this gets rid of the majority of geometry-errors. Famous is the building-polygon covering half of The Netherlands. NLExtract can also produce GeoPackage, as it is based on OGR. Surprises that Kadaster has no interest in this driver, but that can change. Many of its suppliers like FME and E..I use GDAL/OGR internally. Many, many hours have been spent there to get the BAG ETL to PostGIS and INSPIRE under control to drive web-services and APIs via PDOK.
@yorickdewid Great job 👍
Regarding invalid data in the BAG, like building years 1005 and 9999 and geometries which are obviously wrong, I'd like to stress that also the NLExtract project is leaving this data intact. With the "actueel" and "bestaand" views mentioned by Just, this data is simply hidden (the geometries are at least). Most users aren't interested in historical data, but only current data.
I agree with Just that it would be useful to add filters for current data only, but that is out of the scope of this issue. As well as adding support for mutation files 🙏
Support for mutations was one of the features planned for the next GDAL major. However, its unclear what this will look like. OGR only adds an abstraction layer, but does not add any action to it. When converting from one format to another we could check if the destination supports the update
feature. Another solution would be to load the mutations as auxiliary files, then do a 3-way compare with the extract and present the last version of each feature to the library client. The latter will also work in situations where GDAL is used a library, for example QGIS.
@yorickdewid I think there is a slight confusion here: both me and @fsteggink are not hinting on BAG mutation support. That is a totally different function plus a different input source data XML structure ("mutatie-bestanden"). What we were referring to is that the regular BAG LV XML source data contains complete history of all mutations where each occurrence of a BAG object in time is completely present. "endtime" ("*eindtijd") is set to denote that an object is historic (though "endtime" may be in future as well...). In practice direct parsing and usage will now be experienced as multiple objects/records with the same BAG id. Also the order is not significant. But it is quite straightforward to filter these objects on both existence ("actueel") and presence ("state, like is the building not torn down) plus some other flags ("record inactief"), all within the "voorkomen" structure. That is what NLExtract does in Postgres VIEWs (see above) but that logic can be the same within the LVBAG Driver.
I would even say: for the LVBAG Driver to be useful for any direct purpose like processing, analysis or say loading in QGIS, it needs to filter out the "actueel-bestaande" objects to be useful. Otherwise any direct application will experience "duplicate records" and all faults from the entire history...
@yorickdewid Looks like changes for identificatie
and bouwjaar
are in the Laixer gdal branch like this commit solving already part if not all, of this issue.
But the branch contains also many other commits, also of use.
One comment there: LVBAG Driver still refers to outdated XSD: https://github.com/Laixer/gdal/blob/master/gdal/ogr/ogrsf_frmts/lvbag/ogrlvbagdriver.cpp#L61 Latest XSD AFAIK is v20200928 which is I think in line with the above field-changes.
There is off course some merging with the GDAL master branch first required. Is a PR planned? If needed I can test any builds. Easiest for me is via a GDAL Docker image.
All fixes are already made, just need to update the tests. Probably tomorrow.
BAG Extract is a public dataset freely distributed by Dutch Kadaster. It contains all Buildings and Addresses for The Netherlands in a custom XML format with GML namespaces for geometries. The OGR-Driver
LVBAG
is meant to read from BAG Extract.I tried to convert sample BAG data to various formats like PostGIS and GeoJSON but got the problems on each BAG Object type with fields not, or incorrectly appearing in output. Geometries are ok AFAICS.
As example taking Building ("Pand") object type but the problem is for all object types as there is a set of common fields. See further below.
Expected behaviour and actual behaviour.
Expect to convert Building data to e.g. GeoJSON. Relevant excerpts below:
Input sample XML element:
Expected Properties:
But got GeoJSON Properties (or PostGIS columns):
So problems are:
identificatie
16 char, missingoorspronkelijkBouwjaar
should be single Year, not YY-MM-DDlvID
?In Postgres and
ogrinfo
I see columns always Null fornamespace
,lokaalid
andversie
.Steps to reproduce the problem.
Must use GDAL 3.2.0.
BAGGEM0221L-15092020.zip
(Doesburg)0221PND15092020-000002.xml
ogr2ogr -f GeoJSON 0221PND15092020-000002.json 0221PND15092020-000002.xml
ogrinfo -al data/lv/0221PND15092020-000002.xml | more
Observe above problems.
It looks like the LVBAG Driver and its test data is outdated? At least the BAG XSDs were recently adapted by Kadaster. See https://github.com/lvbag/BAG-2.0-Extract/tree/master/Vernieuwde%20XSD's . For example in the downloaded sample-data the XML header shows:
Or are there fixes in the LVBAG Driver not in GDAL 3.2.0? Could not relate closed issues to Milestones...
Operating system
Mac OSX 10.14.6 (Mojave)
GDAL version and provenance
GDAL 3.2.0, released 2020/10/26 from standard Homebrew installed on nov 17 2020.