ckan / ckanext-spatial

Geospatial extension for CKAN
http://docs.ckan.org/projects/ckanext-spatial
126 stars 193 forks source link

Support of ISO19115-3:2016 Geospatial Standard #212

Open camfindlay opened 5 years ago

camfindlay commented 5 years ago

What would be require to add support for this standard in this module?

We have agencies that are outputting CSW originating from this standard and the current CSW harvest in this module error when attempting to ingest this.

I note this is referenced over in this GH issue too https://github.com/GSA/data.gov/issues/764

Keen to get a read on what's required so I can look at potentially tackling this soon.

Thanks all.

kalxas commented 5 years ago

A good starting point would be implementing support for ISO19115-3:2016 in OWSLib.

camfindlay commented 5 years ago

@kalxas thanks, as in it's not supported in https://github.com/geopython/OWSLib ? and the spatial extension for CKAN is using this as a dependancy? If I look to get this solved, would be great to make sure peoples input, knowledge and approaches can be included :) Any other insight into chasing this would be appreciated (feel free to brain dump here).

kalxas commented 5 years ago

Actually there is an iso parser in https://github.com/ckan/ckanext-spatial/blob/master/ckanext/spatial/model/harvested_metadata.py but OWSLib is also used in other places.

fostermh commented 3 years ago

It's a bit of a hack but as part of setting up cioos.ca we have forked ckanext-spatial and started modifying it to support harvesting iso19115-3. https://github.com/cioos-siooc/ckanext-spatial

Most of the changes to date have been in https://github.com/cioos-siooc/ckanext-spatial/blob/cioos-siooc/ckanext/spatial/model/harvested_metadata.py. I would love a figure out a way to do this so that we could maintain support for iso19139 as well and contribute back to the community.

ccancellieri commented 2 years ago

Hi @fostermh, what's the status of this fork? is it implementing iso19115-3 completely?

If so I could try to provide a pull req based on your fork very soon.

fostermh commented 2 years ago

It's not 100% coverage, unfortunately. It does handle most of the required and recommended fields but some of the optional fields are not currently part of our standard and so have not yet been addressed. At a quick look, there appear to be around 25 fields that do not currently have an iso19115-3 xpath. It would probably take a day or two to finish that up. If there is interest I could work on finishing up the last few fields next week.

ccancellieri commented 2 years ago

I'm checking changes, I see quite a lot of interesting stuff, maybe I can cherry pick some other than https://github.com/ckan/ckanext-spatial/compare/master...ccancellieri:cioos-siooc?expand=1

Any suggestions? (f.e.: I see an interesting GeoNetworkHarvester, currently I was planning to use CSW with a full profile but I'm not sure it will ship all the iso fields)

Should I / would you like to create a raw pull req (only to share comments and improvements)

looking forward for comments / suggestions.

Note: I'd love to keep PR as small and focused as possible so it will be easy to merge and review.

fostermh commented 2 years ago

I think if the focus is ISO19115-3 support, then ckanext-spatial/ckanext/spatial/model/harvested_metadata.py is likely the only file of interest. Yes, we did quite a bit with harvesters. That was more related to how data flows between organizations in CIOOS so probably best to ignore it for the moment. No need to confuse the issue more than it already is. :-)

A raw pull request with just the minimal changes needed sounds like a great idea. I will put a PR together and tag this issue when it's ready.

ccancellieri commented 2 years ago

Great, quick note, I'm testing a port of that file to my branch but looks like it requires (over Alpine-3.14):

apk add gdal
apk add gdal-dev
pip install gdal

due to osgeo import:

def infer_spatial(self, values):
...
            try:
                geom = ogr.CreateGeometryFromGML(xmlGeom)
            except Exception:
                try:
                    geom = ogr.CreateGeometryFromWkt(xmlGeom)
                except Exception:
                    try:
                        geom = ogr.CreateGeometryFromJson(xmlGeom)
...

Adding so many dependencies can be tricky (but can be discussed with @amercader and maintainers) do you think that feature is necessary or not reproducible withoud gdal?

Thanks

fostermh commented 2 years ago

It seems likely to me that spatial information will come in the form of GML, I added a couple of other options because I could but we are using GML internally I think. It seemed troublesome to store the GML in solr, and not very useful down the line, so I converted to GeoJSON. I used ogr for this. If we only support GML input then a custom function to parse it would be possible for sure. So.. yes I think it could be done without osgeo but seemed like a lot of work at the time.

ccancellieri commented 2 years ago

Ciao, hope this helps:

I'm porting all the necessary changes to the schema (heavily based on your work) here: https://github.com/ckan/ckanext-spatial/compare/master...ccancellieri:my_master?expand=1

That is 'my_master' and is shipping also some minor improvement I'm providing as isolated pull request (mainly over csw), but necessary to go on supporting other types and ... in my opinion not bad to have. Those will be wiped out from this branch once merged over master... @amercader I'm sorry about pushing :)

Focusing on the ckanext/spatial/model/harvested_metadata.py I've some question related to backward compatibility....

I'm seeing that most of them are compatible but some of them (please search for '#TODO' string) are not, f.e. change in multiplicity

I'm suspecting that a new ISO family should be created instead of enriching the existing one.

This implies a better iso 'guesser' which is doable but could be tricky and verbose!!! @amercader @fostermh what do you suggest?

Thank you!

fostermh commented 2 years ago

Unfortunately, the link you provided doesn't work for me.

As to your comments regarding backward compatibility, this was exactly my issue as well. The paths and multiplicity can likely be taken care of with relative ease. for example, if iso19115 expects multiple entries and iso19139 expected one we could adjust the value in an 'infer' function based on which standard was being harvested.

The primary issue I was having is the namespaces that need to change between standards. If we use the same standard for all datasets in a harvester (possibly across one ckan) then we should be able to adjust based on a harvester config setting or ckan.ini setting. If we want to support a mix of standards then it gets tricky. It has been a while since I looked at it, but I believe the namespaces are set during the first harvest creation. So if the first dataset being harvested is ISO19115 standard, the namespaces are set up to support this standard and all harvest objects after will use the same namespace. Obviously, that will not work if you are harvesting a mix of ISO19139 and ISO19115 xml files, for example.

ccancellieri commented 2 years ago

Unfortunately, the link you provided doesn't work for me. Strange, here you could find that file (with '#TODO' ) here also: https://github.com/ccancellieri/ckanext-spatial/blob/my_master/ckanext/spatial/model/harvested_metadata.py Also required some small change to base.py (most are [] replaced by .get()).

As to your comments regarding backward compatibility, this was exactly my issue as well. The paths and multiplicity can likely be taken care of with relative ease. for example, if iso19115 expects multiple entries and iso19139 expected one we could adjust the value in an 'infer' function based on which standard was being harvested.

Mmm this is tricky, I'm not sure I'm understanding. What I see is that the function guessing the iso needs some improvement but I'm not sure in which way to contribute, if we touch too many parts it would be very difficult to merge not having a deep test coverage.

What I'm thinking is for sure to

I would really appreciate to have support on any of these from anyone (also approvals or kind of blockers...).

The primary issue I was having is the namespaces that need to change between standards. If we use the same standard for all datasets in a harvester (possibly across one ckan) then we should be able to adjust based on a harvester config setting or ckan.ini setting.

Right I've also had some issue to download them with csw, now should be possible to configure an OutputSchema from the json which configure the harvester. (see below)

If we want to support a mix of standards then it gets tricky. It has been a while since I looked at it, but I believe the namespaces are set during the first harvest creation. So if the first dataset being harvested is ISO19115 standard, the namespaces are set up to support this standard and all harvest objects after will use the same namespace. Obviously, that will not work if you are harvesting a mix of ISO19139 and ISO19115 xml files, for example.

This is interesting, but can be handled (not easi but doable) with different harvesters configured as specified here: https://github.com/ckan/ckanext-spatial/pull/259

Thank you for your help!

FYI: My short term goal is to implement a set of harvester to import the following formats:

fostermh commented 2 years ago

ok I see TODO's and can lend a hand there. do you want me to do a PR to https://github.com/ccancellieri/ckanext-spatial or would you like to start a draft PR from https://github.com/ccancellieri/ckanext-spatial to https://github.com/ckan/ckanext-spatial and I can comment/push there? I think we talked about me starting a raw PR but it looks like you have most of it already and I would just need to remove/comment on a few bits.

ccancellieri commented 2 years ago

I added line by line comments #iso19115-3 as you did. I did that with blocks (separated by a space to the end. Then row by row, so it should be quite accurate and the conflicts can be commented/addressed hopefully one by one. Thanks!

FYI:

this branch is rebased to master and isolated to those 2 motels files but still requires #259 for multiple formats if you want to provide inline comments i can adjust them following your suggestions if you want to work from your ide please fork my branch and provide fixes on it via pull req, so we can merge on top of this

ccancellieri commented 2 years ago

The plugin looks far more complex than expected, I think:

We've to provide a new harvester implementation providing an ad-hoc ISOElement(s) hierarchy not touching base.py and harvested_metadata.py (which are dedicated to iso19139).

For this reason, please hold-on few days I'll try to produce a better integration and I'll give you the go to in a couple of days to pull if interested in contributing. A dedicated plugin can also be provided but would love to integrate it here to reduce thousands of duplicated rows.

I'll try to give both options

@fostermh stay tuned! :)

ccancellieri commented 2 years ago

Ok with the latest push (I apologize, I have had to rebase and squash to make it clear) we are ready to discuss:

There are several ticket/issues to open to be able to reuse code but I think most of them are summarized by:

https://github.com/ccancellieri/ckanext-spatial/blob/iso19115-3/ckanext/spatial/harvesters/iso19115/spatial_harvester.py https://github.com/ccancellieri/ckanext-spatial/blob/iso19115-3/ckanext/spatial/harvesters/iso19115/harvester.py

Now why we have 2 harvester?

The explanation is quite complex and mainly depends on several tickets that can be opened discussed (look at TODO).

The the main harvester is the iso19115_csw_harvester (harvester.py - which must be selected as harvester type into the configuration type), it implements IHarvester delegating as much as possible to the parent the tasks (CSWHarvester).

The iso19115 should be loaded as well (spatial_harvester.py - must be present into the plugins list) but it is mainly used to implement ISpatialHarvester to override some other method from base.py which is not so fault tolerant (most of this code may go if master will accept contributions.

As you see now we have a separate model: https://github.com/ccancellieri/ckanext-spatial/blob/iso19115-3/ckanext/spatial/harvesters/iso19115/model.py

Hope now we can remove all the 19139 parts from there, feel free to contribute on it, it should not be backward compatible anymore (but still may/should support all the 19115 versions).

Which is used if any of the validators validate the incoming metadata

otherwise other plugins will be used to test against the incoming metadata and the first match will be used.

This means that the plugin is able to harvest both is19115-(1,2,3) and iso19139 and the recognition is automatic.

@amercader I'm trying to develop following indications from ISpatialHarvester and IHarvester but looks like they are not so well integrated, should I open some tickets to fix them? I'll probably open a discussion since I feel there's still some missing but useful method to add to the ISpatialHarvester interface

@torfsen over to you on the model.py!

torfsen commented 2 years ago

@ccancellieri: I'm not sure what you meant when mentioning me ("over to you on the model.py"). Unfortunately I'm currently not able to contribute much time to CKAN, but if you have a specific question for me I'll see what I can do.

fostermh commented 2 years ago

Thanks @ccancellieri, I have been sidetracked by a few issues but will get back to this soon.

frafra commented 2 years ago

Is there anything that can be done to support this effort?