VLucet / rgovcan

Easy access to the Canadian Open Government Portal
https://vlucet.github.io/rgovcan/

`esri rest` metadata files #17

Open david-beauchesne opened 3 years ago

david-beauchesne commented 3 years ago

First, I want to thank and commend you for this package. It is already an invaluable resource for accessing Canadian data and will likely become even more so. @KevCaz and I have already begun to publicize its existence through workshops, and it has been a revelation for many workshop attendees!

I have a question and a comment regarding the resources downloaded using the package, and this is a conversation I have already had with @KevCaz. The esri rest data files accompanying spatial data contain valuable metadata information on the downloaded data, information that I would think must accompany the data by default, not even as an option. Yet, unless I am missing something, the package does not support loading these files and instead returns the following message:

(esri rest) ⚠ skipped (not supported).

Is there a reason for not downloading the esri rest data, and would it be possible to modify the package so that the metadata are automatically downloaded, or at least that loading them is an option?

I do hope that I am not asking a question that is already answered in full in the documentation. If that is the case, I humbly apologize and would be glad to be directed back to it!

If you can and want to modify the package to include the metadata, I would be more than happy to contribute if I can be of any assistance.

I thank you in advance and I wish you a great day!

KevCaz commented 3 years ago

the esri rest data files accompanying spatial data contain valuable metadata information on the downloaded data, information that I would think must accompany the data by default, not even as an option.

I think ESRI REST resources are actually web geo-services, so we cannot really handle them; at least I don't see any easy option. Sometimes they look like metadata, as you mentioned (e.g. https://maps-cartes.ec.gc.ca/arcgis/rest/services/CriticalHabitatAlanticHabitatEnPerilAtlantique/MapServer/22), and sometimes they are maps (e.g. https://open.canada.ca/data/en/dataset/21db759b-9a74-43b2-9a5f-5749bb47c97d).

That being said, a lot of metadata is returned when a package is accessed and, so far, the package does not provide any features or utilities to access or display it. An advanced user playing with the package might realize that at some point, but I guess we can do better.

KevCaz commented 3 years ago

I've spent some time looking at the metadata, and they vary between packages (compare "b7ca71fa-6265-46e7-a73c-344ded9212b0" and "792bb73a-f758-4459-b7e9-0c286a0bc15d"). I guess there is a core of metadata that is always available, but I don't know how to identify it. To improve the user experience, one option would be to return the package data in an even more formatted way (we would select the data and put them in one or several datasets) while still letting the user work with the raw format (ckan_package). This is what format_resources already does, but I would go further.

VLucet commented 3 years ago

Hi @david-beauchesne, apologies for the long wait on this. Thank you for your comments. @KevCaz is right about the ESRI stuff being geo services, and I don't see any easy option for the metadata. It's a bit of a wild card as to what is actually contained in these things (as far as my knowledge goes). Sorry @KevCaz, I don't follow what you mean in your last message. Would a good solution be to attempt to download the metadata and fail gracefully if the content is something like a map?
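As a rough illustration of the "attempt and fail gracefully" idea, here is a minimal sketch. It is not part of rgovcan: `fetch_esri_metadata()` is a hypothetical helper, and it assumes that the resource URL points at an ArcGIS REST endpoint, which can usually be asked for a JSON description by adding `f=json` to the query. The default reader assumes the jsonlite package is installed; any other function that takes a URL could be swapped in.

```r
# Hypothetical helper, not part of rgovcan.
# Assumption: `url` is an ArcGIS REST endpoint without an existing query string.
fetch_esri_metadata <- function(url,
                                fetch = function(u) jsonlite::fromJSON(u)) {
  # Ask the service for a JSON description of itself
  query_url <- paste0(url, "?f=json")
  tryCatch(
    fetch(query_url),
    error = function(e) {
      # Fail gracefully: tell the user what happened and hand back nothing
      message("Could not read ESRI REST metadata; this is a web service at: ", url)
      NULL
    }
  )
}
```

When the content cannot be read as JSON (for instance because the link leads to a map page), the function returns NULL and only reports the URL, so the rest of a download loop could carry on.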

david-beauchesne commented 3 years ago

Thank you for taking the time to answer my question @KevCaz and @VLucet!

I've also looked a bit more into the ESRI stuff and realized that it's a bit more complex than I initially thought. It's a shame that the standard metadata presented on the web page do not seem to be available through the API. They would at least have given us a way of standardizing some sort of metadata to accompany each dataset loaded through the package.

I see that each resource seems to have metadata files (xml, json). Are those files accessible through the API?

I sent an inquiry to Open Government to ask about metadata and data citation. I will let you know if they reply.

Thank you again for answering my question!

KevCaz commented 3 years ago

Hi @david-beauchesne and @VLucet ,

Sorry @KevCaz, I don't follow what you mean in your last message. Would a good solution be to attempt to download the metadata and fail gracefully if the content is something like a map?

My message wasn't very clear (to say the least). We cannot do much, maybe just return a message saying that this is a web service, along with the URL. What I was saying is that there is various information in a package, and I think what @david-beauchesne needs is in there. For instance:

R> pid <- "792bb73a-f758-4459-b7e9-0c286a0bc15d"

R> pkg <- govcan_get_record(pid, format_resources = TRUE)
ℹ Searching for dataset with id:  792bb73a-f758-4459-b7e9-0c286a0bc15d
ℹ Record found: "Canada's War Dead - Honour Roll"

R> names(pkg)
 [1] "notes_translated"                 "imso_approval"                    "maintainer"                       "creator"                          "association_type"                
 [6] "org_section"                      "jurisdiction"                     "private"                          "maintainer_email"                 "num_tags"                        
[11] "contributor"                      "frequency"                        "keywords"                         "data_series_issue_identification" "ready_to_publish"                
[16] "id"                               "metadata_created"                 "subject"                          "relationships_as_object"          "owner_org"                       
[21] "spatial_representation_type"      "time_period_coverage_start"       "metadata_modified"                "author"                           "author_email"                    
[26] "geographic_region"                "position_name"                    "tags"                             "digital_object_identifier"        "state"                           
[31] "version"                          "spatial"                          "creator_user_id"                  "title_translated"                 "type"                            
[36] "resources"                        "place_of_publication"             "num_resources"                    "topic_category"                   "restrictions"                    
[41] "title"                            "collection"                       "org_title_at_publication"         "date_published"                   "relationships_as_subject"        
[46] "display_flags"                    "groups"                           "license_id"                       "data_series_name"                 "revision_id"                     
[51] "portal_release_date"              "name"                             "isopen"                           "date_modified"                    "url"                             
[56] "notes"                            "license_title"                    "audience"                         "license_url"                      "program_page_url"                
[61] "organization"                     "metadata_contact"                

R> pkg$license_id
[1] "ca-ogl-lgo"

R> pkg$license_title
[1] "Open Government Licence - Canada"

R> pkg$license_url
[1] "http://open.canada.ca/en/open-government-licence-canada"

My point was that we should try to organize these in a better way (and that may prove tricky, as the information available varies among resources).
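One way to picture this kind of organization is the minimal sketch below. It is not existing rgovcan functionality: `extract_core_metadata()` and the `core_fields` vector are hypothetical, and the choice of fields is only an assumption based on the names listed above. Fields that a given package does not provide, or that are not simple scalars, are returned as NA so the output always has the same shape.

```r
# Hypothetical helper, not part of rgovcan.
# Assumption: these fields form a reasonable "core"; the real core is unknown.
core_fields <- c("title", "id", "license_title", "license_url",
                 "date_published", "metadata_modified")

extract_core_metadata <- function(pkg, fields = core_fields) {
  vapply(fields, function(f) {
    value <- pkg[[f]]
    # Keep single atomic values only; missing or nested entries become NA
    if (is.atomic(value) && length(value) == 1) {
      as.character(value)
    } else {
      NA_character_
    }
  }, character(1))
}

# Stand-in for a record, using values shown in the output above
pkg_example <- list(
  title = "Canada's War Dead - Honour Roll",
  id = "792bb73a-f758-4459-b7e9-0c286a0bc15d",
  license_title = "Open Government Licence - Canada",
  license_url = "http://open.canada.ca/en/open-government-licence-canada"
)
core <- extract_core_metadata(pkg_example)
```

Here `core` is a named character vector with one entry per requested field; `date_published` and `metadata_modified` are NA because the stand-in list does not include them.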

david-beauchesne commented 3 years ago

Thank you @KevCaz for the clarification. We could indeed make use of these to get the required information for some sort of metadata for loaded resources.

I also received an answer from Open Government:

Thanks for your question about metadata and using the Open Data Portal API within R.

Personally I’ve been using the ckanr package (available on CRAN) within R for some of the work I do here on the team. This works great for accessing the metadata in JSON, but does not expose the metadata in XML.

The Open Government Portal uses a commonly used software called CKAN as the data catalogue software. This is used by a number of open data portals worldwide such as data.gov, data.gov.uk, etc. There is detailed API documentation available for CKAN in the API guide of the CKAN 2.8.8 documentation.

Here is an example in R of a simple script to get the metadata on a dataset vs. a resource:

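library(ckanr)
# Point ckanr at the Open Government Portal
ckanr_setup(url = "https://open.canada.ca/data")
# Dataset-level metadata for the first dataset in the list
list_of_datasets <- package_list(limit = 10)
dataset1 <- package_show(list_of_datasets[[1]])
# Resource-level metadata for the first resource of that dataset
resource1_id <- dataset1$resources[[1]]$id
resource1 <- resource_show(resource1_id)
print(dataset1)
print(resource1)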

To generate the XML metadata that is linked to on each dataset we use the CKAN-DCAT extension (ckan/ckanext-dcat on GitHub).

From that GitHub documentation, this extension generates a URL like https://{ckan-instance-host}/dataset/{dataset-id}.{format}

So for the case of the Open Government Portal the URL would be https://open.canada.ca/data/dataset/{id}.xml or https://open.canada.ca/data/dataset/719955f2-bf8e-44f7-bc26-6bd623e82884.xml for example.

This XML RDF metadata is at the dataset level but it does contain some information about the resources. If you were to use the resource_show API call, you would see there are more pieces of metadata that exist at the resource level which aren’t available at the dataset level in the RDF.

I'm not clear whether this answer brings us closer to a solution for metadata, but it does provide a solution to access the XML files associated with datasets. I did not dive into your code, however, so I am unsure whether this solution would play nicely with your code.
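To make the URL pattern from the reply concrete, here is a minimal sketch. `dataset_metadata_url()` is a hypothetical helper, not part of rgovcan or ckanr. Only the `.xml` form is confirmed by the reply above; the `format` argument is an assumption based on the `{format}` placeholder in the quoted pattern.

```r
# Hypothetical helper, not part of rgovcan or ckanr.
# Builds https://open.canada.ca/data/dataset/{id}.{format} as quoted in the reply.
dataset_metadata_url <- function(id, format = "xml",
                                 base_url = "https://open.canada.ca/data/dataset") {
  paste0(base_url, "/", id, ".", format)
}

xml_url <- dataset_metadata_url("719955f2-bf8e-44f7-bc26-6bd623e82884")
# xml_url is "https://open.canada.ca/data/dataset/719955f2-bf8e-44f7-bc26-6bd623e82884.xml"
```

The resulting URL could then be handed to an XML reader such as `xml2::read_xml()`, or simply stored alongside each downloaded dataset as a pointer to its metadata.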