Implement collections integration for CKAN search

thejuliekramer commented 4 years ago

User Story

As a data.gov developer I want to isolate the collections feature related to CKAN search so that we can upgrade to CKAN 2.8.

Acceptance Criteria

[ ] WHEN I search for a parent dataset (https://catalog.data.gov/dataset/general-schedule-and-locality-pay) THEN I see it in search results
[ ] WHEN I search for a child dataset (https://catalog.data.gov/dataset/general-schedule-and-locality-pay-1991) THEN I do not see the child dataset in the search results, but I do see the parent dataset (https://catalog.data.gov/dataset/general-schedule-and-locality-pay)
[ ] WHEN I click on a collection THEN I can see all datasets that belong to that collection AND I can search within a collection for datasets that belong to that collection
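
The search behavior in the criteria above could be sketched roughly as follows. This is only an illustration, not the actual extension code: the helper name `exclude_child_datasets` and the constant `CHILD_FILTER` are hypothetical, and it assumes child datasets carry a `collection_package_id` extra that is queryable in Solr under that name. In a CKAN extension this logic would typically sit in an `IPackageController.before_search` hook.

```python
# Hypothetical helper: hides collection children from site-wide searches.
# Assumption: children have a non-empty `collection_package_id` field in Solr.
CHILD_FILTER = '-collection_package_id:["" TO *]'


def exclude_child_datasets(search_params):
    """Return a copy of search_params whose fq excludes child datasets."""
    params = dict(search_params)
    fq = params.get("fq", "")
    # A search inside a collection already names the field, so leave it alone;
    # otherwise children of that collection could never be listed.
    if "collection_package_id" not in fq:
        params["fq"] = (fq + " " + CHILD_FILTER).strip()
    return params


# Site-wide search: the negative filter is appended.
print(exclude_child_datasets({"q": "locality pay"}))
# Search within a collection: fq is passed through unchanged.
print(exclude_child_datasets({"fq": 'collection_package_id:"some-parent-id"'}))
```

With this shape, a parent dataset still matches site-wide queries because it has no `collection_package_id` value, while a collection page can list and search its children by filtering on the parent's id.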

Task-list local dev

[x] Investigate whether and how the other harvesters use isPartOf (i.e. check which metadata fields the harvester saves in order to determine the parent/child relationship for collections: collection_package_id & collection_metadata)
- [x] Geo-dataportal (do this one first)
- [x] WAF
- [x] WAF-collections
- [x] CSW
- [x] pyZ3950
- [x] Arcgis
- [x] Doc
[x] Double-check that there is no dependency on the Harvester ext:
- [x] Yes, we have dependencies. The Data.json extension adds datajson_collection to the source config so that the harvester ext can split the harvest process, handling parents first and then children. The Catalog-next#40 issue already covers this.
[x] Incorporate the "fq commit" that filters out child datasets from site-wide searches
[ ] Create unit tests
[ ] Create integration tests
[ ] Mini-PR to Aaron (if he wants to review tests in this TDD approach)
[x] Create the Collections code as per the analysis
- [x] Check that our assumption is validated -- that the data.json, theme, and geodatagov extensions are not affected (this may be picked up by tests, but this is crucial so putting as a separate task item)
[x] Look into the n+1 bug. Timebox this at 2 hours. If longer then create a separate issue for this.
[ ] Unit tests are green
[ ] Integration tests are green
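
The parents-first, children-second split mentioned in the dependency item above can be illustrated with a small sketch. It is not the harvester's real code: `order_for_harvest` is a hypothetical name, and the only assumption about the data is that each DCAT-US dataset has an `identifier` and that children point at their parent through `isPartOf`.

```python
def order_for_harvest(datasets):
    """Return datasets with collection parents first, everything else after."""
    # Any identifier that another dataset references via isPartOf is a parent.
    parent_ids = {d["isPartOf"] for d in datasets if d.get("isPartOf")}
    parents = [d for d in datasets if d["identifier"] in parent_ids]
    others = [d for d in datasets if d["identifier"] not in parent_ids]
    return parents + others


# Made-up catalog entries, for illustration only.
catalog = [
    {"identifier": "pay-1991", "isPartOf": "pay"},
    {"identifier": "pay"},
    {"identifier": "standalone"},
]
print([d["identifier"] for d in order_for_harvest(catalog)])
# → ['pay', 'pay-1991', 'standalone']
```

Harvesting parents first means the parent package already exists when a child is imported, so the child can store the parent's package id.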

Task-list sandbox

Once we have a functioning Catalog app running on CKAN 2.8 with the following extensions we can do final UAT testing

avdata99 commented 4 years ago

Starting TDD

avdata99 commented 4 years ago

isPartOf is only used for DCAT-US sources. We manage the collection relationship in different ways for each source type.

[x] = upstreamed

DataType	Collection notes	Harvester	Extension
Datajson	Used at the import stage, here	`DataJsonHarvester` inherits from `DatasetHarvesterBase`	datajson
Geo-dataportal	A harvester for CSW servers, with customizations for data.gov	Inherits from `CSWHarvester` and `GeoDataGovHarvester`	geodatagov
WAF [x]	We use `collection_package_id` in both the fork and upstream. WAF harvester here.	`WAFHarvester` inherits from `SpatialHarvester`	spatial
WAF-collections	We use `collection_package_id` in `WAFCollectionHarvester.get_package_dict`, here	`WAFCollectionHarvester` inherits from `GeoDataGovWAFHarvester`, which inherits from `WAFHarvester` and `GeoDataGovHarvester`	spatial
CSW [x]	`CSWHarvester`: in the fork, we have a command in which we add the `collection_package_id` extra. This command exists upstream but does not add the extra. At load pycsw here. It's an internal command. It doesn't exist upstream (for CSW)	`GeoDataGovCSWHarvester` inherits from `CSWHarvester` and `GeoDataGovHarvester`	spatial
Z3950	It's covered by parent classes	`Z3950Harvester` inherits from `GeoDataGovHarvester` -> `SpatialHarvester`	geodatagov
ArcGIS	It's covered by parent classes	`ArcGISHarvester` inherits from `SpatialHarvester`	geodatagov
Doc	It's covered by parent classes	`GeoDataGovDocHarvester` inherits from `DocHarvester` and `GeoDataGovHarvester`	geodatagov

Aaron's Harvest Source Report

thejuliekramer commented 4 years ago

kimwdavidson commented 4 years ago

Duplicate. Closing.