Closed thejuliekramer closed 4 years ago
Starting TDD
isPartOf is only used for DCAT-US sources. We manage the collection relationship in different ways for each source type.
[x] = upstreamed
DataType | Collection notes | Harvester | Extension |
---|---|---|---|
Datajson | Used at import stage here | DataJsonHarvester inherits from DatasetHarvesterBase | datajson |
Geo-dataportal | A harvester for CSW servers (with customizations for data.gov) | Inherits from CSWHarvester and GeoDataGovHarvester | geodatagov |
WAF [x] | We use collection_package_id at the fork and upstream. WAF harvester here. It's also used upstream | WAFHarvester inherits from SpatialHarvester | spatial |
WAF-collections | We use collection_package_id at WAFCollectionHarvester.get_package_dict, here | WAFCollectionHarvester inherits from GeoDataGovWAFHarvester, which inherits from WAFHarvester and GeoDataGovHarvester | spatial |
CSW [x] | CSWHarvester: At the fork, we have a command in which we add the collection_package_id extra. This command exists upstream but doesn't add the extra. At load pycsw here. It's an internal command. It doesn't exist upstream (for CSW) | GeoDataGovCSWHarvester inherits from CSWHarvester and GeoDataGovHarvester | spatial |
Z3950 | It's covered by parent classes | Z3950Harvester inherits from GeoDataGovHarvester -> SpatialHarvester | geodatagov |
ArcGIS | It's covered by parent classes | ArcGISHarvester inherits from SpatialHarvester | geodatagov |
Doc | It's covered by parent classes | GeoDataGovDocHarvester inherits from DocHarvester and GeoDataGovHarvester | geodatagov |
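To make "covered by parent classes" easier to check, here is a minimal sketch of the hierarchy described in the table, using empty stub classes. These are not the real CKAN extension classes. The parents of CSWHarvester and DocHarvester are not given in the table, so making them subclasses of SpatialHarvester is an assumption, as is the order of the base classes.

```python
# Stub classes only: they mirror the class names in the table above,
# not the real harvester implementations.

class DatasetHarvesterBase(object):
    pass

class SpatialHarvester(object):
    pass

class DataJsonHarvester(DatasetHarvesterBase):
    pass

# "GeoDataGovHarvester -> SpatialHarvester" per the Z3950 row.
class GeoDataGovHarvester(SpatialHarvester):
    pass

class WAFHarvester(SpatialHarvester):
    pass

class ArcGISHarvester(SpatialHarvester):
    pass

# Assumption: the table does not state the parent of these two classes.
class CSWHarvester(SpatialHarvester):
    pass

class DocHarvester(SpatialHarvester):
    pass

class GeoDataGovWAFHarvester(WAFHarvester, GeoDataGovHarvester):
    pass

class WAFCollectionHarvester(GeoDataGovWAFHarvester):
    pass

class GeoDataGovCSWHarvester(CSWHarvester, GeoDataGovHarvester):
    pass

class Z3950Harvester(GeoDataGovHarvester):
    pass

class GeoDataGovDocHarvester(DocHarvester, GeoDataGovHarvester):
    pass


def lineage(cls):
    """Return the class names in method resolution order, excluding object."""
    return [c.__name__ for c in cls.__mro__ if c is not object]


def covered_by_spatial(cls):
    """True if the harvester would pick up behaviour from SpatialHarvester."""
    return issubclass(cls, SpatialHarvester)
```

Calling `lineage(WAFCollectionHarvester)` lists the chain a collection-related method would be looked up through, and `covered_by_spatial(DataJsonHarvester)` is False, which matches the table: the datajson path is handled separately from the spatial harvesters.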
Aaron's Harvest Source Report
source_type | total_datasets | count |
---|---|---|
waf-collection | 781731 | 379 |
datajson | 72583 | 150 |
waf | 32333 | 466 |
ckan | 26021 | 2 |
csw | 398 | 7 |
z3950 | 177 | 3 |
single-doc | 3 | 16 |
geoportal | 0 | 5 |
arcgis | 0 | 5 |
Grand Total | 913246 | 1033 |
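As a quick sanity check on the report, the sketch below re-adds the rows and computes each source type's share of datasets. It assumes the `count` column is the number of harvest sources of that type; the names `REPORT` and `dataset_share` are made up for this example.

```python
# Rows copied from the report above: source_type -> (total_datasets, count).
# Assumption: "count" is the number of harvest sources of that type.
REPORT = {
    "waf-collection": (781731, 379),
    "datajson": (72583, 150),
    "waf": (32333, 466),
    "ckan": (26021, 2),
    "csw": (398, 7),
    "z3950": (177, 3),
    "single-doc": (3, 16),
    "geoportal": (0, 5),
    "arcgis": (0, 5),
}

total_datasets = sum(datasets for datasets, _ in REPORT.values())
total_sources = sum(sources for _, sources in REPORT.values())


def dataset_share(source_type):
    """Percentage of all harvested datasets that come from one source type."""
    return 100.0 * REPORT[source_type][0] / total_datasets


if __name__ == "__main__":
    for name in REPORT:
        print("%-15s %6.2f%%" % (name, dataset_share(name)))
```

The sums agree with the Grand Total row, and waf-collection alone accounts for roughly 86% of datasets, which is why the collection behaviour for that source type matters most for the upgrade.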
Created separate issue for dataset count N+1 bug https://github.com/GSA/datagov-ckan-multi/issues/337
Duplicate. Closing.
User Story
As a data.gov developer I want to isolate the collections feature related to CKAN search so that we can upgrade to CKAN 2.8.
Acceptance Criteria
[ ] WHEN I search for a parent dataset (https://catalog.data.gov/dataset/general-schedule-and-locality-pay) THEN I see it in search results
[ ] WHEN I search for a child dataset (https://catalog.data.gov/dataset/general-schedule-and-locality-pay-1991) THEN I do not see the child dataset in the search results, but I do see the parent dataset in the results (https://catalog.data.gov/dataset/general-schedule-and-locality-pay)
[ ] WHEN I click on a collection THEN I can see all datasets that belong to that collection AND I can search within a collection for datasets that belong to that collection
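The three criteria above can be sketched against plain in-memory dicts. This is only an illustration of the expected behaviour, not how CKAN or Solr implements search. It assumes a child dataset carries a `collection_package_id` extra holding its parent's id; the dataset ids, titles, and function names are hypothetical.

```python
def extras_of(pkg):
    """Flatten a CKAN-style extras list into a plain dict."""
    return {e["key"]: e["value"] for e in pkg.get("extras", [])}


def search(datasets, query):
    """Top-level search: a matching child is replaced by its parent."""
    by_id = {d["id"]: d for d in datasets}
    results, seen = [], set()
    for d in datasets:
        if query.lower() not in d["title"].lower():
            continue
        parent_id = extras_of(d).get("collection_package_id")
        hit = by_id.get(parent_id) if parent_id else d
        if hit is not None and hit["id"] not in seen:
            seen.add(hit["id"])
            results.append(hit)
    return results


def search_collection(datasets, parent_id, query=""):
    """Search only within the children of one collection."""
    return [
        d for d in datasets
        if extras_of(d).get("collection_package_id") == parent_id
        and query.lower() in d["title"].lower()
    ]


# Hypothetical sample data modelled on the URLs in the criteria.
DATASETS = [
    {"id": "gs-pay", "title": "General Schedule and Locality Pay", "extras": []},
    {"id": "gs-pay-1991", "title": "General Schedule and Locality Pay 1991",
     "extras": [{"key": "collection_package_id", "value": "gs-pay"}]},
    {"id": "other", "title": "Unrelated dataset", "extras": []},
]
```

With this data, `search(DATASETS, "1991")` returns only the parent, and `search_collection(DATASETS, "gs-pay", "1991")` returns the 1991 child. Something along these lines could seed the TDD cases before they are wired to the real search.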
Create tickets for below:
Task-list local dev
isPartOf
(i.e., check which metadata fields are saved by the harvester in order to determine the parent/child relationship for collections: collection_package_id & collection_metadata)
Task-list sandbox
Once we have a functioning Catalog app running on CKAN 2.8 with the following extensions, we can do final UAT
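For the local-dev task of checking which collection fields the harvester saved, a small helper like the one below could be run against a harvested package dict. It assumes the usual CKAN shape where `extras` is a list of `{"key": ..., "value": ...}` entries; the helper name and the sample values are hypothetical.

```python
# The two extras named in the task list above.
COLLECTION_KEYS = ("collection_package_id", "collection_metadata")


def collection_fields(pkg_dict):
    """Return only the collection-related extras present on a package dict."""
    extras = {e["key"]: e["value"] for e in pkg_dict.get("extras", [])}
    return {key: extras[key] for key in COLLECTION_KEYS if key in extras}


# Hypothetical harvested package, for illustration only.
sample_pkg = {
    "name": "example-child-dataset",
    "extras": [
        {"key": "collection_package_id", "value": "example-parent-id"},
        {"key": "harvest_source_title", "value": "Example source"},
    ],
}
```

An empty result for a dataset that should belong to a collection would point at the harvester for that source type not saving the extra.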