Closed: adborden closed this issue 3 years ago
curl -s 'https://catalog.sandbox.datagov.us/api/action/package_search?fq=collection_package_id:*&rows=0' | jq .result.count
155
curl -s 'https://catalog-next.sandbox.datagov.us/api/action/package_search?fq=collection_package_id:*&rows=0' | jq .result.count
1403
@adborden What do we want the final result of this query to be? It ignores datasets that are not part of a collection. Do we want all datasets, including children, or only datasets that are part of a collection?
Testing this query locally, I can see that we count the child datasets of collection sources correctly, but we do not count non-collection datasets at all.
There are two harvest sources here: one is not a collection and the other is a collection.
The non-collection harvest source is not counted by this query because its datasets don't have a collection_package_id.
The collection harvest source is counted, but only its children, because only the children have a collection_package_id.
Is the purpose of this query to count all datasets, including non-collection datasets? If so, we need to fetch this number some other way than filtering on collection_package_id. I think the main discrepancy in the numbers, though, is just that a lot of older collection datasets are no longer harvesting. The count this query now returns is 95k.
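To make the distinction concrete, here is a rough sketch of how the three counts (all datasets, collection children, everything else) could be fetched side by side. The helper names `search_url` and `count_datasets` are made up for illustration, and the negated filter `-collection_package_id:*` is an assumption about what this CKAN/Solr setup accepts. Whether the unfiltered count includes collection children also depends on the deployment's search configuration, which is not confirmed here.

```shell
# Build a package_search URL that returns only the match count (rows=0).
# $1 = CKAN base URL, $2 = optional Solr filter query (fq).
# Hypothetical helper, not part of CKAN or the catalog repos.
search_url() {
  base=$1
  fq=${2:-}
  url="$base/api/action/package_search?rows=0"
  if [ -n "$fq" ]; then
    url="$url&fq=$fq"
  fi
  printf '%s\n' "$url"
}

# Fetch the count for one filter. Requires curl and jq.
count_datasets() {
  curl -s "$(search_url "$1" "${2:-}")" | jq .result.count
}

# Usage (these make network calls):
#   count_datasets https://catalog.data.gov                             # no filter
#   count_datasets https://catalog.data.gov 'collection_package_id:*'   # collection children
#   count_datasets https://catalog.data.gov '-collection_package_id:*'  # assumed: non-members
```

If the negated filter works, the second and third numbers should add up to the total for the same index, which would be a quick sanity check on either catalog.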
@thejuliekramer this query should show you all the collections and their counts.
$ curl 'https://catalog.data.gov/api/action/package_search?fq=collection_package_id:*&facet.field=\["collection_package_id"\]&facet.limit=-1&facet.mincount=2&rows=0'
I could not get the facet.sort parameter to work 🤷
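One possible workaround is to sort the facet counts on the client side instead. The sketch below assumes the response has been saved to a file and that the counts sit under `.result.facets.collection_package_id` as an id-to-count object, which is how CKAN's package_search normally returns them. The function name `top_collections` is hypothetical.

```shell
# Print the largest collections from a saved package_search response.
# $1 = path to the JSON response, $2 = how many rows to print (default 10).
# Output is one line per collection: count, a tab, then the parent package id.
# Requires jq 1.5 or later (for --argjson).
top_collections() {
  jq -r --argjson n "${2:-10}" '
    .result.facets.collection_package_id
    | to_entries            # {id: count} -> [{key: id, value: count}, ...]
    | sort_by(-.value)      # largest count first
    | .[0:$n]
    | .[]
    | "\(.value)\t\(.key)"
  ' "$1"
}

# Usage:
#   save the output of the facet query above to facets.json, then
#   top_collections facets.json 20
```

Sorting locally avoids depending on how the server handles facet.sort, at the cost of downloading the full facet list once.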
This dataset has almost 2 million children. I am trying to see whether we can harvest the harvest source it comes from, but I am having trouble accessing it.
dataset: https://catalog.data.gov/dataset?collection_package_id=a3be5059-5495-4f95-9a14-b03deb9eade0
harvest source: https://catalog.data.gov/harvest/79036357-a26a-4bd9-892b-f2ae110322c3
Here is the list of datasets with the biggest collections (id is the parent dataset id): list biggest collections.txt
Other big collections
Collection | Children WEB | Children DB | Harvest Source |
---|---|---|---|
Lidar Point Cloud - USGS National Map 3DEP Downloadable Data Collection | 1,139,371 | 1,126,190 | USGS Lidar Point Cloud LAS Harvest Source |
USGS public distribution of FSA 10:1 NAIP Imagery Downloadable Data Collection from The National Map | 218,702 | 368,310 | FSA 10:1 NAIP Imagery Collection |
PKG ID 949b7c71-6913-45f1-85b9-2a4a3dd6e2bc (maybe deleted or not published) | ? | 352,778 | |
USGS Historical Topographic Map Collection | 84 | 341,816 | USGS Historical Topographic Maps |
I can't get to the harvest source links in catalog to get the URLs, and I also don't see them in the spreadsheet we've been using to track progress. @hkdctol are you able to reach out to get updated harvest source URLs for these big ones?
@adborden I think this ticket is done for now. We are counting the number correctly; we just no longer have access to these big sources. If we get new harvest source info, we can add them to catalog-next.
Added the information here to this issue so we don't lose track: https://github.com/GSA/datagov-ckan-multi/issues/510
@thejuliekramer the Original Product Resolution one is confirmed by USGS. Tracking down the details on the others.
Harvest source name - usgs-ned-original-product-resolution-opr-downloadable-data-collection
@hkdctol I don't see the Original Product Resolution one in this spreadsheet. Did they confirm it via email?
I updated the main Harvest Source Progress sheet with the large harvest sources that are causing the discrepancy. They have not been imported yet because I get a timeout from catalog when trying to import them. We will add these harvest sources manually once we are able to log in to production catalog-next.
catalog-classic shows ~4M
catalog-next shows 27K
How to reproduce
Expected behavior
Shows all datasets, including collection members.
Actual behavior
Showing only 27k datasets.