GSA / datagov-ckan-multi

package_search over collection member datasets shows only 27k #520

Closed adborden closed 3 years ago

adborden commented 4 years ago

catalog-classic shows ~4M

$ curl -s 'https://catalog.data.gov/api/action/package_search?fq=collection_package_id:*&rows=0' | jq .result.count
4190268

catalog-next shows 27K

$ curl -s 'https://catalog-next.data.gov/api/action/package_search?fq=collection_package_id:*&rows=0' | jq .result.count
27786

How to reproduce

  1. https://catalog-next.data.gov/api/action/package_search?fq=collection_package_id:*&rows=0

Expected behavior

Shows all datasets, including collection members.

Actual behavior

Showing only 27k datasets.
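The comparison above can also be scripted. This is a minimal sketch, assuming the standard CKAN `package_search` response shape (`success` and `result.count`). The helper names `search_url` and `count_from_response` are hypothetical, and the sample response is built from the catalog-next count quoted above rather than fetched live.

```python
import json
from urllib.parse import urlencode


def search_url(base, fq="collection_package_id:*"):
    """Build a package_search URL that returns only the match count (rows=0)."""
    query = urlencode({"fq": fq, "rows": 0})
    return f"{base}/api/action/package_search?{query}"


def count_from_response(body):
    """Extract result.count from a package_search JSON response body."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise ValueError("package_search reported failure")
    return payload["result"]["count"]


# Canned response using the catalog-next count quoted above (no network call).
sample = json.dumps({"success": True, "result": {"count": 27786, "results": []}})
print(search_url("https://catalog-next.data.gov"))
print(count_from_response(sample))
```

Passing a different `base` (for example the catalog-classic host) gives the URL for the other side of the comparison.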

thejuliekramer commented 3 years ago

$ curl -s 'https://catalog.sandbox.datagov.us/api/action/package_search?fq=collection_package_id:*&rows=0' | jq .result.count
155

$ curl -s 'https://catalog-next.sandbox.datagov.us/api/action/package_search?fq=collection_package_id:*&rows=0' | jq .result.count
1403

thejuliekramer commented 3 years ago

@adborden What do we want the final result of this query to be? It ignores datasets that are not part of a collection. Do we want to get all datasets, including children, or just datasets that are part of a collection?

thejuliekramer commented 3 years ago

Testing this query locally, I can see that we are counting the child datasets of collection sources correctly, but we are not counting non-collection datasets at all.

There are two harvest sources here: one is not a collection and the other is.

[Screenshot: 2020-12-15, 10:34:06 AM]

The non-collection harvest source is not counted by this query because its datasets don't have a collection_package_id.

The collection harvest source does get counted - but only the children because only the children have a collection_package_id
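To illustrate why only the children are counted, here is a minimal sketch with made-up records. It assumes `collection_package_id` shows up as a plain key on each dataset. The records and the helper `matches_collection_filter` are hypothetical, and the real filtering is done by the search index, not in Python.

```python
def matches_collection_filter(dataset):
    """Mimic fq=collection_package_id:* by matching only records where the field has a value."""
    return bool(dataset.get("collection_package_id"))


# Hypothetical records: one standalone dataset, one collection parent, two children.
datasets = [
    {"name": "standalone", "collection_package_id": None},
    {"name": "parent", "id": "p1"},
    {"name": "child-a", "collection_package_id": "p1"},
    {"name": "child-b", "collection_package_id": "p1"},
]

matched = [d["name"] for d in datasets if matches_collection_filter(d)]
print(matched)  # ['child-a', 'child-b']
```

Neither the standalone dataset nor the parent carries the field, so both fall outside the count.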

[Screenshot: 2020-12-15, 10:33:49 AM]

thejuliekramer commented 3 years ago

Is the purpose of this query to count all datasets, including non-collection datasets? If so, we need to fetch this number some other way than through collection_package_id. I think the main discrepancy in the numbers, though, is just that a lot of older collection datasets are no longer harvesting. The query now returns 95k.

[Screenshot: 2020-12-15, 10:45:01 AM]

adborden commented 3 years ago

@thejuliekramer this query should show you all the collections and their counts.

$ curl 'https://catalog.data.gov/api/action/package_search?fq=collection_package_id:*&facet.field=\["collection_package_id"\]&facet.limit=-1&facet.mincount=2&rows=0'

I could not get the facet.sort parameter to work 🤷
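Since `facet.sort` did not cooperate, the facet counts can be sorted client-side instead. The sketch below assumes the response carries CKAN's `search_facets` structure (a list of `items` with `name` and `count`). The helper names `facet_url` and `largest_collections` are hypothetical, and the ids and counts in the sample are invented for illustration.

```python
import json
from urllib.parse import urlencode


def facet_url(base, field="collection_package_id"):
    """Build the faceted package_search URL used above (facet.field is a JSON list)."""
    params = {
        "fq": f"{field}:*",
        "facet.field": json.dumps([field]),
        "facet.limit": -1,
        "facet.mincount": 2,
        "rows": 0,
    }
    return f"{base}/api/action/package_search?{urlencode(params)}"


def largest_collections(payload, field="collection_package_id", top=10):
    """Return (collection id, child count) pairs, largest first."""
    items = payload["result"]["search_facets"][field]["items"]
    ranked = sorted(items, key=lambda item: item["count"], reverse=True)
    return [(item["name"], item["count"]) for item in ranked[:top]]


# Hypothetical response fragment; ids and counts are invented for illustration.
sample = {
    "result": {
        "search_facets": {
            "collection_package_id": {
                "title": "collection_package_id",
                "items": [
                    {"name": "id-b", "display_name": "id-b", "count": 40},
                    {"name": "id-a", "display_name": "id-a", "count": 900},
                    {"name": "id-c", "display_name": "id-c", "count": 7},
                ],
            }
        }
    }
}
print(largest_collections(sample, top=2))
```

Sorting after the fact keeps the request identical to the curl command above and avoids depending on server-side sort support.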

thejuliekramer commented 3 years ago

This dataset has almost 2 million children. I am trying to see if we can harvest the harvest source it comes from, but I am having trouble accessing it.

dataset: https://catalog.data.gov/dataset?collection_package_id=a3be5059-5495-4f95-9a14-b03deb9eade0

harvest source: https://catalog.data.gov/harvest/79036357-a26a-4bd9-892b-f2ae110322c3

[Screenshot: 2020-12-16, 12:00:58 PM]

thejuliekramer commented 3 years ago

Here is the list of datasets with the biggest collections (id is the parent dataset id): list biggest collections.txt

avdata99 commented 3 years ago

Other big collections

| Collection | Children WEB | Children DB | Harvest Source |
| --- | --- | --- | --- |
| Lidar Point Cloud - USGS National Map 3DEP Downloadable Data Collection | 1,139,371 | 1,126,190 | USGS Lidar Point Cloud LAS Harvest Source |
| USGS public distribution of FSA 10:1 NAIP Imagery Downloadable Data Collection from The National Map | 218,702 | 368,310 | FSA 10:1 NAIP Imagery Collection |
| PKG ID 949b7c71-6913-45f1-85b9-2a4a3dd6e2bc (maybe deleted or not published) | ? | 352,778 | |
| USGS Historical Topographic Map Collection | 84 | 341,816 | USGS Historical Topographic Maps |

thejuliekramer commented 3 years ago

I can't get to the harvest source links in catalog to get the URLs, and I also don't see them in the spreadsheet we've been using to track progress. @hkdctol are you able to reach out to get updated harvest source URLs for these big ones?

thejuliekramer commented 3 years ago

@adborden I think this ticket is done for now. We are counting the number correctly; we just no longer have access to these big sources. If we get new harvest source info, we can add them to catalog-next.

thejuliekramer commented 3 years ago

Added the information here to this issue so we don't lose track: https://github.com/GSA/datagov-ckan-multi/issues/510

hkdctol commented 3 years ago

@thejuliekramer the Original Product Resolution one is confirmed by USGS. Tracking down the details on the others.

thejuliekramer commented 3 years ago

Harvest source name - usgs-ned-original-product-resolution-opr-downloadable-data-collection

thejuliekramer commented 3 years ago

@hkdctol I don't see the original product resolution one in this spreadsheet - did they confirm it via email?

thejuliekramer commented 3 years ago

I updated the main Harvest Source Progress sheet with the large harvest sources that are causing the discrepancy. They have not been imported yet because I get a timeout from catalog when trying to import them. We will add these harvest sources manually once we are able to log in to production catalog-next.