Closed krisstanton closed 3 months ago
@krisstanton, you can simply use the CMR Search API (HTTP).
For example, to find a granule in a collection, use this:
curl -s "https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=WV03_MSI_L1B&version=1&granule_ur=WV03_20160111181513_1040010017C40C00_16JAN11181513-M1BS-505548771010_01_P008"
You'll get a JSON response with a "hits" keyword, indicating how many granules in the collection with that granule id were found. You should get either 0 (not found) or 1 (found).
To spit out only the value of "hits":
curl -s "https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=WV03_MSI_L1B&version=1&granule_ur=WV03_20160111181513_1040010017C40C00_16JAN11181513-M1BS-505548771010_01_P008" | jq .hits
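The same existence check can be scripted. Here is a stdlib-only Python sketch (the function names are mine, not part of any existing tool) that builds the exact-match lookup URL from the curl example above and tests whether `hits` equals 1:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CMR_GRANULES_URL = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"

def existence_query(short_name, version, granule_ur):
    # Build the same exact-match lookup URL as the curl example above.
    params = {"short_name": short_name, "version": version, "granule_ur": granule_ur}
    return f"{CMR_GRANULES_URL}?{urlencode(params)}"

def granule_is_published(short_name, version, granule_ur):
    # True when CMR reports exactly one hit for this granule UR.
    with urlopen(existence_query(short_name, version, granule_ur)) as resp:
        return json.load(resp)["hits"] == 1
```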
Is this the process that we should be looking at before deleting from the old NGAP account? Yes :)
Update: More detailed steps from Ticket #347: csdap means CBA; csda means old NGAP. Helper link to the API docs: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

WIP Update:
Been doing some dev work around making these GET requests.
It looks like these params are either not working properly in the CMR search, or somewhere there is data missing that would make these fields searchable: temporal, and sort_key by start_date.
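For reference, the CMR API documents `temporal` as a comma-separated ISO-8601 range and `sort_key` as a field name (prefix with `-` for descending, e.g. `-start_date`). A sketch of the kind of request I would expect to work, with a hypothetical date range:

```python
from urllib.parse import urlencode

CMR = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"

params = {
    "collection_concept_id": "C2497404794-CSDA",
    # Hypothetical one-year window; CMR expects "start,end" in ISO-8601.
    "temporal": "2016-01-01T00:00:00Z,2016-12-31T23:59:59Z",
    "sort_key": "start_date",
    "page_size": 2000,
}
url = f"{CMR}?{urlencode(params)}"
```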
Be more specific. What exactly are you trying (exact requests) and what problems are you having?
Hey Chuck, thanks for offering to help. I saw your message after I had already resolved the issue.
I'm going to use the reply to your message as an update to this ticket as well.
In short here is what I needed to do and how I solved it.
I need to know which files are fully published to Earthdata on the new CBA accounts.
One way to do this is to query the CMR (as described earlier) in order to get (and page through) the granules.
Note: In order to differentiate between the OLD NGAP and NEW CBA accounts, I'm examining all of the s3 URLs on each granule. This serves as a way to confirm that any given file sourced from a NEW CBA s3 URL has been published. I'm essentially composing a list of all of those files.
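In UMM-G responses the granule's URLs live under `umm.RelatedUrls`, so the filtering step above can be sketched like this (the bucket prefix below is hypothetical; substitute the real NEW CBA bucket):

```python
# Hypothetical NEW CBA bucket prefix -- replace with the actual one.
NEW_CBA_PREFIX = "s3://example-cba-bucket/"

def cba_s3_urls(granule_item):
    # Pull every URL off a UMM-G granule record and keep only those
    # sourced from the NEW CBA s3 bucket.
    urls = [ru.get("URL", "") for ru in granule_item["umm"].get("RelatedUrls", [])]
    return [u for u in urls if u.startswith(NEW_CBA_PREFIX)]
```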
The first problem I ran into was that a plain collection search like this:
https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497404794-CSDA&page_num=1&page_size=2000
will only allow a search depth of 1 million granules. So when we have a collection with more than one million granules, we hit the end of the readable data on page 500 (when the page size is maxed out at 2000, as in the example above).
The second problem I ran into was with trying to come up with sub-searches that would actually break the collection dataset into smaller pieces (less than 1 million granules). I had tried to use temporal search, which did not seem to change the results at all. I tried a few other field search methods with no success as well.
Finally what worked was to break up the search by geographic region.
So if we look at the example of WV03_Pan_L1B, running this query
https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&page_num=1&page_size=5
should give "hits":2528045, which means 2,528,045 total granule results.
When adding the bounding box parameter and breaking this up into 4 separate 'searches', I can get all of the granules, because each part is less than 1 million granules. In this case the bounding box is simple: only 4 quadrants of Earth are needed. In other datasets I had to break the box into smaller pieces (more 'searches' to achieve this, but it works just as well)!
Here are examples of a functioning GET request URL which gives me the data we need and returns under 1 million granules per search for WV03_Pan_L1B.
SouthWest "-180,-90,0,0" "hits":276577 https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&bounding_box=-180,-90,0,0&page_num=1&page_size=2000
NorthWest "-180,0,0,90" "hits":764563 https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&bounding_box=-180,0,0,90&page_num=1&page_size=2000
SouthEast "0,-90,180,0" "hits":596318 https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&bounding_box=0,-90,180,0&page_num=1&page_size=2000
NorthEast "0,0,180,90" "hits":896207 https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&bounding_box=0,0,180,90&page_num=1&page_size=2000
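The four quadrant queries above can be generated programmatically. A sketch (the function name is mine):

```python
from urllib.parse import urlencode

CMR = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"

# The four quadrants of Earth as CMR bounding boxes (west,south,east,north).
QUADRANTS = {
    "SouthWest": "-180,-90,0,0",
    "NorthWest": "-180,0,0,90",
    "SouthEast": "0,-90,180,0",
    "NorthEast": "0,0,180,90",
}

def quadrant_urls(collection_concept_id, page_size=2000):
    # One search URL per quadrant, each expected to stay under the
    # 1-million-granule search depth for this collection.
    return {
        name: f"{CMR}?" + urlencode({
            "collection_concept_id": collection_concept_id,
            "bounding_box": box,
            "page_num": 1,
            "page_size": page_size,
        })
        for name, box in QUADRANTS.items()
    }
```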
I noticed that an occasional duplicate granule (and its list of files) gets captured this way, so I have another routine which recombines all these URLs and uses a set() to force them to be unique on a per-dataset basis.
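The duplicates are visible in the hit counts: the quadrant totals sum to 276,577 + 764,563 + 596,318 + 896,207 = 2,533,665, slightly more than the 2,528,045 collection total, because granules straddling quadrant boundaries are returned more than once. The recombine step can be sketched as:

```python
def unique_urls(*url_lists):
    # Merge per-quadrant URL lists and drop boundary duplicates with a set.
    seen = set()
    for urls in url_lists:
        seen.update(urls)
    return seen
```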
Closing this ticket now. Even though I'm still in the process of pulling the data down, the scope of this ticket, which was coming up with a solution for this problem, is now complete.
@krisstanton, for deep paging of CMR searches, you should not use the standard paging mechanism, which not only has the limitations that you mentioned, but also negatively impacts performance for all CMR searches by all users occurring concurrently with your search.
Instead, you should use the Search After mechanism.
@krisstanton, here's code for using "Search After":
```python
import requests
from itertools import chain, islice


def find_granules(**params):
    def pages(search_after):
        if search_after:
            r = requests.get(
                url, params=params, headers={"cmr-search-after": search_after}
            )
            r.raise_for_status()
            return r.json()["items"], r.headers.get("cmr-search-after")
        return None

    def granules(first_page, search_after):
        yield from first_page
        yield from chain.from_iterable(unfold(pages, search_after))

    url = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"
    params = {**params, "page_size": 2_000}
    r = requests.get(url, params=params)
    r.raise_for_status()
    return (
        r.headers["cmr-hits"],
        granules(r.json()["items"], r.headers.get("cmr-search-after")),
    )


def unfold(f, x):
    while (t := f(x)):
        y, x = t
        yield y


if __name__ == "__main__":
    # granules is a generator, not a list, so this will not explode
    hits, granules = find_granules(short_name="WV03_MSI_L1B", version="1")
    print(f"hits: {hits}")

    # Each granule is a UMM item like each element in the "items" list in this
    # example: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#umm-json
    # Since the collection can be huge, we're just printing the first 20 granule IDs.
    for granule in islice(granules, 20):
        # Do whatever you like with each granule here (call a function with each)
        print(granule["umm"]["GranuleUR"])
```
If you want to process all granules in a collection, use `for granule in granules:` instead of `for granule in islice(granules, 20):`.
Find a programmatic way to independently verify (with Earthdata) that a granule was in fact ingested. The problem we have is that a Cumulus bug may have caused conditions where granules exist in the endpoint bucket but are in fact NOT published in Earthdata. We need to ensure these granules (and files) are NOT deleted from MCP while doing the final steps of the migration ingests (after we use the manifests and inventories).
Possible place to start: