NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Find a programmatic way to independently verify (with Earthdata) that a granule was in fact ingested #341

Closed krisstanton closed 3 months ago

krisstanton commented 5 months ago

Find a programmatic way to independently verify (with Earthdata) that a granule was in fact ingested. The problem we have is that a Cumulus bug may have caused conditions where granules exist in the endpoint bucket but are in fact NOT published in Earthdata. We need to ensure these granules (and files) are NOT deleted from MCP while doing the final steps of the migration ingests (after we use the manifests and inventories).

Possible place to start:

chuckwondo commented 5 months ago

@krisstanton, you can simply use the CMR Search API (HTTP).

For example, to find a granule in a collection, use this:

curl -s "https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=WV03_MSI_L1B&version=1&granule_ur=WV03_20160111181513_1040010017C40C00_16JAN11181513-M1BS-505548771010_01_P008"

You'll get a JSON response with a "hits" key, indicating how many granules in the collection with that granule UR were found. You should get either 0 (not found) or 1 (found).

To spit out only the value of "hits":

curl -s "https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=WV03_MSI_L1B&version=1&granule_ur=WV03_20160111181513_1040010017C40C00_16JAN11181513-M1BS-505548771010_01_P008" | jq .hits

hbparache commented 3 months ago

Is this the process that we should be looking at before deleting from the old NGAP account? Yes :)

krisstanton commented 3 months ago

Update: More detailed steps from Ticket #347

krisstanton commented 3 months ago

WIP Update: I've been doing some dev work around making these GET requests. It looks like these params are either not working properly in the CMR search, or data is missing somewhere that would make these fields searchable: temporal and sort_key by start_date.

chuckwondo commented 3 months ago

Be more specific. What exactly are you trying (exact requests) and what problems are you having?

krisstanton commented 3 months ago

Hey Chuck, thanks for offering to help. I saw your message after I had already resolved the issue.

I'm going to use the reply to your message as an update to this ticket as well.

In short here is what I needed to do and how I solved it.

I need to know which files are fully published to Earthdata on the new CBA accounts. One way to do this is to query the CMR (as described earlier) to get, and page through, the granules. Note: to differentiate between OLD NGAP and NEW CBA, I'm examining all of the s3 URLs on each granule. This confirms that any given file sourced from a NEW CBA s3 URL has been published; I'm essentially composing a list of all of those files. The first problem I ran into was that just doing a collection search like this:

https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497404794-CSDA&page_num=1&page_size=2000

will only allow a search depth of 1 million granules. So when a collection has more than one million granules, we hit the end of the retrievable data at page 500 (when the page size is maxed out at 2000, as in the example above).
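
(As an aside, a quick way to see up front whether a collection exceeds that depth is to read the "hits" value from a single small request; a minimal sketch assuming the requests library:)

import requests

def collection_hits(collection_concept_id):
    """Return the total granule count CMR reports for a collection."""
    r = requests.get(
        "https://cmr.earthdata.nasa.gov/search/granules.umm_json",
        params={"collection_concept_id": collection_concept_id, "page_size": 1},
    )
    r.raise_for_status()
    return r.json()["hits"]

# More than 1,000,000 hits means the search has to be split up somehow.
print(collection_hits("C2497404794-CSDA"))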

The second problem I ran into was trying to come up with sub-searches that would actually break the collection dataset into smaller pieces (fewer than 1 million granules each). I tried a temporal search, which did not seem to change the results at all, and a few other field searches with no success either.

Finally what worked was to break up the search by geographic region.

So if we look at the example of WV03_Pan_L1B, running this query

https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&page_num=1&page_size=5

should give "hits":2528045, which means 2,528,045 total granule results.

By adding the bounding_box parameter and breaking this up into 4 separate searches, I can get all of the granules, because each part returns fewer than 1 million granules. In this case the bounding boxes are simple: only the 4 quadrants of Earth are needed. For other datasets I had to break the box into smaller pieces (more searches), but it works just as well!

Here are examples of functioning GET request URLs which give me the data we need and return under 1 million granules per search for WV03_Pan_L1B.

    SouthWest "-180,-90,0,0"    "hits":276577   https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&bounding_box=-180,-90,0,0&page_num=1&page_size=2000
    NorthWest "-180,0,0,90"     "hits":764563   https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&bounding_box=-180,0,0,90&page_num=1&page_size=2000
    SouthEast "0,-90,180,0"     "hits":596318   https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&bounding_box=0,-90,180,0&page_num=1&page_size=2000
    NorthEast "0,0,180,90"      "hits":896207   https://cmr.earthdata.nasa.gov/search/granules.umm_json?collection_concept_id=C2497431983-CSDA&bounding_box=0,0,180,90&page_num=1&page_size=2000

I noticed that an occasional duplicate granule (and its list of files) gets captured this way, so I have another routine that recombines all of these URLs and uses a set() to force them to be unique on a per-dataset basis.
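
Here is a sketch of that quadrant-splitting plus set() dedup approach (assuming the requests library; the bounding boxes are the four listed above, and pulling s3 URLs out of each record's RelatedUrls is my assumption about the UMM-G shape of these granules):

import requests

CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"

# The four quadrants of Earth used above; denser collections need smaller boxes.
QUADRANTS = ["-180,-90,0,0", "-180,0,0,90", "0,-90,180,0", "0,0,180,90"]

def collect_s3_urls(collection_concept_id):
    """Collect every s3 URL across all quadrants, deduplicated with a set,
    since a granule straddling a quadrant boundary can show up in more than one search."""
    urls = set()
    for bounding_box in QUADRANTS:
        page_num = 1
        while True:
            r = requests.get(
                CMR_GRANULE_SEARCH,
                params={
                    "collection_concept_id": collection_concept_id,
                    "bounding_box": bounding_box,
                    "page_num": page_num,
                    "page_size": 2000,
                },
            )
            r.raise_for_status()
            items = r.json()["items"]
            if not items:
                break
            for item in items:
                # Assumption: each UMM-G record lists its file URLs under RelatedUrls.
                for related in item["umm"].get("RelatedUrls", []):
                    if related.get("URL", "").startswith("s3://"):
                        urls.add(related["URL"])
            page_num += 1
    return urls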

Closing this ticket now. Even though I'm still in the process of pulling the data down, the scope of this ticket (coming up with a solution for this problem) is complete.

chuckwondo commented 3 months ago

@krisstanton, for deep paging of CMR searches, you should not use the standard paging mechanism, which not only has the limitations that you mentioned, but also negatively impacts performance for all CMR searches by all users occurring concurrently with your search.

Instead, you should use the Search After mechanism.

chuckwondo commented 3 months ago

@krisstanton, here's code for using "Search After":

import requests
from itertools import chain, islice

def find_granules(**params):
    """Return (hits, generator of granules) for a CMR granule search,
    paging with the "Search After" header rather than page_num."""

    def pages(search_after):
        # Fetch the next page for as long as CMR keeps returning a cmr-search-after token.
        if search_after:
            r = requests.get(
                url, params=params, headers={"cmr-search-after": search_after}
            )
            r.raise_for_status()
            return r.json()["items"], r.headers.get("cmr-search-after")

        return None

    def granules(first_page, search_after):
        yield from first_page
        yield from chain.from_iterable(unfold(pages, search_after))

    url = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"
    params = {**params, "page_size": 2_000}
    r = requests.get(url, params=params)
    r.raise_for_status()

    return (
        r.headers["cmr-hits"],
        # The cmr-search-after header is absent when everything fits on one page, so use .get.
        granules(r.json()["items"], r.headers.get("cmr-search-after")),
    )

def unfold(f, x):
    # Repeatedly apply f to the evolving state, yielding each page of items
    # until f returns None (no more pages).
    while (t := f(x)):
        y, x = t
        yield y

if __name__ == "__main__":
    # granules is a generator, not a list, so this will not explode
    hits, granules = find_granules(short_name="WV03_MSI_L1B", version="1")
    print(f"hits: {hits}")

    # Each granule is a UMM item like each element in the items list in this example:
    # https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#umm-json
    # Since the collection can be huge, we're just printing the first 20 granule IDs.
    for granule in islice(granules, 20):
        # Do whatever you like with each granule here (call a function with each)
        print(granule["umm"]["GranuleUR"])

If you want to process all granules in a collection, use for granule in granules: instead of for granule in islice(granules, 20):.
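
To tie this back to the CBA verification above, a possible usage sketch on top of find_granules (the bucket prefix below is a placeholder, not a real bucket name; RelatedUrls is the UMM-G field carrying the file URLs):

# Placeholder prefix; substitute the actual CBA endpoint bucket.
CBA_S3_PREFIX = "s3://example-cba-protected-bucket/"

hits, granules = find_granules(collection_concept_id="C2497431983-CSDA")
published_files = {
    related["URL"]
    for granule in granules
    for related in granule["umm"].get("RelatedUrls", [])
    if related.get("URL", "").startswith(CBA_S3_PREFIX)
}
print(f"{len(published_files)} files confirmed published from the CBA bucket")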