NASA-Openscapes / earthdata-cloud-cookbook

A tutorial book of workflows for research using NASA EarthData in the Cloud created by the NASA-Openscapes team
https://nasa-openscapes.github.io/earthdata-cloud-cookbook

earthaccess and R - Jan 19 summary & next steps #161

Open jules32 opened 1 year ago

jules32 commented 1 year ago

At today's coworking @BriannaLind @betolink @andypbarrett and I reviewed earthaccess and how to use it with R via reticulate, and next steps forward. The following are notes we can turn into concrete "todo" issues, here in the cookbook, and for earthaccess. Some recent background: #158

A big point we came to through this conversation, via Sarah Murphy's reticulate + xarray blog post: it's ok that the R code is running Python code. The R syntax feels friendly to an R user; they aren't immediately concerned with (or even aware of) the fact that this is "just" Python code presented as R. They are hoping to do their science using the tool they know (here, R). In fact, it's more than ok, it's great: there is no need to rewrite things in R to be able to use the awesomeness of xarray and help awesome R users at the same time!

Context

Our current R code (https://nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/find-data/programmatic.html) works 🥳 :

## load R libraries
library(tidyverse) # install.packages("tidyverse") 
library(reticulate) # install.packages("reticulate")

## load python library
earthaccess <- reticulate::import("earthaccess") 

# use earthaccess to access data 
granules <- earthaccess$search_data(
  concept_id = "C2036880672-POCLOUD",
  temporal = reticulate::tuple("2017-01", "2017-02") # with an earthaccess update, this can be simply c() or list()
)

The granules object is an R list; each element holds a granule's UMM-G metadata (nested JSON-style dictionaries). granules[1] returns the following - with the big question "what do we do from here?" (see the sketch just after the printed metadata):

[[1]]
{'meta': {'concept-type': 'granule', 'concept-id': 'G2067294881-POCLOUD', 'revision-id': 3, 'native-id': 'ssh_grids_v1812_2017010412', 'provider-id': 'POCLOUD', 'format': 'application/vnd.nasa.cmr.umm+json', 'revision-date': '2022-05-03T23:57:41.054Z'}, 'umm': {'TemporalExtent': {'RangeDateTime': {'EndingDateTime': '2017-01-04T00:00:00.000Z', 'BeginningDateTime': '2017-01-04T00:00:00.000Z'}}, 'MetadataSpecification': {'URL': 'https://cdn.earthdata.nasa.gov/umm/granule/v1.6.4', 'Name': 'UMM-G', 'Version': '1.6.4'}, 'GranuleUR': 'ssh_grids_v1812_2017010412', 'ProviderDates': [{'Type': 'Insert', 'Date': '2021-06-11T19:06:25.572Z'}, {'Type': 'Update', 'Date': '2021-06-11T19:06:25.572Z'}], 'SpatialExtent': {'HorizontalSpatialDomain': {'Geometry': {'BoundingRectangles': [{'WestBoundingCoordinate': 0.083, 'SouthBoundingCoordinate': -79.917, 'EastBoundingCoordinate': 180, 'NorthBoundingCoordinate': 79.917}, {'WestBoundingCoordinate': -180, 'SouthBoundingCoordinate': -79.917, 'EastBoundingCoordinate': -0.083, 'NorthBoundingCoordinate': 79.917}]}}}, 'DataGranule': {'ArchiveAndDistributionInformation': [{'SizeUnit': 'MB', 'Size': 6.008148193359375e-05, 'Checksum': {'Value': 'e2741b0693626dd2984f4683f6142ef4', 'Algorithm': 'MD5'}, 'SizeInBytes': 63, 'Name': 'ssh_grids_v1812_2017010412.nc.md5'}, {'SizeUnit': 'MB', 'Size': 15.858612060546875, 'Checksum': {'Value': '2dd414d36ef9f077e5d8756565723d55', 'Algorithm': 'MD5'}, 'SizeInBytes': 16628960, 'Name': 'ssh_grids_v1812_2017010412.nc'}], 'DayNightFlag': 'Unspecified', 'ProductionDateTime': '2019-02-11T20:50:05.305Z'}, 'CollectionReference': {'Version': '1812', 'ShortName': 'SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL1812'}, 'RelatedUrls': [{'URL': 's3://podaac-ops-cumulus-protected/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL1812/ssh_grids_v1812_2017010412.nc', 'Type': 'GET DATA VIA DIRECT ACCESS', 'Description': 'This link provides direct download access via S3 to the granule.'}, {'URL': 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL1812/ssh_grids_v1812_2017010412.nc', 'Description': 'Download ssh_grids_v1812_2017010412.nc', 'Type': 'GET DATA'}, {'URL': 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL1812/ssh_grids_v1812_2017010412.nc.md5', 'Description': 'Download ssh_grids_v1812_2017010412.nc.md5', 'Type': 'EXTENDED METADATA'}, {'URL': 'https://archive.podaac.earthdata.nasa.gov/s3credentials', 'Description': 'api endpoint to retrieve temporary credentials valid for same-region direct s3 access', 'Type': 'VIEW RELATED INFORMATION'}]}, 'size': 15.858672142028809}
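As a first poke at that question, here is a minimal sketch of reaching into one granule from R. This rests on untested assumptions about how reticulate exposes these objects (each element appears to be a Python DataGranule, a dict subclass), so treat it as a starting point rather than a recipe:

g1 <- granules[[1]]                  # note: R lists are 1-indexed
reticulate::py_get_item(g1, "umm")   # the UMM-G metadata, converted to an R named list
g1$data_links()                      # the granule's download URLs (the method mentioned further below)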

Ideas for next steps

When we run Python functions using reticulate, the syntax uses $ in place of Python's ., for example: library$function()

earthaccess$open

The example above uses code from Luis' AGU poster:

[image: code snippet from Luis' AGU poster]

It would be awesome for this code to work so we could open a granule and look at it.

y <- earthaccess$open(granules[1]) ## testing with Luis and Bri; R is 1-indexed, so granules[1] (not granules[0]) is already a one-element list

Then...

xarray

Building from the above, really we'd like to not only open the file but also use xarray to stream the data (so we don't have to download it to look at it). This is possible, as Sarah Murphy's awesome post shows: https://cougrstats.netlify.app/post/2021-04-21-using-python-in-r-studio-with-reticulate/

So, we'd be able to run this code (copied from L5 of Luis' poster) with the R/reticulate syntax from the blog post:

xr <- reticulate::import('xarray')
ds <- xr$open_mfdataset(earthaccess$open(granules))
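One step that poster snippet assumes is authentication, plus pulling values back into R afterwards. A hedged sketch of that follow-on (earthaccess$login() mirrors Python's earthaccess.login(); the variable name SLA is an assumption about this particular sea surface height dataset):

earthaccess$login()        # run this before earthaccess$open(); uses Earthdata Login credentials
ds                         # inspect the xarray dataset structure
sla <- ds$SLA$values       # assumption: 'SLA' (sea level anomaly) is a variable here; returns an R array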

earthaccess$download

It would be great if we could do this from earthaccess:

earthaccess$download(granules, "Desktop")
## this doesn't work yet; we want a thin wrapper in earthaccess.
## In the meantime, as Bri said: the metadata is CMR JSON, so we can write R code
## to pull out the list of https links, then skip earthaccess$download and just use R.
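A hedged sketch of that "just use R" idea. It assumes reticulate converts the UMM dict cleanly to an R list and that Earthdata Login credentials live in ~/.netrc so the download redirect can authenticate:

## pull each granule's 'GET DATA' URL out of the CMR UMM JSON, then download with httr
get_data_url <- function(g) {
  urls <- reticulate::py_get_item(g, "umm")$RelatedUrls
  Filter(function(u) u$Type == "GET DATA", urls)[[1]]$URL
}
links <- vapply(granules, get_data_url, character(1))
httr::GET(links[1],
          httr::write_disk(basename(links[1]), overwrite = TRUE),
          httr::config(netrc = TRUE))   # netrc supplies the Earthdata Login credentials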

earthaccess$data_links

This idea works in Python as a method on a granule: granules[0].data_links()

It would be great if this could work in R:

earthaccess$data_links(granules[[1]]) ## dream code! (a module-level helper that doesn't exist yet)
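It may also be worth testing whether the existing Python method already works through reticulate without any new wrapper (a hedged guess, since reticulate maps Python methods onto $):

granules[[1]]$data_links() ## possible stopgap until a module-level helper exists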

Then we would need to open or download the NetCDF files. earthaccess does this for Python; could reticulate let us use earthaccess from R, or would we need R-native NetCDF approaches? In that case we would have to download the files first (a further step would be streaming them with xarray via the https links). For NetCDFs in R, see https://github.com/ropensci/tidync and https://pjbartlein.github.io/REarthSysSci/netCDF.html#reading-restructuring-and-writing-netcdf-files-in-r, and the sketch below.
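A minimal sketch of the R-native route, assuming one of the granules above has already been downloaded to the working directory (the filename comes from the metadata printed earlier):

library(tidync)               # install.packages("tidync")
nc <- tidync("ssh_grids_v1812_2017010412.nc")
nc                            # print the grids and variables
sla_df <- hyper_tibble(nc)    # pull the active grid into a tidy data frame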

Bri's awesome R CMR Trial:

We ran Bri's code together, which is awesome and works on the staging hub with Julie's permissions. It will be exciting to combine these steps with earthaccess. (Hover over the top-right of a code chunk to copy it!)

# Load libraries 
library(httr)
library(jsonlite)
library(dplyr)
library(data.table)

# Define parameters to be used in request; must be defined in a list format
cmrURL      <- 'https://cmr.earthdata.nasa.gov/search/granules.umm_json'   # CMR API endpoint url
parameters  <- list(concept_id='C2021957657-LPCLOUD',                     # HLS
                    concept_id='C2021957295-LPCLOUD',                     # HLS
                    temporal='2021-10-17T00:00:00Z,2021-10-19T23:59:59Z') # temporal range of interest

# Submit GET request, put retrieved list into getResponse.ls
getResponse.ls <- httr::GET(url=cmrURL, query=parameters)
cat('This request returned', getResponse.ls$headers$`cmr-hits`, 'granule hits, in',
    as.integer(as.numeric(getResponse.ls$headers$`cmr-hits`)/2000)+1, 'pages of results, and a',
    getResponse.ls$status_code, 'status code with', getResponse.ls$headers$`content-type`, 'content.')

# extract content from getResponse.ls and isolate granule URLs for a single page
Content.ls <- fromJSON(content(getResponse.ls, as="text")) # Convert content received from request to workable format
RelatedURLs.ls <- Content.ls$items$umm$RelatedUrls         # Go to component of list that has URLs of interest
# Define function (filteredURLs) to keep only the URLs whose "Type" key is "GET DATA"
filteredURLs <- function(x){
  dplyr::filter(x, Type=='GET DATA')
}
filtered.ls <- lapply(RelatedURLs.ls, filteredURLs) # Apply function to each granule's RelatedUrls
granules.df <- do.call(rbind, filtered.ls)          # combine all rows of the list into a single dataframe

# extract content from getResponse.ls and isolate granule URLs for requests that have multiple pages of responses
# hits <- as.numeric(getResponse.ls$header$`cmr-hits`)    # Define the number of hits in GET request
# page_size <- 2000                                       # Define page size 
# page_numbers.ls <- list(seq(1,((hits/page_size)+1),1))  # Make a list of page numbers
# page_numbers.ls

# For each page of results, perform the GET request and add filtered URLs to LIST
# (here only page 1; widen 1:1 to 1:n_pages using the page count computed above)
LIST = list()
for (n in 1:1){
  print(n)
  cmrURL      <- 'https://cmr.earthdata.nasa.gov/search/granules.umm_json' 
  getResponse.ls <- httr::GET(url=cmrURL, query=list(concept_id='C2021957657-LPCLOUD',                     
                                                     concept_id='C2021957295-LPCLOUD',                      
                                                     temporal='2021-10-17T00:00:00Z,2021-10-19T23:59:59Z', 
                                                     page_size= '2000',
                                                     page_num=n))
  Content.ls <- fromJSON(content(getResponse.ls, as="text")) 
  RelatedURLs.ls  <- Content.ls$items$umm$RelatedUrls 
  LIST[[n]]  <- lapply(RelatedURLs.ls, filteredURLs)
} 

# EXTRACT URLs from LIST into "completeURLlist"
x <- (unlist(LIST))
x <- as.data.frame(x)
completeURLlist <- as.data.frame(x[x$x %like% "https", ])  

#### Notes on asynchronous requests:
#### httr is not capable of asynchronous requests
#### need to use either async, crul, or curl
#### with reformatted request parameters

#### Check these links to get started
#### https://docs.ropensci.org/crul/articles/how-to-use-crul.html
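#### hedged sketch of the crul route (untested here; URLs and page numbers are illustrative):
# library(crul)
# urls <- paste0('https://cmr.earthdata.nasa.gov/search/granules.umm_json',
#                '?concept_id=C2021957657-LPCLOUD&page_size=2000&page_num=', 1:3)
# async_client <- crul::Async$new(urls = urls)
# responses <- async_client$get()                       # fire all page requests concurrently
# pages <- lapply(responses, function(r) fromJSON(r$parse("UTF-8")))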

@BriannaLind , @betolink , @andypbarrett , others - please expand ideas here or in linked issues as we tackle these going forward! 🎉

R geospatial resources