NASA-IMPACT / admg-casei

ADMG Inventory
https://impact.earthdata.nasa.gov/casei/
Apache License 2.0

How to efficiently download deployment data via API #581

Closed heidimok closed 8 months ago

heidimok commented 10 months ago

Context

For the epic https://github.com/NASA-IMPACT/admg-casei/issues/576, we are directly accessing and downloading the data for 5 campaigns via URL and processing it to reduce the file size.

Technical Discovery

A way to access the data via API instead

praveenphatate commented 10 months ago

https://colab.research.google.com/drive/1h-axDa69rKxbB-o-OZPnmrfVU5aXauMx?authuser=1#scrollTo=Bbay8-Ov4le1

Jeaton1021 commented 10 months ago

Have this discussion during deepdive on Thursday @heidimok @smwingo

praveenphatate commented 10 months ago

We can fetch a campaign's deployments, which contain the date ranges, by using the campaign short_name:

activate = Campaign.objects.get(short_name = "ACTIVATE")
deployments = Deployment.objects.all().filter(campaign = activate.uuid)

Here is an example output

for deployment in deployments:
  print(deployment.start_date, deployment.end_date, deployment.short_name)
 2021-05-13 2021-06-30 ACTIVATE_dep_2021b
 2021-11-30 2022-03-29 ACTIVATE_dep_2021c
 2020-08-13 2020-09-30 ACTIVATE_dep_2020b
 2022-05-03 2022-06-18 ACTIVATE_dep_2022
 2020-02-14 2020-03-12 ACTIVATE_dep_2020a
 2021-01-27 2021-04-02 ACTIVATE_dep_2021a

Also, FYI: the meta fields of Deployment include flight_tracks, which is currently empty:

 <django.db.models.fields.UUIDField: uuid>,
 <django.db.models.fields.CharField: short_name>,
 <django.db.models.fields.CharField: long_name>,
 <django.db.models.fields.TextField: notes_internal>,
 <django.db.models.fields.TextField: notes_public>,
 <django.db.models.fields.related.ForeignKey: campaign>,
 <django.db.models.fields.DateField: start_date>,
 <django.db.models.fields.DateField: end_date>,
 <django.contrib.gis.db.models.fields.PolygonField: spatial_bounds>,
 <django.db.models.fields.TextField: study_region_map>,
 <django.db.models.fields.TextField: ground_sites_map>,
 <django.db.models.fields.TextField: flight_tracks>
heidimok commented 8 months ago

To close out the last PI, I'm going to be closing the epic. But since this issue feeds into the next PI, I'll link it to a new epic that relates to visualizing the rest of the available flight tracks in CASEI beyond just the 5 we prototyped here.

heidimok commented 8 months ago

Update Jan 18 - @praveenphatate to add documentation and close out before new PI

praveenphatate commented 8 months ago

Efficiently download Deployment data

Problem Statement

Solution

Since the CASEI backend already has a database containing the required information, we can create a simple CSV file to hold all of that data. Steps:

deployments = Deployment.objects.all().filter(campaign = camp.uuid)

col_prd = deployments[0].collection_periods.all()

dois = DOI.objects.filter(collection_periods=col_prd[0])

concept_ids = [doi.concept_id for doi in dois]

# Filter Change objects based on concept IDs

drafts = Change.objects.filter(
    content_type__model='doi',
    action__in=[Change.Actions.CREATE, Change.Actions.UPDATE],
    update__concept_id__in=concept_ids,
)

for drf in drafts:
    if drf.status == 6 and 'Meteorological and Navigational Data' in drf.update.get('cmr_entry_title', ''):
        print(drf.status, drf.action, drf.updated_at, drf.update['concept_id'], drf.update['cmr_entry_title'])

OUT:
6 Create 2021-06-21 23:29:24.087000+00:00 C1954736081-LARC_ASDC CAMP2Ex P-3 In-Situ Meteorological and Navigational Data


This can be expanded to include the Campaigns and Deployments by using the concept_id or collection_id and deployment start_date and end_date.
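As a rough illustration of the CSV idea, the sketch below writes one row per deployment and collection using only the standard library. The column names and the `write_deployment_csv` helper are assumptions made for this example; in practice the rows would come from the Django queries above rather than being typed in by hand.

```python
import csv
import io

# Hypothetical column layout; adjust to whatever the downstream script needs.
FIELDNAMES = ["campaign", "deployment", "start_date", "end_date", "concept_id"]


def write_deployment_csv(rows, out):
    """Write deployment rows (a list of dicts keyed by FIELDNAMES) to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Example row built from values that appear elsewhere in this thread.
rows = [
    {
        "campaign": "ACTIVATE",
        "deployment": "ACTIVATE_dep_2020a",
        "start_date": "2020-02-14",
        "end_date": "2020-03-12",
        "concept_id": "C1994460996-LARC_ASDC",
    },
]

buffer = io.StringIO()
write_deployment_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a file-like object keeps the helper usable both for an in-memory check and for `open("deployments.csv", "w", newline="")`.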

## Option 1
Use the CMR JSON query to fetch all the location URLs for each deployment, then either download all the .ict files or create a .yaml file containing all the .ict locations.
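The query URL for a given deployment can be assembled from the collection concept ID and the deployment's start_date and end_date. The following is a minimal sketch of that step; `cmr_temporal` and `build_granule_query` are hypothetical helper names, and the whole-day time bounds are an assumption based on the example period used in this comment.

```python
from datetime import date
from urllib.parse import quote, urlencode

CMR_GRANULES_URL = "https://cmr.earthdata.nasa.gov/search/granules"


def cmr_temporal(start_date, end_date):
    """Format a deployment date range as a CMR temporal string covering whole days."""
    return f"{start_date.isoformat()}T00:00:00Z,{end_date.isoformat()}T23:59:59Z"


def build_granule_query(concept_id, start_date, end_date, platform=None, page_size=200):
    """Build a granule search URL for one collection and one deployment period."""
    params = {
        "collection_concept_id": concept_id,
        "temporal[]": cmr_temporal(start_date, end_date),
        "page_size": page_size,
    }
    if platform:
        params["platform"] = platform
    # quote_via=quote encodes spaces as %20 instead of "+"
    return f"{CMR_GRANULES_URL}?{urlencode(params, quote_via=quote)}"


url = build_granule_query(
    "C1994460996-LARC_ASDC", date(2020, 2, 14), date(2020, 3, 12), platform="King Air"
)
print(url)
```

The generated URL percent-encodes the brackets, colons, and comma; it carries the same parameters as the hand-written URL used in this option.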

import json
import xml.etree.ElementTree as ET

import requests

# This is the URL for ACTIVATE B-200 (King Air) for the time period 2020-02-14 00:00:00 to 2020-03-12 23:59:59
url = "https://cmr.earthdata.nasa.gov/search/granules?collection_concept_id=C1994460996-LARC_ASDC&platform=King%20Air&temporal[]=2020-02-14T00:00:00Z,2020-03-12T23:59:59Z&page_size=200"
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the XML content
    root = ET.fromstring(response.content)

    # Extract location from XML data
    locations = [reference.find("location").text for reference in root.findall(".//reference")]

    # Iterate through each location URL
    for location_url in locations:
        # Make a request to the location URL
        location_response = requests.get(location_url)

        # Check if the location request was successful (status code 200)
        if location_response.status_code == 200:
            # Parse the JSON content
            location_data = json.loads(location_response.text)

            # Extract the .ict download URL from the JSON dict response
            download_url = location_data.get("RelatedUrls", [{}])[0].get("URL", "")

            print(f"Download URL: {download_url}")

Sample Output

Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-Dropsonde_UC12_202002141729_R1.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-Dropsonde_UC12_202002141747_R1.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-Dropsonde_UC12_202002141826_R1.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-Dropsonde_UC12_202002141925_R1.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-METNAV_UC12_20200214_R0.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-Dropsonde_UC12_202002151725_R1.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-Dropsonde_UC12_202002151751_R1.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-Dropsonde_UC12_202002151823_R1.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-Dropsonde_UC12_202002151855_R1.ict
Download URL: https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/ACTIVATE-METNAV_UC12_20200215_R0.ict

These can then be filtered based on a keyword such as METNAV or SUMMARY.
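A small sketch of that filtering step is shown here. The `filter_by_keyword` helper is a hypothetical name; it matches against the file name only, because the directory part of these URLs (`MetNav_AircraftInSitu_KingAir_Data_1`) also contains "MetNav" and would otherwise make every file match a case-insensitive search.

```python
import posixpath
from urllib.parse import urlparse


def filter_by_keyword(urls, keyword):
    """Keep only URLs whose file name (not directory) contains the keyword, ignoring case."""
    keyword = keyword.upper()
    return [
        url
        for url in urls
        if keyword in posixpath.basename(urlparse(url).path).upper()
    ]


# Three URLs taken from the sample output above.
base = "https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/"
urls = [
    base + "ACTIVATE-Dropsonde_UC12_202002141729_R1.ict",
    base + "ACTIVATE-METNAV_UC12_20200214_R0.ict",
    base + "ACTIVATE-Dropsonde_UC12_202002151725_R1.ict",
]

metnav_urls = filter_by_keyword(urls, "METNAV")
print(metnav_urls)
```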

## Option 2
Use a POST request to fetch and download all the .ict files, or to create a .yaml file containing all the .ict locations.

This particular POST request takes a "readable_granule_name" parameter, which accepts a keyword such as "ACTIVATE-SUMMARY" or "ACTIVATE-METNAV".

body = {
    'params': {
        'concept_id': [],
        'echo_collection_id': 'C1994460739-LARC_ASDC',
        'exclude': {},
        'options': {'readable_granule_name': {'pattern': 'true'}},
        'page_num': 1,
        'page_size': 20,
        'readable_granule_name': ['ACTIVATE-SUMMARY*'],
        'sort_key': 'start_date',
        'temporal': '2021-11-30T00:00:00.000Z,2022-03-29T23:59:59.999Z',
        'two_d_coordinate_system': {},
    }
}
res = requests.post(
    "https://d53njncz5taqi.cloudfront.net/granules",
    headers={"Authorization": "Bearer YOUR EARTH DATA TOKEN"},
    data=json.dumps(body),
)
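To reuse this request across deployments, the body could be generated from the collection ID, the keyword, and the deployment dates. The sketch below only builds the payload; `build_granule_body` is a hypothetical helper, and it assumes the same field layout and whole-day bounds as the hard-coded example above.

```python
import json
from datetime import date


def build_granule_body(collection_id, keyword, start_date, end_date, page_num=1, page_size=20):
    """Build the POST body for one collection, keyword prefix, and deployment period."""
    temporal = (
        f"{start_date.isoformat()}T00:00:00.000Z,"
        f"{end_date.isoformat()}T23:59:59.999Z"
    )
    return {
        "params": {
            "concept_id": [],
            "echo_collection_id": collection_id,
            "exclude": {},
            "options": {"readable_granule_name": {"pattern": "true"}},
            "page_num": page_num,
            "page_size": page_size,
            # Trailing wildcard so the keyword is matched as a file-name prefix
            "readable_granule_name": [f"{keyword}*"],
            "sort_key": "start_date",
            "temporal": temporal,
            "two_d_coordinate_system": {},
        }
    }


# Values for ACTIVATE_dep_2021c, matching the hard-coded example.
body = build_granule_body(
    "C1994460739-LARC_ASDC", "ACTIVATE-SUMMARY", date(2021, 11, 30), date(2022, 3, 29)
)
payload = json.dumps(body)
print(payload)
```

Incrementing `page_num` until a page comes back with fewer than `page_size` results is one way to collect every granule, assuming the endpoint pages the way these parameters suggest.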



## Challenges/Issues
- Fetching and downloading the .ict files requires keywords like METNAV or SUMMARY, but these keywords are not available in the CASEI database. How can we obtain them efficiently so that the .ict files can be filtered?
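One possible way to discover candidate keywords, offered only as a sketch, is to list the distinct file-name prefixes found in a collection's granule URLs. This assumes the files follow the ICARTT naming pattern seen in the sample output (data ID, then platform, then date, separated by underscores); `data_id_prefixes` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlparse


def data_id_prefixes(urls):
    """Return the sorted, distinct leading file-name tokens (text before the first underscore)."""
    prefixes = set()
    for url in urls:
        filename = posixpath.basename(urlparse(url).path)
        prefixes.add(filename.split("_", 1)[0])
    return sorted(prefixes)


# URLs taken from the Option 1 sample output.
base = "https://asdc.larc.nasa.gov/data/ACTIVATE/MetNav_AircraftInSitu_KingAir_Data_1/"
urls = [
    base + "ACTIVATE-Dropsonde_UC12_202002141729_R1.ict",
    base + "ACTIVATE-METNAV_UC12_20200214_R0.ict",
    base + "ACTIVATE-Dropsonde_UC12_202002151725_R1.ict",
]

keywords = data_id_prefixes(urls)
print(keywords)
```

A single page of granules per collection may be enough to see which prefixes exist, after which a person can choose the one that holds the flight track data.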

## Preferred Solution 
- Use Option 2, as it is much more efficient than Option 1: we don't need to make a JSON request for each deployment to get the list of files and then filter through them.