AstroDigital / modis-ingestor

Scripts and other artifacts for MODIS data ingestion into Amazon public hosting.
MIT License

add index #31

Open matthewhanson opened 7 years ago

matthewhanson commented 7 years ago

Need to add index file at the top level of the bucket containing sceneids, download URL and a small selection of metadata.

landsat includes: entityId,acquisitionDate,cloudCover,processingLevel,path,row,min_lat,min_lon,max_lat,max_lon,download_url

Cloud cover doesn't really apply to MODIS, at least not the MCD43A4 product since it is a composite product. The daily products don't have cloud cover estimates that I recall.

matthewhanson commented 7 years ago

Looking for some feedback on what should be included in this index @jflasher @drewbo @scisco. So far, definitely:

- granuleid
- acquisitiondate: it's in the filename, but worth having as a separate field
- download_url (for landsat-pds this is the https address, so it points to the index.html file)

I'm not convinced that bounding box coords are needed, but am open to including them if there's strong enough feeling about it. I think it's a lot of redundant info since the tile x, y coordinates define the bbox.

cloudcover is not pertinent, processinglevel for MODIS is essentially defined by the product name (e.g., MCD43A4), but that's included in the granule ID (e.g., MCD43A4.A2017006.h01v08.006.2017018075239 )
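Since the product name, acquisition date and tile are all encoded in the granule ID, a small parser can recover them without extra index columns. The following is only a sketch: the `Granule` type and `parse_granule_id` function are hypothetical names, and the pattern assumes the `PRODUCT.AYYYYDDD.hHHvVV.VVV.YYYYDDDHHMMSS` layout seen in the example ID above.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Granule(NamedTuple):
    """Fields recovered from a MODIS granule ID (hypothetical helper type)."""
    product: str
    acquired: datetime
    h: int
    v: int
    version: str
    produced: datetime


# Assumed layout: PRODUCT.AYYYYDDD.hHHvVV.VVV.YYYYDDDHHMMSS
_GID = re.compile(
    r"^(?P<product>[A-Z0-9]+)"
    r"\.A(?P<acq>\d{7})"
    r"\.h(?P<h>\d{2})v(?P<v>\d{2})"
    r"\.(?P<version>\d{3})"
    r"\.(?P<prod>\d{13})$"
)


def parse_granule_id(gid: str) -> Granule:
    """Split a granule ID into product, dates, tile indices and version."""
    m = _GID.match(gid.strip())
    if m is None:
        raise ValueError(f"unrecognized granule ID: {gid!r}")
    return Granule(
        product=m["product"],
        # %j is the day of year, so A2017006 is 6 January 2017
        acquired=datetime.strptime(m["acq"], "%Y%j"),
        h=int(m["h"]),
        v=int(m["v"]),
        version=m["version"],
        produced=datetime.strptime(m["prod"], "%Y%j%H%M%S"),
    )


g = parse_granule_id("MCD43A4.A2017006.h01v08.006.2017018075239")
print(g.product, g.acquired.date(), g.h, g.v)
```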

I'd also recommend tossing in a shapefile or geojson file in the top level (in s3://modis-pds/) of the MODIS tile grid, possibly in both the original sinusoidal and a cleaned up geographic (including both is helpful because users who transform the sin grid to geographic will find a pretty wonky looking grid due to the prime meridian issue).

scisco commented 7 years ago

Agreed. If we have the image coordinates in the metadata, bounding box is redundant. Having cloudCover in the metadata saves time. Also love the idea of having a top level geojson.

jflasher commented 7 years ago

There was a request for night/day for Landsat as well, but don't think that applies here? For the coordinates, I think it'd be nice if someone can use the index file to find all the scenes that contain a lat/lon. I think this is easiest if the bbox is in the index file, but if someone can get the matching x,y from a top level geojson and then search for those x, y matches in the index file, I think that could work as well?

matthewhanson commented 7 years ago

night/day doesn't apply here, only to the Land Surface Temp products, in which case the daytime and nighttime temps are actually included in the same product (MOD11A1 and MYD11A1) as separate bands.

I think it's better to find the tile ID for a geo request and then find the granules with a matching tile ID, rather than searching tens of thousands of tiles when there are only about 300 unique ones.
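As a rough illustration of the "find the tile first" approach, the tile for a lat/lon can also be computed directly from the sinusoidal projection rather than looked up in a geojson. This is a sketch under assumptions: the function names are made up, the grid is taken to be 36 x 18 equal tiles on a sphere of radius 6371007.181 m, and results within a few tens of metres of a tile edge may be off by one tile.

```python
import math

EARTH_RADIUS = 6371007.181  # metres, sphere used by the MODIS sinusoidal grid
H_TILES, V_TILES = 36, 18
# Assumption: tiles are equal slices of the full projected width
TILE_SIZE = 2 * math.pi * EARTH_RADIUS / H_TILES


def latlon_to_tile(lat: float, lon: float) -> tuple[int, int]:
    """Return the (h, v) tile indices containing a geographic point."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("lat/lon out of range")
    # Forward sinusoidal projection
    x = EARTH_RADIUS * math.radians(lon) * math.cos(math.radians(lat))
    y = EARTH_RADIUS * math.radians(lat)
    # h counts from the left edge, v counts down from the top edge
    h = int((x + math.pi * EARTH_RADIUS) // TILE_SIZE)
    v = int((math.pi * EARTH_RADIUS / 2 - y) // TILE_SIZE)
    # Clamp points lying exactly on the right or bottom edge
    return min(h, H_TILES - 1), min(v, V_TILES - 1)


def tile_id(lat: float, lon: float) -> str:
    h, v = latlon_to_tile(lat, lon)
    return f"h{h:02d}v{v:02d}"


def granules_for_point(gids, lat: float, lon: float) -> list[str]:
    """Keep only the granule IDs whose tile field matches the point's tile."""
    tid = tile_id(lat, lon)
    return [gid for gid in gids if f".{tid}." in gid]
```

With this, a lookup only needs a string match on the tile part of each granule ID, so no bounding box columns are required in the index.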

Additionally, since MCD43 is a composite over 16 days that is (theoretically) cloud free, a cloud cover % isn't available, nor does it apply.

jflasher commented 7 years ago

👍

matthewhanson commented 7 years ago

Index files for each date are added under the product name, e.g., s3://modis-pds/MCD43A4.006/2017-02-10_scenes.txt

This is because a day is processed in a batch. If the entire day is processed, then the index file for that day is uploaded. This avoids the complications of repeatedly writing to the same file and of having an index that does not cover every granule. If the process is interrupted, then no day index gets uploaded and the whole day will be reprocessed during normal gap-filling*.
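Because a day index exists only when the whole day succeeded, gap-filling can be reduced to checking which expected index keys are absent. A minimal sketch, assuming the `<product>/<YYYY-MM-DD>_scenes.txt` key pattern from the example above; the function names are hypothetical, and listing the bucket is left to the caller.

```python
from datetime import date, timedelta


def expected_index_key(product: str, day: date) -> str:
    """Key of the per-day index, following the pattern in the example above."""
    return f"{product}/{day.isoformat()}_scenes.txt"


def missing_days(product: str, start: date, end: date, existing_keys) -> list[date]:
    """Return every day in [start, end] that has no day index uploaded."""
    existing = set(existing_keys)
    missing = []
    day = start
    while day <= end:
        if expected_index_key(product, day) not in existing:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# existing_keys would come from a bucket listing; here it is hard-coded
keys = ["MCD43A4.006/2017-02-09_scenes.txt", "MCD43A4.006/2017-02-11_scenes.txt"]
print(missing_days("MCD43A4.006", date(2017, 2, 9), date(2017, 2, 11), keys))
```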

An example of a day scene index looks like this:

date,download_url,gid
2017-01-07 00:00:00,https://modis-pds.s3.amazonaws.com/MCD43A4.006/23/01/2017007/index.html,MCD43A4.A2017007.h23v01.006.2017018073630
2017-01-07 00:00:00,https://modis-pds.s3.amazonaws.com/MCD43A4.006/22/01/2017007/index.html,MCD43A4.A2017007.h22v01.006.2017018073712
2017-01-07 00:00:00,https://modis-pds.s3.amazonaws.com/MCD43A4.006/18/09/2017007/index.html,MCD43A4.A2017007.h18v09.006.2017018075910
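For reference, a day index in this format can be read with the standard `csv` module. The sketch below uses the first two rows of the example as inline sample data; `read_day_index` is a hypothetical helper name, and the date format is assumed from those rows.

```python
import csv
import io
from datetime import datetime

SAMPLE = """date,download_url,gid
2017-01-07 00:00:00,https://modis-pds.s3.amazonaws.com/MCD43A4.006/23/01/2017007/index.html,MCD43A4.A2017007.h23v01.006.2017018073630
2017-01-07 00:00:00,https://modis-pds.s3.amazonaws.com/MCD43A4.006/22/01/2017007/index.html,MCD43A4.A2017007.h22v01.006.2017018073712
"""


def read_day_index(text: str) -> list[dict]:
    """Parse a day scene index into dicts with a real datetime for 'date'."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "date": datetime.strptime(row["date"], "%Y-%m-%d %H:%M:%S"),
            "download_url": row["download_url"],
            "gid": row["gid"],
        })
    return rows


for scene in read_day_index(SAMPLE):
    print(scene["date"].date(), scene["gid"])
```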

As we talked about, however, a single scene index is desirable, so each day all the available daily scene index files are concatenated into a single scene.txt, which is uploaded to the product "folder" (e.g., s3://modis-pds/MCD43A4.006).**

*I've not yet added a cronjob to do this, but it would add redundancy by running every few days and checking whether all days are accounted for.

** Also have not yet added the job to do this, but there are some scripts available to do this efficiently (e.g., https://gist.github.com/jasonrdsouza/f2c77dedb8d80faebcf9)
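The concatenation step itself is simple as long as the header row is kept only once. Here is a minimal in-memory sketch; `concatenate_indexes` is a hypothetical name, it does not touch S3, and it assumes every day file starts with the same header line.

```python
def concatenate_indexes(day_texts) -> str:
    """Merge day index files into one text, keeping a single header row."""
    header = None
    body = []
    for text in day_texts:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue  # skip empty day files
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError(f"unexpected header: {lines[0]!r}")
        body.extend(lines[1:])
    if header is None:
        return ""
    return "\n".join([header] + body) + "\n"


day1 = "date,download_url,gid\n2017-01-07 00:00:00,url-a,gid-a\n"
day2 = "date,download_url,gid\n2017-01-08 00:00:00,url-b,gid-b\n"
print(concatenate_indexes([day1, day2]), end="")
```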

Will leave this ticket open until the scene.txt file is added.

cc @jflasher @drewbo

jflasher commented 7 years ago

😻