NASA-IMPACT / veda-data


Automate production of facebook population data layer and STAC generation #78

Closed abarciauskas-bgse closed 1 year ago

abarciauskas-bgse commented 2 years ago

Description

Many datasets will be accumulating data on an ongoing basis. This ticket is intended to automate the production of the facebook population data and should serve as a reference implementation so that other datasets can be automated easily in the future.

Acceptance Criteria

Checklist under previous ticket (old)

For each dataset, we will follow these steps:

    • [x] If the dataset is ongoing (i.e. new files are continuously added and should be included in the dashboard), design and construct the forward-processing workflow. I don't believe this dataset is ongoing at this time
    • [x] Verify the COG output with the science team by sharing in a visual interface.
    • [x] Verify the metadata output with STAC API developers and any systems which may be depending on this STAC metadata (e.g. the front-end dev team).
    • [x] Add the dataset to the ~production dashboard~ staging API
leothomas commented 2 years ago

Just a quick note: the facebook population density dataset receives semi-regular updates, which you can see by inspecting the S3 bucket:

>>> aws s3 ls s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/ --no-sign-request
                           PRE v1.5/
                           PRE v1/
2022-01-26 16:26:45     109890 hrsl_general-latest.vrt
2021-05-11 23:27:01     104980 hrsl_general-v1.5.1.vrt
2022-01-26 16:26:52     109890 hrsl_general-v1.5.10.vrt
2021-05-27 20:10:11     106425 hrsl_general-v1.5.2.vrt
2021-07-12 19:01:39     109827 hrsl_general-v1.5.3.vrt
2021-10-05 00:55:41     109827 hrsl_general-v1.5.4.vrt
2021-10-20 15:49:31     109835 hrsl_general-v1.5.5.vrt
2021-11-12 16:51:51     109847 hrsl_general-v1.5.6.vrt
2021-11-24 14:06:13     109847 hrsl_general-v1.5.7.vrt
2021-12-06 17:56:27     109847 hrsl_general-v1.5.8.vrt
2022-01-13 18:52:04     109859 hrsl_general-v1.5.9.vrt
2021-04-15 23:59:58     104888 hrsl_general-v1.vrt

Inspecting the v1.5/ subfolder shows GeoTIFFs corresponding to multiple versions of the .vrt files:

>>> aws s3 ls s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/v1.5/ --no-sign-request
...
2022-01-21 14:21:01   45203598 cog_globallat_10_lon_0_general-v1.5.4.tif
2022-01-21 14:21:01        188 cog_globallat_10_lon_0_general-v1.5.4.tif.aux.xml
2021-07-12 18:16:24   30892249 cog_globallat_10_lon_10_general-v1.5.1.tif
2021-07-12 18:16:24        188 cog_globallat_10_lon_10_general-v1.5.1.tif.aux.xml
2021-10-20 13:53:35   30105955 cog_globallat_10_lon_10_general-v1.5.2.tif
2021-10-20 13:53:36        188 cog_globallat_10_lon_10_general-v1.5.2.tif.aux.xml
2021-11-12 13:44:26   31200924 cog_globallat_10_lon_10_general-v1.5.3.tif
2021-11-12 13:44:26        188 cog_globallat_10_lon_10_general-v1.5.3.tif.aux.xml
...

I've been semi-periodically re-generating the facebook population density dataset to stay updated with new data delivered to the bucket, using the following commands:

# download the .vrt file (which points to data in both `v1/` and `v1.5/`, so we have to download both)
aws s3 cp s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/hrsl_general-latest.vrt . 

# download datafiles
aws s3 sync s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/v1 ./v1
aws s3 sync s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/v1.5 ./v1.5

GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR rio cogeo create ./hrsl_general-latest.vrt hrsl_general_latest_global_cog.tif --allow-intermediate-compression --blocksize 512 --overview-blocksize 512

It takes roughly 2.5 to 3 hours to run on my laptop.

You can also run the COG creation process directly against the S3 .vrt file, but this takes considerably longer (closer to 5 or 6 hours), likely because each of the data files has to be fetched individually from S3.

GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR rio cogeo create s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/hrsl_general-latest.vrt hrsl_general-latest.tif --blocksize 512 --overview-blocksize 512 --allow-intermediate-compression

It's definitely not ideal to have such a massive process running on my laptop for so long, but I haven't had time to investigate whether this could be done faster/cheaper on an EC2 instance or something along those lines.
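If this does get moved to an EC2 instance or container, the same conversion could be driven from Python via rio-cogeo's `cog_translate` instead of the CLI. This is only a minimal sketch, assuming the VRT and the `v1/` and `v1.5/` tiles have already been synced locally as above; profile and option names should be double-checked against the installed rio-cogeo version.

```python
import os

from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

# Avoid directory listings when GDAL opens the VRT's referenced files.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"

src = "./hrsl_general-latest.vrt"           # assumes v1/ and v1.5/ were synced next to it
dst = "hrsl_general_latest_global_cog.tif"

profile = cog_profiles.get("deflate")       # default rio-cogeo output profile
profile.update({"blockxsize": 512, "blockysize": 512})

cog_translate(
    src,
    dst,
    profile,
    allow_intermediate_compression=True,
    # overview blocksize, mirroring --overview-blocksize 512 in the CLI call
    config={"GDAL_TIFF_OVR_BLOCKSIZE": "512"},
)
```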

Misc. notes:

leothomas commented 2 years ago

Link to paper describing dataset development techniques and tools used: https://arxiv.org/pdf/1712.05839.pdf

slesaad commented 2 years ago

@abarciauskas-bgse

Done:

Issue: Although the STAC item exists (here), it doesn't seem to be indexed into the collection (here)

Ran into this issue with the blue tarp collection too, but there wasn't a solution per se (it somehow just worked after ingesting more data). There might be an underlying issue that causes the item to not be indexed into the collection. @anayeaye can we troubleshoot this?

slesaad commented 2 years ago

Automatic COG creation workflow

Like @abarciauskas-bgse suggested, trigger an ECS task when a file is uploaded to s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/. We should be able to use an ECS cluster running a Fargate task (a rough CDK-style sketch of this wiring follows the resource list below).

Experiments

I successfully tested triggering a Fargate task on upload to an S3 bucket (with a dummy file-creation task, not the actual population density COG generation, which takes too long to test with :D).

I had to create the following resources:

  1. S3 bucket (here)
  2. ECR (here): docker build and push
  3. ECS cluster (here) to run the Fargate task
  4. Fargate task definition (here): specifies the container (github: task definition)
  5. IAM role (here) for the task definition, so the task can access the S3 bucket
  6. EventBridge rule (here): on event (S3 upload), trigger target (ECS cluster task, Fargate invocation type). [Note: I didn't find an option to select a path inside the bucket, but there has to be some way to do that.]

Note: I had to turn on 'Event Notifications' on the S3 bucket; we do not know whether this is enabled for the dataforgood S3 bucket.
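A rough CDK (Python) sketch of how those resources might be wired together, to make re-creating them as code easier. The stack name, VPC/cluster setup, container image, task sizing, and bucket names are placeholders/assumptions rather than the values actually used, and (per the note above) the EventBridge rule only fires if EventBridge notifications are enabled on the source bucket.

```python
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_s3 as s3
from constructs import Construct


class FbPopulationCogStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "Vpc", max_azs=2)
        cluster = ecs.Cluster(self, "CogCluster", vpc=vpc)  # resource 3

        # Resource 4: Fargate task definition (container image is a placeholder;
        # in practice it would point at the ECR repo from resource 2).
        task_def = ecs.FargateTaskDefinition(
            self, "CogTaskDef", cpu=4096, memory_limit_mib=30720
        )
        task_def.add_container(
            "cog-builder",
            image=ecs.ContainerImage.from_registry(
                "<account>.dkr.ecr.<region>.amazonaws.com/fb-pop-cog:latest"
            ),
            logging=ecs.LogDrivers.aws_logs(stream_prefix="fb-pop-cog"),
        )

        # Resources 1 and 5: an output bucket plus read/write access for the task role.
        out_bucket = s3.Bucket(self, "CogOutputBucket")
        out_bucket.grant_read_write(task_def.task_role)

        # Resource 6: EventBridge rule for S3 "Object Created" events, filtered to the
        # hrsl_general/ prefix (this is one way to scope the rule to a path in the bucket).
        rule = events.Rule(
            self,
            "OnHrslUpload",
            event_pattern=events.EventPattern(
                source=["aws.s3"],
                detail_type=["Object Created"],
                detail={
                    "bucket": {"name": ["dataforgood-fb-data"]},
                    "object": {"key": [{"prefix": "hrsl-cogs/hrsl_general/"}]},
                },
            ),
        )
        # The EcsTask target infers the Fargate launch type from the task definition.
        rule.add_target(targets.EcsTask(cluster=cluster, task_definition=task_def))
```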

Thoughts and discussion

For the COG creation, it seems we need to download 5+ GB of data from the dataforgood bucket (which takes quite some time) and then run the COG generation command, which takes several hours and produces a COG of around 15+ GB. The data update frequency appears to be roughly once a month on average.

We might need to think about persisting the downloaded data (which is 5+ GB and will likely keep growing) and only syncing the new data (a rough boto3 sketch of checking for new objects is at the end of this comment). We need to do some calculations to see whether that is worthwhile/cheaper/more efficient.

If we do want data persistence,

Working on the GitHub branch: https://github.com/NASA-IMPACT/cloud-optimized-data-pipelines/tree/facebook-population-density/data-workflows/facebook-population-density
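One small building block for the "only sync the new data" idea: `aws s3 sync` already transfers only new or changed objects, so if the local copy is persisted (e.g. on a volume attached to the Fargate task), each update becomes incremental. Below is a rough boto3 sketch, an assumption rather than existing project code, of listing what changed since the last run; the cutoff timestamp is arbitrary.

```python
from datetime import datetime, timezone

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "dataforgood-fb-data"
PREFIX = "hrsl-cogs/hrsl_general/"


def new_objects_since(last_run: datetime):
    """Yield keys under PREFIX modified after `last_run` (public bucket, unsigned requests)."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > last_run:
                yield obj["Key"]


# Example: anything delivered since the start of 2022 (placeholder cutoff).
for key in new_objects_since(datetime(2022, 1, 1, tzinfo=timezone.utc)):
    print(key)
```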

abarciauskas-bgse commented 2 years ago

Thanks @slesaad. I think the next step would be to create CDK code for all the resources you created manually (I think you created them manually?).

For now, I think we can add documentation for what you have done into that directory and open a PR. You can copy what you have in the comment above, ideally with more description of the resources created, so that someone can re-create each resource with CDK without having to look it up in AWS.

Do you have the scripts used to create the STAC collection and item metadata for this dataset?

aboydnw commented 2 years ago

@slesaad @abarciauskas-bgse is this already automated? aka can we close this ticket?

slesaad commented 2 years ago

@aboydnw it is not; this was shelved to focus on more important tasks.

aboydnw commented 2 years ago

Updated the initial comment with acceptance criteria on this ticket, but we should add more as needed.

aboydnw commented 2 years ago

Lower priority than other tasks like automating ingestion of GHG and/or Fire data

j08lue commented 1 year ago

Stale