Just a quick note - the Facebook population density dataset receives semi-regular updates, which you can see by inspecting the S3 bucket:
>>> aws s3 ls s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/ --no-sign-request
PRE v1.5/
PRE v1/
2022-01-26 16:26:45 109890 hrsl_general-latest.vrt
2021-05-11 23:27:01 104980 hrsl_general-v1.5.1.vrt
2022-01-26 16:26:52 109890 hrsl_general-v1.5.10.vrt
2021-05-27 20:10:11 106425 hrsl_general-v1.5.2.vrt
2021-07-12 19:01:39 109827 hrsl_general-v1.5.3.vrt
2021-10-05 00:55:41 109827 hrsl_general-v1.5.4.vrt
2021-10-20 15:49:31 109835 hrsl_general-v1.5.5.vrt
2021-11-12 16:51:51 109847 hrsl_general-v1.5.6.vrt
2021-11-24 14:06:13 109847 hrsl_general-v1.5.7.vrt
2021-12-06 17:56:27 109847 hrsl_general-v1.5.8.vrt
2022-01-13 18:52:04 109859 hrsl_general-v1.5.9.vrt
2021-04-15 23:59:58 104888 hrsl_general-v1.vrt
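If it helps anyone scripting against the bucket, something like this boto3 sketch (untested, and the bucket is public so it uses an unsigned request) mirrors the `aws s3 ls --no-sign-request` call above:
```python
# Sketch: list the published .vrt versions without AWS credentials.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="dataforgood-fb-data",
    Prefix="hrsl-cogs/hrsl_general/hrsl_general-",
)
for obj in resp.get("Contents", []):
    print(obj["LastModified"], obj["Size"], obj["Key"])
```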
Inspecting the v1.5/ subfolder shows GeoTIFFs corresponding to multiple versions of the .vrt files:
>>> aws s3 ls s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/v1.5/ --no-sign-request
...
2022-01-21 14:21:01 45203598 cog_globallat_10_lon_0_general-v1.5.4.tif
2022-01-21 14:21:01 188 cog_globallat_10_lon_0_general-v1.5.4.tif.aux.xml
2021-07-12 18:16:24 30892249 cog_globallat_10_lon_10_general-v1.5.1.tif
2021-07-12 18:16:24 188 cog_globallat_10_lon_10_general-v1.5.1.tif.aux.xml
2021-10-20 13:53:35 30105955 cog_globallat_10_lon_10_general-v1.5.2.tif
2021-10-20 13:53:36 188 cog_globallat_10_lon_10_general-v1.5.2.tif.aux.xml
2021-11-12 13:44:26 31200924 cog_globallat_10_lon_10_general-v1.5.3.tif
2021-11-12 13:44:26 188 cog_globallat_10_lon_10_general-v1.5.3.tif.aux.xml
...
I've been periodically re-generating the Facebook population density dataset to stay up to date with new data delivered to the bucket, using the following commands:
# download the .vrt file (which points to data in both `v1/` and `v1.5/`, so we have to download both)
aws s3 cp s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/hrsl_general-latest.vrt .
# download datafiles
aws s3 sync s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/v1 ./v1
aws s3 sync s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/v1.5 ./v1.5
GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR rio cogeo create ./hrsl_general-latest.vrt hrsl_general_latest_global_cog.tif --allow-intermediate-compression --blocksize 512 --overview-blocksize 512
It takes about 2.5 to 3 hours to run on my laptop.
You can run the COG creation process directly from the S3 .vrt file too, but this takes considerably longer (closer to 5 or 6 hours), likely because each of the data files has to be downloaded individually from S3.
GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR rio cogeo create s3://dataforgood-fb-data/hrsl-cogs/hrsl_general/hrsl_general-latest.vrt hrsl_general-latest.tif --blocksize 512 --overview-blocksize 512 --allow-intermediate-compression
It's definitely not ideal to have such a massive process running on my laptop for so long, but I haven't had time to investigate if this can be done quicker/cheaper using an EC2 instance, or something along those lines.
Note that the `GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR` setting is used in both commands above, and the downloaded .vrt references each data file with the `relativeToVRT="1"` flag when defining the source file:
<SourceFilename relativeToVRT="1">v1/cog_globallat_50_lon_90_general-v1.1.tif</SourceFilename>
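Because the sources are relative, a quick sanity check before running `rio cogeo` can confirm that every file referenced by the VRT was actually synced. A rough sketch (assuming the .vrt sits in the working directory alongside `v1/` and `v1.5/`):
```python
# Sketch: verify all relative sources referenced by the VRT exist locally.
import os
import xml.etree.ElementTree as ET

tree = ET.parse("hrsl_general-latest.vrt")
missing = [
    el.text
    for el in tree.iter("SourceFilename")
    if el.get("relativeToVRT") == "1" and not os.path.exists(el.text)
]
print(f"{len(missing)} missing source files")
```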
Link to paper describing dataset development techniques and tools used: https://arxiv.org/pdf/1712.05839.pdf
@abarciauskas-bgse
Done:
Issue: Although the STAC item exists (here), it doesn't seem to be indexed into the collection (here)
Ran into this issue with the blue tarp collection too, but there wasn't a solution per se (it somehow just worked after ingesting more data). There might be an underlying issue that causes the item to not be indexed into the collection. @anayeaye can we troubleshoot this?
Like @abarciauskas-bgse suggested, trigger an ECS task when a file is uploaded to s3://dataforgood-fb-data/hrsl-cogs/hrsl_general. We should be able to use an ECS cluster running a Fargate task.
I successfully tested triggering a Fargate task on upload to an S3 bucket (a random file-creation task, not the actual population density COG creation, since that takes a long time to test :D).
I had to create the following resources:
Note: I had to turn on 'Event Notification' on the S3 bucket - we do not know if this is enabled for the dataforgood s3 bucket.
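The trigger itself can be very small. A hypothetical Lambda handler along these lines (cluster, task definition, and subnet names are placeholders, not the resources actually created) would start the Fargate task whenever the S3 event notification fires:
```python
# Hypothetical handler: start the COG-generation Fargate task on S3 upload.
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    ecs.run_task(
        cluster="hrsl-cog-cluster",            # placeholder cluster name
        taskDefinition="hrsl-cog-generation",  # placeholder task definition
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],  # placeholder subnet
                "assignPublicIp": "ENABLED",
            }
        },
    )
```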
For the COG creation, it seems that we need to download 5+ GB of data from the dataforgood bucket (which takes quite some time), then run the COG generation command, which takes a few hours and produces a COG of around 15+ GB. The data appears to be updated roughly once a month.
We might need to think about persisting the downloaded data (which is 5+ GB and will likely keep growing) so that we only sync the new data. We need to do some calculations to see whether that's actually cheaper/more efficient.
If we do want data persistence,
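Either way, a cheap first check before syncing or regenerating anything would be to compare the ETag of hrsl_general-latest.vrt against the one recorded at the last run. A sketch (the local state file is just a placeholder; real state could live in SSM or DynamoDB):
```python
# Sketch: only kick off regeneration when hrsl_general-latest.vrt has changed.
import json
import pathlib
import boto3
from botocore import UNSIGNED
from botocore.config import Config

STATE = pathlib.Path("last_etag.json")  # placeholder state location
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
head = s3.head_object(
    Bucket="dataforgood-fb-data",
    Key="hrsl-cogs/hrsl_general/hrsl_general-latest.vrt",
)
previous = json.loads(STATE.read_text())["etag"] if STATE.exists() else None
if head["ETag"] != previous:
    print("New .vrt detected - trigger regeneration")
    STATE.write_text(json.dumps({"etag": head["ETag"]}))
```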
# Working on the github branch: https://github.com/NASA-IMPACT/cloud-optimized-data-pipelines/tree/facebook-population-density/data-workflows/facebook-population-density
Thanks @slesaad, I think the next step would be to create CDK code for all the resources you created manually (I think you created them manually?).
For now I think we can add documentation for what you have done into that directory and open a PR (you can copy what you have in the comment above, but with more description of the resources created, so that ideally you don't have to look at the resources in AWS to recreate them with CDK).
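Something along these lines could be a starting point - a rough CDK v2 (Python) sketch of the pieces described above, where the bucket, image path, Lambda code path, and task sizing are placeholders rather than the actual resources:
```python
# Rough CDK v2 (Python) sketch; names, paths, and sizing are placeholders.
from aws_cdk import Stack, aws_ec2 as ec2, aws_ecs as ecs, aws_s3 as s3
from aws_cdk import aws_lambda as _lambda, aws_s3_notifications as s3n
from constructs import Construct

class HrslCogStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "Vpc", max_azs=2)
        cluster = ecs.Cluster(self, "Cluster", vpc=vpc)

        task_def = ecs.FargateTaskDefinition(
            self, "CogTask", cpu=4096, memory_limit_mib=30720
        )
        task_def.add_container(
            "cog-generation",
            image=ecs.ContainerImage.from_asset("./docker"),  # placeholder path
            logging=ecs.LogDrivers.aws_logs(stream_prefix="hrsl-cog"),
        )

        # Test bucket standing in for dataforgood-fb-data, where we cannot
        # manage event notifications ourselves.
        bucket = s3.Bucket(self, "TestBucket")
        trigger = _lambda.Function(
            self, "Trigger",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="handler.handler",
            code=_lambda.Code.from_asset("./lambda"),  # placeholder path
        )
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED, s3n.LambdaDestination(trigger)
        )
        # IAM permissions for the Lambda (ecs:RunTask, iam:PassRole) omitted
        # for brevity.
```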
Do you have the scripts used to create the STAC collection and item metadata for this dataset?
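For reference, creating that metadata with pystac would look roughly like the following - a minimal sketch where the IDs, extents, license, datetime, and geometry are placeholders, not the real collection/item metadata:
```python
# Minimal pystac sketch; IDs, extents, license, and datetime are placeholders.
from datetime import datetime, timezone
import pystac

bbox = [-180.0, -90.0, 180.0, 90.0]
geometry = {
    "type": "Polygon",
    "coordinates": [[[-180.0, -90.0], [180.0, -90.0], [180.0, 90.0],
                     [-180.0, 90.0], [-180.0, -90.0]]],
}

collection = pystac.Collection(
    id="facebook-population-density",
    description="Facebook high-resolution population density estimates.",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([bbox]),
        temporal=pystac.TemporalExtent(
            [[datetime(2015, 1, 1, tzinfo=timezone.utc), None]]
        ),
    ),
    license="CC-BY-4.0",  # placeholder; confirm the actual license
)

item = pystac.Item(
    id="hrsl_general-latest",
    geometry=geometry,
    bbox=bbox,
    datetime=datetime.now(timezone.utc),
    properties={},
)
item.add_asset(
    "cog",
    pystac.Asset(
        href="s3://covid-eo-data/dataforgood-fb-population-density/hrsl_general_latest_global_cog.tif",
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)
collection.add_item(item)
```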
@slesaad @abarciauskas-bgse is this already automated? aka can we close this ticket?
@aboydnw it is not, this was shelved to focus on more important tasks
Updated the initial comment with AC on this ticket, but we should add more as needed
Lower priority than other tasks like automating ingestion of GHG and/or Fire data
Stale
Description
Many datasets will be accumulating data on an ongoing basis. This ticket is intended to automate the production of the Facebook population data, and should serve as a reference implementation so that other datasets can be automated easily in the future.
Acceptance Criteria
Checklist under previous ticket (old)
For each dataset, we will follow these steps:
- s3://covid-eo-data/dataforgood-fb-population-density/hrsl_general_latest_global_cog.tif
- The current dataset has this configuration file: https://github.com/NASA-IMPACT/covid-api/blob/develop/covid_api/db/static/datasets/fb-population-density.json