Open jaanli opened 6 months ago
Thanks for creating this issue, @jaanli! This sounds very interesting, and please do keep us in the loop on your progress.
I'd say pass on patch-level embeddings. As I explain in #223, I think they are fundamentally skewed by the context in ways that make them less valuable in most cases than chip-level embeddings.
The good news, if you wait a couple of weeks, is that Clay v1 can create embeddings at any chip size. Keep an eye out for the v1 release.
I am a bit late to comment on this, @jaanli. We did not assess spatial autocorrelation in any of the use cases, and we did not get that deep into the downstream applications. The issue does resonate with me, though: I used spatial econometrics in my PhD thesis.
We recently released the v1 version of the model, which works well with Sentinel-2 (10m resolution) and also NAIP data (sub-meter resolution, available for all of the US). The new version can also handle smaller input chips (with fewer pixels), so the patch-level analysis is no longer necessary (it has strong limitations anyway, as @brunosan pointed out).
Have a look at the following tutorial on how to use the new version of the model. Happy to help if you hit any roadblocks.
https://clay-foundation.github.io/model/clay-v1-wall-to-wall.html
Has anyone assessed the spatial autocorrelation error of Clay vs standard models in downstream prediction/fine-tuning tasks?
Here's an example assessment vs Bayesian models: https://www.mdpi.com/1660-4601/18/13/6856
I'm considering generating embeddings at the patch level and using these to classify tree cover based on this tutorial:
https://clay-foundation.github.io/model/tutorial_digital_earth_pacific_patch_level.html
The tree dataset is here: https://tree-map.nycgovparks.org/
If there is a more appropriate starting point, let me know!
Use case context if it's helpful:
I've been working on health equity metrics at the neighborhood level, and think Clay could be a good fit for applying this framework: https://treesasinfrastructure.com/
To this data: https://jaanli.github.io/new-york-real-estate/
Linked to these demographics that have spatial components: https://jaanli.github.io/american-community-survey/new-york-area/income-by-race
Where every Census Bureau-defined "microdata area" is linked to health outcomes computed from claims datasets such as: https://onefact.github.io/synthetic-healthcare-data/
The hardest part here will be the error analysis: checking how spatially autocorrelated this deep model's errors are compared to those of conventional models like logistic regression. Moran plots are helpful debugging tools (https://connordonegan.github.io/geostan/articles/spatial-me-models.html).
(Before using downstream fine-tuning predictions for resource allocation and public health use cases, I need to carefully benchmark against the byzantine Census Bureau methods, spatial lag methods, etc.)
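
For concreteness, here is a minimal sketch of the residual check I have in mind: compute Moran's I and the spatial lag (the y-axis of a Moran plot) on a model's residuals. Everything in it is an assumption for illustration: the toy 6x6 grid stands in for neighborhood centroids, the k-nearest-neighbor weights with `k=4` stand in for whatever contiguity weights the real microdata areas would use, and the two residual patterns are synthetic, not output from Clay or any other model. The helper names (`knn_weights`, `morans_i`, `spatial_lag`) are hypothetical, not from any library.

```python
from math import hypot


def knn_weights(coords, k=4):
    """Row-standardised k-nearest-neighbour weights as a dense n x n list."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        # Rank every other point by distance to point i and keep the k closest.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: hypot(coords[j][0] - xi, coords[j][1] - yi),
        )
        for j in others[:k]:
            weights[i][j] = 1.0 / k
    return weights


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i ** 2)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(zi * zi for zi in z)
    return (n / s0) * (num / den)


def spatial_lag(values, weights):
    """Weighted mean of each point's neighbours (plot against values for a Moran plot)."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


# Toy data (assumed): a 6x6 grid of "neighborhood centroids".
coords = [(x, y) for x in range(6) for y in range(6)]
w = knn_weights(coords, k=4)

# Residuals that drift smoothly across space: a model missing a spatial trend.
gradient_residuals = [float(x) for x, _ in coords]
# Residuals that alternate like a checkerboard: neighbours disagree.
checker_residuals = [1.0 if (x + y) % 2 == 0 else -1.0 for x, y in coords]

print("Moran's I, smooth residuals:      ", round(morans_i(gradient_residuals, w), 3))
print("Moran's I, checkerboard residuals:", round(morans_i(checker_residuals, w), 3))
```

Running the same function on the residuals of the embedding-based classifier and of a logistic regression baseline, with identical weights, would give a like-for-like comparison: values near zero suggest little leftover spatial structure, while clearly positive values suggest the model is missing something spatial. For real work a maintained implementation (for example PySAL's `esda` package, or geostan as linked above) would be the safer choice, since it also provides permutation-based significance tests.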