davidgasquez / gitcoin-grants-data-portal

🌲 Open source, serverless, and local-first data hub for Gitcoin Grants data!
https://grantsdataportal.xyz/
MIT License
26 stars 3 forks

Make pipeline incremental #28

Open davidgasquez opened 8 months ago

davidgasquez commented 8 months ago

The main idea is to rely on the latest portal data and run smaller incremental updates on CI. We should provide a --full-refresh flag, à la dbt, to rebuild the data from scratch.

This is a big one!
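
A rough sketch of how the --full-refresh switch could be wired up, using only the standard library. The names `build_parser` and `select_mode` and the `previous_data_available` parameter are hypothetical and not part of the portal's code; the only thing taken from this thread is the flag itself.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI with a dbt-style --full-refresh switch (hypothetical entry point)."""
    parser = argparse.ArgumentParser(description="Run the data pipeline.")
    parser.add_argument(
        "--full-refresh",
        action="store_true",
        help="Ignore the latest portal data and rebuild everything from scratch.",
    )
    return parser


def select_mode(argv: list[str], previous_data_available: bool) -> str:
    """Return "full" or "incremental" for this run.

    Falls back to a full build when there is no previous data to build on,
    which is an assumption about the desired behaviour.
    """
    args = build_parser().parse_args(argv)
    if args.full_refresh or not previous_data_available:
        return "full"
    return "incremental"
```

With this shape, CI would call `select_mode([], previous_data_available=True)` for the default incremental run, and a manual rebuild would pass `["--full-refresh"]`.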

davidgasquez commented 8 months ago

The ideal approach I can think of would be to rely on Dagster partitions and sensors.

  1. Read the data from IPFS (or the GitHub Actions cache!)
  2. Run Dagster sensors to check which partitions are missing.
  3. Run code for missing partitions and rematerialize datasets.
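
The three steps above boil down to a set difference between the partitions we expect and the ones already materialized. Below is a plain-Python stand-in for that logic; it deliberately does not use the Dagster API. Daily partition keys and the `compute` callback are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Callable, Iterable


def expected_partitions(start: date, end: date) -> list[str]:
    """Daily partition keys from start to end, inclusive (assumed scheme)."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def missing_partitions(expected: Iterable[str], materialized: Iterable[str]) -> list[str]:
    """Keys that are expected but absent from the previous run, in order."""
    done = set(materialized)
    return [key for key in expected if key not in done]


def run_incremental(
    expected: Iterable[str],
    materialized: Iterable[str],
    compute: Callable[[str], object],
) -> dict[str, object]:
    """Call compute() only for the missing partitions and collect the results."""
    return {key: compute(key) for key in missing_partitions(expected, materialized)}
```

In a Dagster setup, a sensor would presumably play the role of `missing_partitions` and request runs for the returned keys.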

Perhaps there is a much easier approach we can use while we figure out all the Dagster stuff.

davidgasquez commented 7 months ago

Thinking about relying on external assets: treat the previous run's outputs as external assets and compute the diff using sensors?

davidgasquez commented 7 months ago

We could also attach to the previous database and use it as the current state. Run sensors and then the remaining partitions.
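
A minimal sketch of the attach idea, using SQLite from the standard library as a stand-in for whatever engine the portal's database uses. The `votes` table, its columns, and the helper names are hypothetical; the point is only to show the previous database seeding the current state so that just the unseen rows get loaded.

```python
import sqlite3
import tempfile
from pathlib import Path


def load_new_rows(current_db: str, previous_db: str, source_rows: list[tuple]) -> int:
    """Seed the current database from the previous one, then insert only unseen rows."""
    con = sqlite3.connect(current_db)
    try:
        # Expose the previous build under the "previous" schema.
        con.execute("ATTACH DATABASE ? AS previous", (previous_db,))
        con.execute(
            "CREATE TABLE IF NOT EXISTS main.votes (id INTEGER PRIMARY KEY, amount REAL)"
        )
        # Use the previous run as the current state.
        con.execute(
            "INSERT OR IGNORE INTO main.votes SELECT id, amount FROM previous.votes"
        )
        known = {row[0] for row in con.execute("SELECT id FROM main.votes")}
        new_rows = [row for row in source_rows if row[0] not in known]
        con.executemany("INSERT INTO main.votes (id, amount) VALUES (?, ?)", new_rows)
        con.commit()
        return len(new_rows)
    finally:
        con.close()


def demo() -> tuple[int, int]:
    """Build a tiny previous database, then run one incremental load against it."""
    tmp = Path(tempfile.mkdtemp())
    previous, current = str(tmp / "previous.db"), str(tmp / "current.db")
    con = sqlite3.connect(previous)
    con.execute("CREATE TABLE votes (id INTEGER PRIMARY KEY, amount REAL)")
    con.executemany("INSERT INTO votes VALUES (?, ?)", [(1, 1.0), (2, 2.0)])
    con.commit()
    con.close()
    inserted = load_new_rows(current, previous, [(2, 2.0), (3, 5.0)])
    con = sqlite3.connect(current)
    total = con.execute("SELECT COUNT(*) FROM votes").fetchone()[0]
    con.close()
    return inserted, total
```

Here only row 3 is new, so the incremental load touches a single row while the current database still ends up with the full table.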