datonic / datadex

📦 Serverless and local-first Open Data Platform
http://datadex.datonic.io
MIT License
220 stars 14 forks

[question] Best practice to start a new project with datadex #49

Closed fredguth closed 3 months ago

fredguth commented 4 months ago

Gitcoin, Arbitrum, and Filecoin Data Portals were built with Datadex, and I can use them as examples.

But I was wondering how best to translate that setup to my use case.

I have different teams, each with its own domain expertise, working on different data products (one team may specialize in budgets and the laws around municipal, state, and federal spending; another team may specialize in costs). What they all share is inexperience with git. Each team is ultimately responsible for its own data products.

In this scenario, I thought a repo per product (using a repo template), or at least one per team, would make more sense than one giant repo. Building each product in its own folder is enough to solve the Data Portal side with Quarto, but I am confused about how Dagster is meant to be used.

I am totally new to Dagster, and I saw that the /datadex folder (/fdp in Filecoin) has Dagster definitions that seem to have been built manually for each project (not generated). Eventually, though, I will want to deploy all my projects to the same production server, where I imagine I will run a single Dagster instance, right?

davidgasquez commented 4 months ago

Hey! This is a very interesting question with, unfortunately, no easy answer; it all depends on the specifics of your situation.

Without knowing many of the specifics of your use case, I'd say a great start might be having everything in the same repo and splitting later on. That way the portals evolve at the same time and there is opportunity for learning.

You can think of Dagster as the extraction logic side. You could have one Dagster project per data product, or one big project covering all the data products that uses Dagster-specific features like asset groups or code locations.
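
For what it's worth, here is a minimal sketch of the single-project option, assuming one Dagster code base where each team tags its assets with a `group_name` (the asset and group names below are made up, not taken from Datadex):

```python
# Hypothetical single Dagster project: assets from two teams kept apart
# with asset groups instead of separate repositories.
from dagster import Definitions, asset


@asset(group_name="budgets")
def municipal_spending():
    # Placeholder extraction logic for the budgets team.
    return [{"year": 2023, "amount": 100}]


@asset(group_name="costs")
def unit_costs():
    # Placeholder extraction logic for the costs team.
    return [{"item": "paper", "cost": 2.5}]


# One Definitions object serves the whole deployment. If teams later split
# into their own repos, each repo can expose its own Definitions and be
# loaded as a separate code location by the same Dagster instance.
defs = Definitions(assets=[municipal_spending, unit_costs])
```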

fredguth commented 3 months ago

At least for now, it seems simpler to keep everything in one repo.

Have you seen Mage-ai? It is similar to Dagster, but has its own "Jupyter-like" interface for coding.

Another thing I have noticed is that dbt Core does not generate a web page with the test results, at least not that I've seen. I was thinking of something like this: https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:bigquery,calm-pagoda-323403.jaffle_shop.customers,PROD)/Validation/Assertions

The problem with using DataHub is that it is yet another Docker deployment with lots of things I really don't need. I just wanted a place to find the data products together with their documentation (schema, lineage, "contract", tests, etc.).
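
Maybe something as simple as the sketch below would be enough, assuming `dbt test` (or `dbt build`) has already written `target/run_results.json`; the output file name is just an example:

```python
# Render dbt test outcomes from target/run_results.json into a static
# Markdown table that Quarto (or Evidence) could publish. Illustrative only.
import json
from pathlib import Path

results = json.loads(Path("target/run_results.json").read_text())

rows = ["| Test | Status |", "| --- | --- |"]
for result in results["results"]:
    # Test nodes have unique_ids that start with "test."
    if result["unique_id"].startswith("test."):
        rows.append(f"| {result['unique_id']} | {result['status']} |")

Path("test_results.qmd").write_text("# dbt test results\n\n" + "\n".join(rows) + "\n")
```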

fredguth commented 3 months ago

Maybe the best solution for the Data Catalog is building it statically, either with Quarto or something like Evidence.
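
A minimal sketch of that idea, assuming `dbt docs generate` has produced `target/manifest.json` and that the pages go into a hypothetical `catalog/` folder rendered by Quarto:

```python
# Build one Quarto page per dbt model from the dbt manifest,
# giving a static catalog of schemas and descriptions without running DataHub.
import json
from pathlib import Path

manifest = json.loads(Path("target/manifest.json").read_text())
out_dir = Path("catalog")
out_dir.mkdir(exist_ok=True)

for node in manifest["nodes"].values():
    if node["resource_type"] != "model":
        continue
    page = [
        f"# {node['name']}",
        "",
        node.get("description") or "_No description._",
        "",
        "| Column | Description |",
        "| --- | --- |",
    ]
    for column in node.get("columns", {}).values():
        page.append(f"| {column['name']} | {column.get('description', '')} |")
    (out_dir / f"{node['name']}.qmd").write_text("\n".join(page) + "\n")
```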

davidgasquez commented 3 months ago

Yes! That is what I've been doing in the Gitcoin Data Portal. Not the most beautiful way to do it but easy enough!