davidgasquez / gitcoin-grants-data-portal

🌲 Open source, serverless, and local-first data hub for Gitcoin Grants data!
https://grantsdataportal.xyz/
MIT License
26 stars 3 forks source link

Consider storing `metadata` columns as well-formed JSON? #3

Closed DistributedDoge closed 8 months ago

DistributedDoge commented 8 months ago

Indexer has several objects (projects, rounds, applications) with metadata served as well-formed, nested json.

      "metadata": {
            "signature": "0xacb3be5c327477a5[...]",
            "application": {
                "round": "0xddc627acc685c2a3fa67bc311a5318d1ae2ce899",
...

When inspecting database tables raw_projects, raw_rounds, raw_round_applications from the latest release (dbt.duckdb file) somewhere along the pipeline metadata (varchar) column in each of those tables was transformed into pythonic format that can no longer satisfy JSON parser, but needs to be passed to python ast.literal_eval():

{'title': 'FTM Test Project', 'description': "Just a description here 🫣 don't mind me.  "...

I wanted to be able to run a simple Duckdb query to extract interesting fields like one below. This won't work as-is, because single quotes break JSON parser:

select json_extract_string(metadata,'$.title') from public.raw_projects;

To achieve that quickly I modified asset generation step as follows (ggdp/assets.py). Goal is to ensure that Pandas dataframe generated by dagster contains JSON string instead of a collection of Python objects. This seems to be working on my fork. For future reference I should probably just replace metadata column instead of duplicating it.

@asset
def raw_projects() -> pd.DataFrame:
    projects = chain_file_aggregator("projects.json")
    projects['json_metadata'] = projects['metadata'].apply(json.dumps)
    return projects

I am writing this to suggest that upstream could also benefit from having metadata for raw_ tables in JSON format.

davidgasquez commented 8 months ago

Thanks for opening the issue @DistributedDoge!

As I haven't used the metadata fields, I didn't realize they were being treated as Python objects.

I'll update the ingestion code to make everything proper JSON!

Meanwhile, do you mind sending a PR upstream with your changes? No pressure as I was thinking on redoing the Python assets soon.

davidgasquez commented 8 months ago

Closed by https://github.com/davidgasquez/gitcoin-grants-data-portal/pull/4.