IBM / lakevision

Lakevision is a tool which provides insights into your Apache Iceberg based Data Lakehouse.

Visualisation of Data Freshness #12

Open therealslimjp opened 3 weeks ago

therealslimjp commented 3 weeks ago

I think it would be useful to have a tab that visualizes freshness-related statistics, such as graphs of ingestions over the last x days, table size trends, etc.

What do you think? I could start working on it.

If you have other ideas related to this, leave a comment.
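As a rough illustration of the "ingestions over the last x days" graph, here is a minimal sketch of how the data behind it could be aggregated. It assumes snapshots are available as plain dicts shaped like the entries in Iceberg table metadata (`timestamp-ms` plus a `summary` with `operation` and `added-records`). The function name `daily_ingestion`, the helper `ms`, and the sample values are hypothetical, and counting only `append` snapshots is a simplifying assumption so that rewrites are not counted as new data.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def ms(year, month, day, hour=0):
    """Sample-data helper: UTC wall-clock time -> epoch milliseconds."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp() * 1000)


def daily_ingestion(snapshots, days, now):
    """Sum records added by append snapshots per UTC day, within the last `days` days."""
    cutoff = now - timedelta(days=days)
    totals = defaultdict(int)
    for snap in snapshots:
        ts = datetime.fromtimestamp(snap["timestamp-ms"] / 1000, tz=timezone.utc)
        if ts < cutoff or ts > now:
            continue
        summary = snap.get("summary", {})
        # Assumption: only "append" counts as ingestion; rewrites are skipped.
        if summary.get("operation") != "append":
            continue
        # Summary values are stored as strings in the metadata.
        totals[ts.date().isoformat()] += int(summary.get("added-records", 0))
    return dict(sorted(totals.items()))


# Made-up sample snapshots, for illustration only.
snapshots = [
    {"timestamp-ms": ms(2024, 8, 20, 9), "summary": {"operation": "append", "added-records": "500"}},
    {"timestamp-ms": ms(2024, 8, 25, 9), "summary": {"operation": "append", "added-records": "1000"}},
    {"timestamp-ms": ms(2024, 8, 25, 18), "summary": {"operation": "append", "added-records": "250"}},
    {"timestamp-ms": ms(2024, 8, 26, 9), "summary": {"operation": "replace", "added-records": "1250"}},
    {"timestamp-ms": ms(2024, 8, 27, 9), "summary": {"operation": "append", "added-records": "300"}},
]

now = datetime(2024, 8, 27, 22, tzinfo=timezone.utc)
result = daily_ingestion(snapshots, days=3, now=now)
print(result)  # {'2024-08-25': 1250, '2024-08-27': 300}
```

The resulting date-to-count mapping could be fed directly into a bar or line chart; days without appends are simply absent and would need to be zero-filled if a continuous axis is wanted.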

rakeshJn commented 3 weeks ago

Sure, exactly what I have in mind @therealslimjp.

The only thing I want people's opinion on is: should we stick with Streamlit or move to some other framework like Flask? Let me open an issue on that and we can discuss it there.

therealslimjp commented 3 weeks ago

Screenshot 2024-08-27 at 21 52 43

Just wrote a quick mockup. What do you think of something like this?

therealslimjp commented 3 weeks ago

Also @rakeshJn, I need permissions to create branches. Can you grant them?

juancappi commented 2 weeks ago

Hi @therealslimjp, I think the idea looks great. I'd probably also add an equivalent metric for size (i.e. in GB). Record count is great, but not always indicative enough. Maybe add another dropdown for records/size?
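To make the records/size dropdown idea concrete, here is a small sketch under the same assumption that snapshots are available as dicts shaped like Iceberg metadata entries, with `total-records` and `total-files-size` in the summary. The names `METRICS` and `size_trend` are hypothetical, the sample numbers are made up, and GB is taken as 10^9 bytes.

```python
from datetime import datetime, timezone

# Maps a dropdown choice to (snapshot summary key, divisor for display units).
# Assumption: 1 GB = 1e9 bytes.
METRICS = {
    "records": ("total-records", 1),
    "size_gb": ("total-files-size", 1e9),
}


def size_trend(snapshots, metric):
    """Return chronologically ordered (timestamp, value) points for the chosen metric."""
    key, divisor = METRICS[metric]
    points = []
    for snap in sorted(snapshots, key=lambda s: s["timestamp-ms"]):
        raw = snap.get("summary", {}).get(key)
        if raw is None:
            # Summary fields are optional, so skip snapshots that lack this one.
            continue
        ts = datetime.fromtimestamp(snap["timestamp-ms"] / 1000, tz=timezone.utc)
        points.append((ts, int(raw) / divisor))
    return points


# Made-up snapshots, one day apart and deliberately out of order.
snapshots = [
    {"timestamp-ms": 1_724_086_400_000,
     "summary": {"total-records": "2500", "total-files-size": "5500000000"}},
    {"timestamp-ms": 1_724_000_000_000,
     "summary": {"total-records": "1000", "total-files-size": "2000000000"}},
    {"timestamp-ms": 1_724_172_800_000,
     "summary": {"total-records": "2500", "total-files-size": "4000000000"}},
]

record_values = [value for _, value in size_trend(snapshots, "records")]
size_values = [value for _, value in size_trend(snapshots, "size_gb")]
print(record_values)  # [1000.0, 2500.0, 2500.0]
print(size_values)    # [2.0, 5.5, 4.0]
```

In the sample, the last snapshot keeps the record count but shrinks the size, which is the kind of case where record count alone would not tell the whole story.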

W.r.t. branches, you need to fork the repo, create a feature branch in your fork, and then open a PR from your fork. More details here: https://docs.github.com/en/get-started/exploring-projects-on-github/contributing-to-a-project#making-a-pull-request

rakeshJn commented 2 weeks ago

Looks good to me, and yes, both for record count and size. I wonder how we will keep it accurate if table maintenance happens and old snapshots are removed. We will lose history, won't we?

therealslimjp commented 2 weeks ago

Yep, I thought about that too. I don't think we can do much about it, though. Maybe I'll just add a note that it refers to data ingested since compaction (and maybe even limit the datepicker to the first available snapshot? Not sure yet what's best).
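One possible way to limit the datepicker to the first available snapshot, sketched under the assumption that snapshots are dicts with a `timestamp-ms` field as in Iceberg table metadata. The function names `earliest_snapshot_date` and `clamp_range` and the sample values are hypothetical.

```python
from datetime import date, datetime, timezone


def earliest_snapshot_date(snapshots):
    """UTC date of the oldest retained snapshot, or None if there are none."""
    if not snapshots:
        return None
    first_ms = min(s["timestamp-ms"] for s in snapshots)
    return datetime.fromtimestamp(first_ms / 1000, tz=timezone.utc).date()


def clamp_range(start, end, earliest):
    """Clamp a requested date range to the retained history.

    Returns (start, end, truncated), or None if nothing in the range is covered.
    `truncated` is True when the start had to be moved forward, which is the
    case where a "history before this date was expired" note would be shown.
    """
    if earliest is None or end < earliest:
        return None
    if start < earliest:
        return earliest, end, True
    return start, end, False


def ms(year, month, day):
    """Sample-data helper: UTC midnight -> epoch milliseconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


# Made-up snapshots: anything before 2024-08-20 is assumed to have been expired.
snapshots = [{"timestamp-ms": ms(2024, 8, 27)}, {"timestamp-ms": ms(2024, 8, 20)}]

earliest = earliest_snapshot_date(snapshots)
print(earliest)  # 2024-08-20
print(clamp_range(date(2024, 8, 1), date(2024, 8, 27), earliest))
# (datetime.date(2024, 8, 20), datetime.date(2024, 8, 27), True)
```

If the UI stays on Streamlit, the earliest date could be passed as `min_value` to `st.date_input`, so that users cannot pick a start date with no retained history.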