Open lalithsuresh opened 6 months ago
The way to do this now would be to add the count stream along with neighborhood and sample streams to output handles in the catalog. This way the user of the UI will be able to do /egress?query=count, similar to how we can do /egress?query=neighborhood.
I am suggesting a single view with information for all outputs and inputs; two columns: table name, count. It is rather tedious to browse a view for a single row.
The compiler could generate a single view like this for relation sizes.
create view system__relation_sizes as
(select 't1', count(*) from t1)
union
(select 't2', count(*) from t2) .....
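To make the shape of such a generated view concrete, here is a rough Python sketch that builds the statement from a list of relation names and checks it against an in-memory SQLite database. This is not the compiler's code: the function name `relation_sizes_view`, the column names `relation_name` and `row_count`, and the use of SQLite are all assumptions made for illustration. It also uses UNION ALL rather than UNION, since each branch already yields a distinct name.

```python
import sqlite3


def relation_sizes_view(relations, view_name="system__relation_sizes"):
    """Build a CREATE VIEW statement with one (name, count) row per relation.

    Hypothetical helper: names and output columns are illustrative only.
    """
    if not relations:
        raise ValueError("at least one relation is required")
    branches = []
    for name in relations:
        label = name.replace("'", "''")   # escape for use as a string literal
        ident = name.replace('"', '""')   # escape for use as a quoted identifier
        branches.append(
            f"SELECT '{label}' AS relation_name, COUNT(*) AS row_count "
            f'FROM "{ident}"'
        )
    return f"CREATE VIEW {view_name} AS\n" + "\nUNION ALL\n".join(branches)


def demo():
    """Create two toy tables, install the view, and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (x INTEGER)")
    conn.execute("CREATE TABLE t2 (x INTEGER)")
    conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,)])
    conn.execute(relation_sizes_view(["t1", "t2"]))
    rows = conn.execute("SELECT * FROM system__relation_sizes").fetchall()
    conn.close()
    return sorted(rows)


if __name__ == "__main__":
    print(relation_sizes_view(["t1", "t2"]))
    print(demo())
```

The resulting view has one row per relation, so a single query returns every size at once instead of one single-row view per table.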
Should I put this in the next milestone? It should be controlled by a compiler flag.
Yes, go ahead and add it to the milestone please.
I don't think this is the right way to implement this. Adding extra views will introduce a lot of extra circuitry and slow down compilation. Plus there's an extra config option that needs to be somehow propagated through the API and UI.
A better solution would be to have the catalog always insert the count operator, and then we can expose it through the API and UI, like we do with neighborhoods.
I thought we were moving more code into the compiler from the runtime, not the other way around.
The catalog logic should move into the compiler, yes.
So what do I do for the next milestone?
For this issue, probably nothing.
With long-running pipelines, it's currently hard to tell why memory usage is going up. It'd be good to be able to tell how big each relation is (both tables and views), and surface this info through the APIs and the UI.
@mihaibudiu mentions that we can incrementally track sizes by inserting count(*) queries from the compiler, or track this information in system tables (which we don't yet have). I suggest we start with the former for now and switch to system tables when we have support for them.
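As a rough illustration of what "incrementally track sizes" could mean, the Python sketch below keeps a running count per relation and updates it from batches of weighted changes (+1 for an insert, -1 for a delete) instead of rescanning the relation. It is only a sketch under assumptions: the class `RelationSizeTracker` and its methods are hypothetical and do not correspond to any existing API in the project, and the count includes multiplicities.

```python
from collections import defaultdict


class RelationSizeTracker:
    """Hypothetical tracker: maintains row counts from weighted deltas."""

    def __init__(self):
        self._sizes = defaultdict(int)

    def apply_delta(self, relation, delta):
        """Apply a batch of (row, weight) changes and return the new size.

        A weight of +1 stands for an insertion, -1 for a deletion.
        Only the weights are summed; the rows themselves are never scanned.
        """
        change = sum(weight for _row, weight in delta)
        self._sizes[relation] += change
        return self._sizes[relation]

    def sizes(self):
        """Return a snapshot mapping relation name to current row count."""
        return dict(self._sizes)


if __name__ == "__main__":
    tracker = RelationSizeTracker()
    tracker.apply_delta("t1", [(("a",), 1), (("b",), 1)])
    tracker.apply_delta("t1", [(("a",), -1)])
    tracker.apply_delta("v1", [(("x",), 1)])
    print(tracker.sizes())
```

The cost of each update is proportional to the size of the change batch, not to the size of the relation, which is the property that matters for long-running pipelines.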