Closed DaimonPl closed 3 years ago
We'll implement this in the scope of the new UI that we plan to release in ver 0.7
Speaking about the performance, can you share some stats of your lineage data? E.g.
Understanding the use case and lineage patterns would help us to optimize the persistence, queries and find more effective scalability models.
Thanks a lot.
@wajda regarding internal data metrics, I could run some queries on ArangoDB if you have them - currently I have no idea about the internals, so it's hard for me to answer those questions
Currently I have enabled Spline for 1 project with daily data retention (that is, Spline and ArangoDB are cleaned completely every day)
Here's what main lineage overview graph looks like
For that graph, 'lineage-overview' returns 28.7kB in 3.5 seconds - it's not super long but already noticeable - especially without a "loading" indicator. I'm on VPN from home though, so this might add some delay too.
It's a single project; some of the input data sources may be a similar size, but they are not yet enabled with Spline
The biggest 'lineage-detailed' (after clicking on the graph) returns 265kB in 0.7 seconds. 265kB is really a lot, especially since only a list of input/output URIs is displayed. It looks like the endpoints could somehow be specialized to return less data for such cases
Wow, that's awesome! :) That would be really helpful to get more precise statistics.
Detailed lineage: Yes, there is room for JSON size optimization. But since it's gzipped, it shouldn't grow that fast, so it's not that bad, I would say. What can blow this JSON up, however, is crazy-wide datasets (thousands of columns of complex types). That's what requires some extra thought, IMO.
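As a rough, self-contained illustration (not Spline's actual payload or code), the highly repetitive structure typical of such schema-heavy JSON responses compresses very well with gzip, which is why transfer size grows much more slowly than the raw JSON size:

```python
import gzip
import json

# Hypothetical payload mimicking a wide dataset schema: many attributes
# with near-identical structure, as in a lineage-detailed response.
attributes = [
    {"id": i, "name": f"col_{i}", "dataType": "struct<a:int,b:string>"}
    for i in range(1000)
]
raw = json.dumps({"attributes": attributes}).encode("utf-8")
compressed = gzip.compress(raw)

# Repetitive JSON typically compresses to a small fraction of its raw size.
print(f"raw={len(raw)}B gzipped={len(compressed)}B "
      f"ratio={len(compressed) / len(raw):.2f}")
```

This is only a sketch under assumed field names; truly wide datasets with deeply nested, non-repetitive types would compress less effectively.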
Lineage overview: The biggest challenge with that one is the graph traversal, especially if there are many appends to the data sources. To display a correct lineage we select all visible reads from every write perspective and traverse each route recursively, and that is expensive. In the future we'll add asynchronous pre-linking in the background to move some work away from user request time. Another precaution we have implemented against a possible combinatorial explosion is a max graph depth threshold. Currently, depth == 10 is hardcoded in the UI (meaning 10 dependent jobs in a line). In UI v0.5.2 we added a button to increase this depth on demand when the threshold is reached. We also plan to implement a more sensible graph navigation mechanism in future versions.
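The depth-limited traversal described above can be sketched as follows. This is a minimal illustration, not Spline's actual implementation: the adjacency map, function names, and depth constant are all assumptions for the example.

```python
# Hypothetical adjacency: each data source URI maps to the list of source
# URIs that the job producing it reads from.
MAX_DEPTH = 10  # corresponds to the UI's hardcoded depth threshold


def collect_lineage(target, reads_of, max_depth=MAX_DEPTH):
    """Recursively walk upstream from `target`, stopping at max_depth."""
    visited = set()

    def walk(ds, depth):
        if depth > max_depth or ds in visited:
            return  # threshold reached or route already traversed
        visited.add(ds)
        for src in reads_of.get(ds, []):
            walk(src, depth + 1)

    walk(target, 0)
    return visited


reads_of = {"c": ["b"], "b": ["a"]}
print(sorted(collect_lineage("c", reads_of)))  # ['a', 'b', 'c']
```

The `visited` set prevents re-traversing shared upstream routes, and the depth cap bounds the work even when long chains of dependent jobs exist.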
I'll send you the queries, thanks! Just out of curiosity, is it some sort of ML data pipeline? I see cyclic dependencies between jobs, so I wonder.
It's not pure ML, but it is a somewhat more complex data processing pipeline. There are no thousands of columns here, for sure :)
There should be no cyclic dependencies. I think there are 3 problems which make the graph hard to read:
For the graph view itself I've created https://github.com/AbsaOSS/spline/issues/718 to make it clearer
@DaimonPl can you run this AQL on your biggest DB and share the result?
RETURN {
"operations" : LENGTH(operation),
"dataSource" : LENGTH(dataSource),
"exec-plans" : LENGTH(executionPlan),
"exec-events" : LENGTH(progress),
"appends" : LENGTH(operation[* FILTER CURRENT._type == "Write" AND CURRENT.append]),
"unique-apps" : LENGTH(UNIQUE(executionPlan[*].extra.appName)),
"top-io-per-ds" : (
LET pairs = (
FOR ds IN dataSource
FOR op IN 1 INBOUND ds writesTo, readsFrom
COLLECT t = op._type == "Read" ? "reads"
: op.append ? "appends"
: "overwites",
dsId = ds._key WITH COUNT INTO c
SORT c DESC
RETURN [t, c]
)
FOR p IN pairs
COLLECT t = p[0] INTO g
RETURN [t, g[* LIMIT 20].p[1]]
)
}
Sure, but I'll only be able to do it next Monday (holidays :) and no access to the company network :) )
Sure. Happy holidays :)
And this one as well please:
RETURN {
"top-longest-observed-writes-seqs" : (
FOR p IN progress
LET cnt = LENGTH(SPLINE::OBSERVED_WRITES_BY_READ(p))
FILTER cnt > 0
SORT cnt DESC
LIMIT 20
RETURN cnt
)
}
@wajda I'll gather the stats tomorrow. Today we'll try to enable additional projects with Spline; this should give better stats
Ok, so here are stats for two bigger processing projects and several medium/smaller ones. It's still only part of the entire data processing, since Spline is not yet enabled everywhere (the included projects also have data source dependencies which themselves may have quite big graphs, but they are not yet in Spline)
Here are the results of the queries (DB still from Spline 0.5.1):
[
  {
    "operations": 36435,
    "dataSource": 362,
    "exec-plans": 412,
    "exec-events": 412,
    "appends": 0,
    "unique-apps": 131,
    "top-io-per-ds": [
      ["overwrites", [12, 12, 12, 12, 11, 11, 11, 10, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2]],
      ["reads", [1836, 169, 117, 108, 83, 68, 50, 47, 47, 46, 44, 42, 36, 34, 34, 34, 34, 32, 29, 28]]
    ],
    "top-longest-observed-writes-seqs": [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
  }
]
Bigger lineages may take several seconds to load.
It would be good if the UI would:
This applies to both the whole graph and the details.