Closed by WinPlay02 6 months ago
| Descriptor | Linter | Files | Fixed | Errors | Elapsed time |
|---|---|---|---|---|---|
| ✅ PYTHON | black | 2 | 0 | 0 | 0.78s |
| ✅ PYTHON | mypy | 2 | | 0 | 2.24s |
| ✅ PYTHON | ruff | 2 | 0 | 0 | 0.03s |
| ✅ REPOSITORY | git_diff | yes | | no | 0.02s |
See detailed report in MegaLinter reports.
_Set `VALIDATE_ALL_CODEBASE: true` in `mega-linter.yml` to validate all sources, not only the diff._
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 100.00%. Comparing base (`50d831f`) to head (`fa53a52`).
:umbrella: View full report in Codecov by Sentry.
The maximum number of pipeline processes is now set to the number of available CPU cores.
We probably don't need that many:
Most of the time, the processes will idle. As long as the runner is local, 2 workers should be enough. I don't think the user would fire off pipeline runs without waiting for completion of the previous one.
> Most of the time, the processes will idle. As long as the runner is local, 2 workers should be enough. I don't think the user would fire off pipeline runs without waiting for completion of the previous one.
I see, that's a lot of wasted memory. But two processes are easily saturated if two tables are opened in the EDA view (waiting for stats). A normal execution would then only be queued and would have to wait for any running pipelines to complete first. The statistical analysis of the EDA view does another pipeline execution after the table has been fetched, according to my observations. I'd say 4 would probably be a good middle ground to limit the processes to.
Also, coverage seems broken again. It doesn't seem to like multiprocessing pools very much.
> I'd say 4 would probably be a good middle ground to limit the processes to.
Sure, seems good to me. Can easily be tweaked later.
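A capped worker count as discussed above could be sketched like this (the variable names are illustrative, not the runner's actual configuration):

```python
import multiprocessing
import os

# Limit the pool to 4 workers, but never more than the machine has
# (os.cpu_count() may return None, hence the fallback to 1).
workers = min(4, os.cpu_count() or 1)

with multiprocessing.Pool(processes=workers) as pool:
    # The pool would accept queued pipeline executions here.
    pass
```

This keeps small machines from over-committing while still allowing two EDA stat collections and a manual run to proceed in parallel on typical hardware.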
> The statistical analysis of the EDA view does another pipeline execution after the table has been fetched, according to my observations.
But then the original worker is already free again, no?
> Also, coverage seems broken again. It doesn't seem to like multiprocessing pools very much.
I would not consider this a blocker for this PR. I'll investigate whether that can be fixed later.
Edit: Maybe this already fixes the issue.
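For reference, coverage.py documents a setting for measuring code running in `multiprocessing` worker processes; whether the linked fix uses it is an assumption on my part:

```ini
# .coveragerc (illustrative) — measure subprocesses started via multiprocessing
[run]
concurrency = multiprocessing
parallel = true
```

With `parallel = true`, each process writes its own data file, which is merged afterwards with `coverage combine`.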
> But then the original worker is already free again, no?
When viewing one table, this is true. But everything would break as soon as two tables are waiting for stats. This might be too cautious, but since gathering the stats for the EDA view takes a bit of time, this could be irritating.
> But everything would break, as soon as two tables are waiting for stats.
Break in what way?
> > But everything would break, as soon as two tables are waiting for stats.
>
> Break in what way?
Maybe "break" is not the right word. Manual executions wait for as long as stats and plots are being calculated. For large tables, this can take some time. As a user, I'd think something broke, since nothing (obvious) is happening anymore.
> > > But everything would break, as soon as two tables are waiting for stats.
> >
> > Break in what way?
>
> Maybe break is not the right description. Manual executions wait for as long as stats and plots are being calculated. For large tables, this can take some time. As a user, I'd think something broke, as nothing (obvious) is happening anymore.
I see, that's an issue for another day (and someone else 😉).
This PR should lead to a modest improvement in startup time, and a small improvement (mostly depending on the content) in runtime.
I don't know what the target is, or how much potential for startup optimization is left, but I was able to get small and medium tables within a few hundred milliseconds after pressing the "Explore" lens.
> This PR should lead to a modest improvement in startup time, and a small improvement (mostly depending on the content) in runtime.
>
> I don't know what the target is, and how much potential for startup-optimization is left, but I was able to get small and medium tables after a few 100ms, after pressing the "Explore" lens.
It already feels a lot better. The last idea I'd have is to add an initializer to the process pool doing something like

```python
def _init():
    from safeds.data.tabular.containers import Table

    Table()
```

Then all processes can immediately display a `Table`.
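Wiring such an initializer into the pool could look like this (a sketch: `json` stands in for the heavy `safeds` import, and the pool arguments are illustrative):

```python
import multiprocessing


def _init():
    # Pay the import cost once per worker at startup instead of on
    # the first task (json stands in for the heavy safeds import).
    import json

    json.dumps({})


if __name__ == "__main__":
    # Each worker runs _init exactly once before taking any tasks.
    with multiprocessing.Pool(processes=2, initializer=_init) as pool:
        print(pool.map(abs, [-3, 1, -2]))
```

The trade-off is slightly slower pool startup in exchange for faster first tasks, which matches the "Explore" latency goal above.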
There seems to be an issue with the memoization of block & expression lambdas now:
```sds
package test

pipeline whoSurvived {
    val titanic = Table
        .fromCsvFile("titanic.csv")
        .removeColumns(["id", "ticket", "cabin", "port_embarked", "fare"]);
    val filtered = titanic.filterRows((row) ->
        row.getValue("age") as Float <= 100
    );
}
```
If you change the `100` and explore `filtered` a couple of times, you eventually stop getting updates.
> There seems to be an issue with the memoization of block & expression lambdas now:
>
> […]
>
> If you change the `100` and explore `filtered` a couple of times, you eventually stop getting updates.
I can't reproduce this right now (but I'm seeing another somewhat related error, `Error during pipeline execution: Document not found`, during exploration).
I suppose this is the problem from above, that exploring causes stats collection to fill up all available pipeline processes, which causes everything after that to stall.
Considering that, I don't know what about the lambdas would be special to break the memoization/runner.
I've started looking into this a little: it seems like `inspect.getsource` itself returns outdated values:
```text
2024-04-21 20:57:34.616 [debug] root:Received Message: {"type":"program","id":"166d653e-e280-4455-946c-374a60f4e2a1","data":{"code":{"demo":{"gen_titanic":"# Imports ----------------------------------------------------------------------\r\n\r\nimport safeds_runner\r\nfrom safeds.data.tabular.containers import Column\r\n\r\n# Pipelines --------------------------------------------------------------------\r\n\r\ndef example3():\r\n column = safeds_runner.memoized_static_call(\"safeds.data.tabular.containers.Column\", lambda *_ : Column('test', data=[1, 2, 3]), ['test', [1, 2, 3]], [])\r\n safeds_runner.save_placeholder('column', column)\r\n def __gen_lambda_0(param1):\r\n return (param1) < (2)\r\n allMatch = safeds_runner.memoized_dynamic_call(\"all\", None, [column, __gen_lambda_0], [])\r\n safeds_runner.save_placeholder('allMatch', allMatch)\r\n","gen_titanic_example3":"from .gen_titanic import example3\r\n\r\nif __name__ == '__main__':\r\n example3()\r\n"}},"main":{"modulepath":"demo","module":"titanic","pipeline":"example3"},"cwd":"c:\\Users\\Lars\\OneDrive\\Desktop\\test"}}
2024-04-21 20:57:34.621 [debug] root:Looking up value for key ('safeds.data.tabular.containers._column.Column.all', (ExplicitIdentityWrapperLazy(_value=Column('test', [1, 2, 3]), memory=SharedMemory('wnsm_78221081', size=4096), id=UUID('ae7dbd79-497e-4ebb-ac2f-6aaf47184952'), hash=504026043636193121), ' def __gen_lambda_0(param1):\n return (param1) < (4)\n'), ())
```
In the program message, the lambda code is

```python
def __gen_lambda_0(param1):
    return (param1) < (2)
```

but in the key it's

```python
def __gen_lambda_0(param1):
    return (param1) < (4)
```
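One plausible mechanism for such stale source (an assumption on my part, not confirmed from the logs) is that `inspect.getsource` serves lines through the `linecache` module, which keeps returning cached file contents until the cache is explicitly invalidated. A minimal demonstration of that caching behavior:

```python
import linecache
import os
import tempfile

# Write a file and let linecache cache its contents.
fd, path = tempfile.mkstemp(suffix=".py")
os.close(fd)
with open(path, "w") as f:
    f.write("def f(x):\n    return (x) < (2)\n")
first = linecache.getlines(path)

# Rewrite the file; getlines still returns the cached (stale) lines,
# because it does not re-check the file on its own.
with open(path, "w") as f:
    f.write("def f(x):\n    return (x) < (4)\n")
stale = linecache.getlines(path)
assert stale == first

# Only after clearing the cache is the new content read from disk.
linecache.clearcache()
fresh = linecache.getlines(path)
os.remove(path)
```

Note that `linecache.checkcache()` invalidates entries based on mtime and size, so a same-length edit within the filesystem's timestamp resolution can slip through even that check.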
I've also had it throw an exception after adding more lines to a pipeline that contain lambdas:
```text
2024-04-21 21:00:18.067 [debug] [Runner] [9fd84cfe-e31c-4201-b80f-86cef175ae0a] lineno is out of bounds
at E:\Repositories\safe-ds\Runner\src\safeds_runner\server\_pipeline_manager.py line 298
at <frozen runpy> line 226
at <frozen runpy> line 98
at <frozen runpy> line 88
at demo/gen_titanic_example3 line 4
at demo/gen_titanic line 19 (mapped to 'titanic.sds' line 7)
at E:\Repositories\safe-ds\Runner\src\safeds_runner\server\_pipeline_manager.py line 419
at E:\Repositories\safe-ds\Runner\src\safeds_runner\server\_memoization_map.py line 146
at E:\Repositories\safe-ds\Runner\src\safeds_runner\server\_memoization_utils.py line 396
at E:\Repositories\safe-ds\Runner\src\safeds_runner\server\_memoization_utils.py line 343
at E:\Repositories\safe-ds\Runner\src\safeds_runner\server\_memoization_utils.py line 343
at E:\Repositories\safe-ds\Runner\src\safeds_runner\server\_memoization_utils.py line 345
at C:\Users\Lars\AppData\Local\Programs\Python\Python312\Lib\inspect.py line 1282
at C:\Users\Lars\AppData\Local\Programs\Python\Python312\Lib\inspect.py line 1264
at C:\Users\Lars\AppData\Local\Programs\Python\Python312\Lib\inspect.py line 1128
```
Looks like this Python bug is back. Got it working now 🎉.
Thanks for investigating further, and great that it is now working 🎉
:tada: This PR is included in version 0.12.0 :tada:
The release is available on:
- v0.12.0
Your semantic-release bot :package::rocket:
Closes #85

Summary of Changes

`.metapath` is reset to remove the custom generated Safe-DS pipeline code; only global library imports (and settings) should remain. If this is a concern, `maxtasksperchild` can be set to `1`, in which case pipeline processes are not reused.