apache / incubator-heron

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter
https://heron.apache.org/
Apache License 2.0
3.65k stars 598 forks source link

Fix for missing physical plan in UI #3786

Closed nicknezis closed 2 years ago

nicknezis commented 2 years ago

Fixes #3784 and Fixes #3785

nicknezis commented 2 years ago

Nice catch. The tests seem to work for me, but there's definitely more to clean up as you mentioned. I guess I started making the code match this line: https://github.com/apache/incubator-heron/blob/4d1ab84eea99a374c6838cb9401d6a7fc908af24/heron/tools/ui/resources/static/js/stat-trendlines.js#L264 with the assumption we had broken something in the Tracker cleanup. But looking back at old versions of the code, this UI query may never have matched up with the tracker api. Very odd. I personally prefer the /metrics/query syntax, but might be easier to just roll back to /metricsquery style route. And instead I'll update the javascript to use the proper api call.

surahman commented 2 years ago

But looking back at old versions of the code, this UI query may never have matched up with the tracker api. Very odd.

My guess is technical debt.

I personally prefer the /metrics/query syntax, but might be easier to just roll back to /metricsquery style route.

I agree. Good REST API design necessitates versioning of the API as well as nesting under path routes. All metrics queries should be nested under the /metrics/ path.

Edit: I stand corrected, this is generating a query request.

https://github.com/apache/incubator-heron/blob/dd23309cba06853e495f378f5d2cf2c50706a41e/heron/tools/ui/resources/static/js/plan-stats.js#L378

nicknezis commented 2 years ago

I'm running into some other issues with parameter mismatch betwen ui and tracker. So I'm doing a bit of testing. One example is topology vs topology_name. There is an alias here, but it didn't seem to work. I was testing with a failed Tracker API call and got different response when using either topology or topology_name. Perhaps the alias isn't working? @Code0x58 might have some insight in helping answer that question.

Another issue I'm seeing is an exception on an API call because role isn't present. But it's listed as Optional. The parameter is treated differently between the two set of routes. Example 1 and Example 2.

surahman commented 2 years ago

I have always had issues bringing up the UI so I can only look at the code.

One example is topology vs topology_name. There is an alias here, but it didn't seem to work.

The naming convention for topologies has changed on K8s and I am not sure if this would affect the UI. Labels have the topology's name and the StatefulSet name has either -manager or -executors appended to it. I am not sure how the tracker manages the scraping and collecting of stats.

Another issue I'm seeing is an exception on an API call because role isn't present. But it's listed as Optional. The parameter is treated differently between the two set of routes. Example 1 and Example 2.

According to the Optional docs, a value of None will be returned as required. How is a role value of None handled? Exception? What changed?

nicknezis commented 2 years ago

So for testing heron-tracker and heron-ui, I realized doing the local install is much easier. You can try this:

  1. bazel run --config=darwin_nostyle -- scripts/packages:heron-install.sh --user
  2. Add ~/.heron/bin/ to your $PATH
  3. Submit analytic:
    heron submit local ~/.heron/examples/heron-streamlet-examples.jar \  org.apache.heron.examples.streamlet.WindowedWordCountTopology \
    WindowedWordCountTopology \
    --deploy-deactivated
  4. Run the tracker: heron-tracker --type file --rootpath ~/.herondata/repository/state/local/
  5. Run the UI: heron-ui
Code0x58 commented 2 years ago

I'm running into some other issues with parameter mismatch betwen ui and tracker. So I'm doing a bit of testing. One example is topology vs topology_name. There is an alias here, but it didn't seem to work. I was testing with a failed Tracker API call and got different response when using either topology or topology_name. Perhaps the alias isn't working? @Code0x58 might have some insight in helping answer that question.

It looks like alias may be a bit of a misnomer, it may be better to call it source, if this reference covers everything

nicknezis commented 2 years ago

I've been banging my head against this code. Things are definitely broken. I have more drastic changes, but I haven't pushed them yet because I'm still testing. The exceptions page is still broken. Somewhere we lost the combo of component and instances query parameters.

surahman commented 2 years ago

I am time-constrained for the next few weeks and I still cannot bring up the UI. I typically find it faster to jump in with a debugger and find the issue causing the problem and address it there. Failing that it is time to find what caused the issue.

We need to isolate the commit in which this happened. I would start by checking out the last tagged release to see if the UI is working there. If it is, run a Git bisect between that commit and master branch's HEAD to find the problematic commit. You should have a better idea of what caused the issue. There were major changes to the backend of the UI in the PR that bumped Python to 3.8/3.9, so that is a potential candidate.

nicknezis commented 2 years ago

I figured out one of my main blockers that I was banging my head against. The Topology object that I was referencing was a Protobuf object. Once I reference the proper TopologyInfo type, the code worked better. Finally making progress again. Will have more commits to this PR in the near future.

nicknezis commented 2 years ago

I now have most of it working, except the exceptions.json call on heron-ui. I think I'm going to have to put some of the ResponseEnvelope stuff back. Running into an issue with the ApiResponse requiring a dict, but it's getting a list instead. Hopefully will have this all figured out soon.

nicknezis commented 2 years ago

To give some context to one of the issues I ran into with this work. In the Heron Tracker, we are now using FastAPI. There is a limitation in the framework which does not allow for custom routers in a nested router. Some information can be found here: https://github.com/tiangolo/fastapi/issues/4110

So I think my options are as follows.

  1. Remove the envelope which was part of each returned API call. Not sure if it's a great idea to have. We didn't really use the feature. We never set the timing info. We already have HTTP return codes for status.
  2. Keep the envelope on API returns, but implement in each handler method. The custom API Router decoration doesn't seem to work.
  3. I can technically hack it by making separate nested App instantiations, but this seems wrong.

In the PR, I've gone back and forth between option 1 and 2 as I worked through various issues. I'm currently leaning towards option 1. This is reflected in the PR as it currently stands. But I can be convinced that it is better to go back to the original API responses with the envelope on each Heron Tracker API result.

surahman commented 2 years ago
  1. Remove the envelope which was part of each returned API call. Not sure if it's a great idea to have. We didn't really use the feature. We never set the timing info. We already have HTTP return codes for status

If option 1 simplifies the code then I would agree with going down this route but my fear is that it might break something elsewhere. I think it would likely be better to go down this route and bring things up to current spec than to hack things or work around the issue. Let us plug this issue permanently or it will resurface later.

For option 2, how many/which handlers would we need to update to handle the encasing envelopes?

I am looking things over and I will try to assist as much as I can on this PR.

surahman commented 2 years ago

I think we can safely remove the ResponseEnvelope. There may have been grander plans for its use, but they do not seem to have come to fruition.

https://github.com/apache/incubator-heron/blob/8841d1c4d6b3156a40ffed856fb32c1ceb682855/heron/tools/tracker/src/python/utils.py#L56-L63

It is just a data struct with references to it restricted to: https://github.com/apache/incubator-heron/blob/8841d1c4d6b3156a40ffed856fb32c1ceb682855/heron/tools/tracker/src/python/app.py https://github.com/apache/incubator-heron/blob/8841d1c4d6b3156a40ffed856fb32c1ceb682855/heron/tools/tracker/src/python/utils.py The values of importance are limited to the HTTP response codes, which are themselves restricted to success and failure. Nothing else in the data structure is ever used in a worthwhile way.

nicknezis commented 2 years ago

Fixes #3801

nicknezis commented 2 years ago

I think I have most of the UI working again. But having issues with this menu on the Exception page. I'm not sure, but this might have been broken for a long time before. I'm inclined to wrap up this PR to get it into the master branch. We can fix the exceptions menu at a later point. Don't want to leave the main branch broken for too long.

image
surahman commented 2 years ago

We can fix the exceptions menu at a later point. Don't want to leave the main branch broken for too long.

I agree, let us get this into a stable state and merge. We can create another set of issues for what we know is broken and outstanding.

surahman commented 2 years ago

There are multiple test failures, some of which are justified, and others that require updates to the test suite.

The following failures are related to the JSON responses. I think you are better positioned to review and update the tests depending on how you designed the production code.

https://github.com/apache/incubator-heron/blob/af7053ec8162ec43ea7b58d4ddf746ff2b2ab43d/heron/tools/tracker/tests/python/app_unittest.py#L37-L38

https://github.com/apache/incubator-heron/blob/af7053ec8162ec43ea7b58d4ddf746ff2b2ab43d/heron/tools/tracker/tests/python/app_unittest.py#L45

https://github.com/apache/incubator-heron/blob/af7053ec8162ec43ea7b58d4ddf746ff2b2ab43d/heron/tools/tracker/tests/python/app_unittest.py#L52

EDIT: ~I have pushed some test changes to the repo for the failing integration tests above.~ These tests should not be updated unless the production logic surrounding them has been updated to handle the output in the tests.

nicknezis commented 2 years ago

Another realization I had thanks to some chatting with @surahman ... The Tracker client is not only used by the Heron UI. There seem to also be references in Heron Explorer (I'm not 100% sure what that code does). But we will need to ensure those API changes did not break those referencing components.

surahman commented 2 years ago

This is an interesting set of issues to resolve and is complicated by a lack of access to design documents. There are a lot of moving parts and interdependencies.

nicknezis commented 2 years ago

Verified that the changes didn't break Heron Explorer (I actually had never used it before, but it's pretty handy tool). The Tracker client code helped abstract away the changes.

nicknezis commented 2 years ago

Updated pylint to resolve an issue with it failing on Python 3.9+. https://github.com/PyCQA/pylint/issues/3882

nicknezis commented 2 years ago

I'm now updating to resolve the style fix suggestions.

surahman commented 2 years ago

I will get started on the bottom end of the list of style errors to help out. Files I am looking at:

Completed:

************* Module basics.base_instance
************* Module basics.bolt_instance
************* Module basics.spout_instance
************* Module instance
************* Module network.gateway_looper
************* Module network.heron_client
************* Module network.metricsmgr_client
************* Module network.protocol
************* Module network.socket_options
************* Module utils.metrics.metrics_helper
************* Module utils.metrics.py_metrics
************* Module utils.misc.pplan_helper
************* Module utils.misc.serializer_helper
************* Module utils.topology.topology_context_impl

Someone with more Python3 experience should look over my work. Each file is in its own commit so that it can be dropped if problematic.

surahman commented 2 years ago

Resolving the merge conflict with changes from this branch.

surahman commented 2 years ago

I think these style fixes will require editing most of the Python codebase. The errors Just keep propagating.

nicknezis commented 2 years ago

I think these style fixes will require editing most of the Python codebase. The errors Just keep propagating.

Yes. Lots of Python changes. But most of them seem to be related to the f-string syntax. Hopefully once we get through that, it won't be too bad. Just takes a lot of time to fix each one and repeat the process.

nicknezis commented 2 years ago

@surahman Could you help me convert the deprecated asyncore to asyncio? I might just add the waiver for now, but we should update it because it will be removed in Python 3.12. What do you think? I don't know how easy it will be to switch over.

Edit: Made ticket #3810

surahman commented 2 years ago

@surahman Could you help me convert the deprecated asyncore to asyncio?

I have not worked with either of those libraries but I can look into it when I have some time. It is best to address this before it becomes an issue.

surahman commented 2 years ago

I just removed a reference to update the global Log that I added.

nicknezis commented 2 years ago

I can't figure out why the timeline is not working. I've tracked my issue to this line of code. Within the compute_max function, I can print the final dict that is returned. It has values. But when I print data, it only has an empty dict.

surahman commented 2 years ago

I am not sure but I do not think there is a need to convert the zipped together values into a list and then a dict. But then I also do not believe this would affect the return value.

https://github.com/apache/incubator-heron/blob/06e9f75988fd254053b640689879c3393703b0dd/heron/tools/common/src/python/clients/tracker.py#L626

Maybe try this:

return dict(zip(keys, values))
nicknezis commented 2 years ago

I am not sure but I do not think there is a need to convert the zipped together values into a list and then a dict. But then I also do not believe this would affect the return value.

https://github.com/apache/incubator-heron/blob/06e9f75988fd254053b640689879c3393703b0dd/heron/tools/common/src/python/clients/tracker.py#L626

Maybe try this:

return dict(zip(keys, values))

I think zip() returns an Iterator and the list() call converts to an actual List. I was not familiar with the zip method, but it seems to be a standard pattern. When I print the value returned from that call I can see the value. It's only when that same result is returned from the compute_max that I get an empty {} in the data variable.

surahman commented 2 years ago

I think zip() returns an Iterator and the list() call converts to an actual List.

I believe dict might have a constructor to handle zip return values as seen here.

Code0x58 commented 2 years ago

I can't figure out why the timeline is not working. I've tracked my issue to this line of code. Within the compute_max function, I can print the final dict that is returned. It has values. But when I print data, it only has an empty dict.

Did you do that print within the len(filtered_ts) > 0 and len(filtered_ts[0]["timeline"]) > 0: branch of compute_max(...)? I'm wondering if the {} you are seeing is from the fall-though branch.

Beware that timelines and values in compute_max(...) are generators, so will be exhausted if you do something with them before they are used in the return dict(list(zip(keys, values))), so if you did print(dict(list(zip(keys, values)))) it would consume values and lead to zip of the keys list and an empty generator which results in an empty zip (python 3.10 adds strict=True to help detect these) and would so make the return dict(...) line return an empty dict.

p.s.

I am not sure but I do not think there is a need to convert the zipped together values into a list and then a dict. But then I also do not believe this would affect the return value.

https://github.com/apache/incubator-heron/blob/06e9f75988fd254053b640689879c3393703b0dd/heron/tools/common/src/python/clients/tracker.py#L626

Maybe try this:

return dict(zip(keys, values))

both are equivalent and will be handled the same way due to python's duck typing, as both a generator (returned by zip, in type annotations it is compatible with typing.Iterator but is more strictly a typing.Generator) and list will be treated as iterables. The use of list(generator) can be faster for some cases, but constructs an object which will be discarded so can be a waste of memory.

surahman commented 2 years ago

Thank you for the detailed explanation!

nicknezis commented 2 years ago

I think this PR is finally done. I'm now able to see the timeline in the UI and most of the functionality is working (as good as it probably was working in the past). There are some super deep menus I think I'd like to clean up in the future, but this is good to go. Thank you for the help @surahman @Code0x58 and @thinker0 !

Edit: I'll add. This PR fixes the following functionality which was temporarily broken with the upgrade to Python 3. Most of the issues were related to the Heron UI and Heron Tracker APIs. Some of the issues were due to the updated code enforcing type hints in and out of methods. I've manually verified the below items.

  1. Front page status circles
  2. Front page timelines
  3. Exception browser in the UI
  4. Topology Counters
  5. Container File browser
  6. File stats
  7. Log viewer
thinker0 commented 2 years ago

+1

My Env Test Success

thinker0 commented 2 years ago

3814 typo