ManageIQ / manageiq-automation_engine

Automation engine for ManageIQ
Apache License 2.0

Explicitly pass a local URI for our client's DRb.start_service #431

Closed jrafanie closed 4 years ago

jrafanie commented 4 years ago

While testing some workspace instantiation at home, I was seeing this in the logs:

[----] E, [2020-03-12T16:35:06.554957 #39577:3ff30d01cf44] ERROR -- : <AEMethod available_resource_groups>   DRb::DRbConnError: druby://92.242.140.21:54933 - #<Errno::ETIMEDOUT: Operation timed out - connect(2) for "92.242.140.21" port 54933>
[----] E, [2020-03-12T16:35:06.556663 #39577:3ff30d01cf44] ERROR -- : <AEMethod available_resource_groups>   (drbunix:///var/folders/bf/hh6xb_k15wbfzg4tlbkzch300000gn/T/automation_engine20200312-39577-6nm7ah) /Users/joerafaniello/.rubies/ruby-2.6.5/lib/ruby/2.6.0/drb/drb.rb:744:in `rescue in block in open'

Note that one side reports the drbunix:///var/folders... unix socket, while the timeout comes from the druby://92.242... TCP socket. Weird.

I'm using Verizon FiOS DNS at home, and somehow a bare DRb.start_service ends up pointing at a remote address. I believe this happens in the code below: because we're not passing a URI, DRb has no hostname to work with, so it tries to resolve the local hostname: https://github.com/ruby/ruby/blob/v2_6_5/lib/drb/drb.rb#L879-L884

irb(main):001:0> s = DRb.start_service
irb(main):002:0> s.uri
=> "druby://92.242.140.21:51000"

Note, this is the IP address of verizon's DNS assistance program, as mentioned here: https://askubuntu.com/questions/587895/why-is-the-ip-92-242-140-21-connecting-to-one-of-my-ports-is-it-malware#comment1583793_587954
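
For anyone trying to reproduce this, here is a rough diagnostic sketch for checking what the local hostname resolves to and whether that address stays on the machine. It is not DRb's own lookup code; the helper names (resolved_local_address, loopback_address?, private_address?) are made up for illustration and only use Ruby's standard Socket and Addrinfo classes.

```ruby
require "socket"

# Rough diagnostic, not DRb's own code: look up the IPv4 address the local
# hostname resolves to. Returns nil when the name does not resolve at all.
def resolved_local_address(hostname = Socket.gethostname)
  Addrinfo.getaddrinfo(hostname, nil, :INET, :STREAM).first&.ip_address
rescue SocketError
  nil
end

# True when the address is loopback, i.e. traffic never leaves this machine.
def loopback_address?(ip)
  Addrinfo.ip(ip).ipv4_loopback?
end

# True for private (RFC 1918) ranges: on the LAN, but still not loopback.
def private_address?(ip)
  Addrinfo.ip(ip).ipv4_private?
end

ip = resolved_local_address
puts "hostname resolves to #{ip.inspect}"
puts "loopback: #{loopback_address?(ip)}, private: #{private_address?(ip)}" if ip
```

If the resolved address is neither loopback nor private, as with 92.242.140.21 above, a DRb service started without a URI is likely to advertise an address that other processes on the same machine cannot reach.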

We can avoid this entirely by explicitly telling DRb to use a local URI for our DRb client:

irb(main):020:0>   Dir::Tmpname.create("automation_engine_client", nil) do |path|
irb(main):021:1*     puts DRb.start_service("drbunix://#{path}").uri
irb(main):022:1>     FileUtils.chmod(0o750, path)
irb(main):023:1>   end; nil
drbunix:///var/folders/bf/hh6xb_k15wbfzg4tlbkzch300000gn/T/automation_engine_client20200313-69238-im1kfc

We've seen this weird thing in the past, where our server was using unix sockets and the client was failing while trying to access a TCP DRb server. The solution proposed back then, which was never merged, was to remove the TCP socket option from the server. I now believe the problem might actually have been what you see above: the DRb client also starts a DRb server of its own to talk to the unix socket DRb server.

https://github.com/ManageIQ/manageiq-automation_engine/pull/234
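
To illustrate why the client side needs its own reachable DRb service, here is a minimal self-contained sketch, not the automation engine's actual code. FakeWorkspace, each_resource_group, and drb_unix_round_trip are hypothetical names. The server calls back into a block that lives on the client, so it has to connect to whatever URI the client's service advertises; with both sides on drbunix sockets, that callback stays on the local machine.

```ruby
require "drb"
require "drb/unix"
require "tmpdir"

# Hypothetical front object standing in for the automation workspace.
class FakeWorkspace
  # Yields to the caller's block. Over DRb that block lives on the client,
  # so the server has to connect back to the client's own DRb service.
  def each_resource_group
    %w[group_a group_b].each { |name| yield name }
  end
end

# Starts a client service and a server on unix sockets, makes one call with
# a block, and returns the client's URI plus the names the block received.
def drb_unix_round_trip
  Dir.mktmpdir("drb") do |dir|
    # Start the client side first so it becomes DRb's primary server and
    # the block proxy advertises its drbunix URI.
    client = DRb.start_service("drbunix://#{File.join(dir, 'client')}")
    server = DRb::DRbServer.new("drbunix://#{File.join(dir, 'server')}", FakeWorkspace.new)
    begin
      names = []
      remote = DRb::DRbObject.new_with_uri(server.uri)
      remote.each_resource_group { |name| names << name }
      [client.uri, names]
    ensure
      server.stop_service
      DRb.stop_service
    end
  end
end

client_uri, names = drb_unix_round_trip
puts client_uri
p names
```

If the client had been started with a bare DRb.start_service instead, the callback from the server would go to whatever TCP address the hostname resolved to, which matches the timeout seen in the logs above.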

Thanks @agrare for helping me debug this.

jrafanie commented 4 years ago

Maybe this should use a local unix socket? Thoughts?

coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 3827


Files with Coverage Reduction                                     New Missed Lines   %
app/models/miq_ae_datastore.rb                                    1                  71.08%
lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_method.rb   1                  79.38%
lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_object.rb   1                  90.99%
app/models/miq_ae_yaml_import.rb                                  4                  97.29%
Total                                                             7
Totals Coverage Status
Change from base Build 3798: 0.02%
Covered Lines: 5121
Relevant Lines: 5960

💛 - Coveralls

coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 3853


Totals Coverage Status
Change from base Build 3798: 0.02%
Covered Lines: 5072
Relevant Lines: 5903

💛 - Coveralls

Fryguy commented 4 years ago

@jrafanie As discussed, should we just use a unix socket instead?

Fryguy commented 4 years ago

Also, FWIW, on my machine here's what I get...also using verizon fios dns at home...

require "drb"
drb = DRb.start_service
drb.uri
# => "druby://jfrey-mac.fios-router.home:56755"

That resolves to a 192.168.1.### address.

(EDIT: Discussed offline with @jrafanie and it turns out my modern router actually bridges the 5 GHz and 2.4 GHz ranges to the ethernet, so my machine is given an internal IP. Either way, it's using a non-local IP.)

jrafanie commented 4 years ago

Ok, I've updated the code to use a unix socket and updated the description. I've tested this by going to Services -> Catalog, choosing "Order" for a service. Prior to this code change, I would get timeout errors. With this code change, I can get the dropdown populated.

mkanoor commented 4 years ago

@jrafanie I don't think we can use static names: there could be multiple instances of the engine running on a single appliance, and each one would need a dedicated connection name.

mkanoor commented 4 years ago

@jrafanie Plus you have methods that can invoke other methods, each method would need a dedicated connection name.

jrafanie commented 4 years ago

> @jrafanie Plus you have methods that can invoke other methods, each method would need a dedicated connection name.

@mkanoor see the in-line comment above, I think it addresses your concern

jrafanie commented 4 years ago

One second... I just got this while testing David's original issue:

[----] E, [2020-03-13T15:40:35.536582 #77390:3fdaf18cd3e8] ERROR -- : Method STDERR: The following error occurred during inline method preamble evaluation:
[----] E, [2020-03-13T15:40:35.536745 #77390:3fdaf18cd3e8] ERROR -- : Method STDERR: ArgumentError: too long unix socket path (111bytes given but 104bytes max)
mkanoor commented 4 years ago

👍

jrafanie commented 4 years ago

> One second... I just got this while testing David's original issue:
>
> [----] E, [2020-03-13T15:40:35.536582 #77390:3fdaf18cd3e8] ERROR -- : Method STDERR: The following error occurred during inline method preamble evaluation:
> [----] E, [2020-03-13T15:40:35.536745 #77390:3fdaf18cd3e8] ERROR -- : Method STDERR: ArgumentError: too long unix socket path (111bytes given but 104bytes max)

Ok, I shortened the name to automation_client, which matches the length of the existing automation_engine name used for the server.

I tested this a few more times and haven't seen any errors.
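
As a hedged aside, the length limit can be checked up front rather than discovered at runtime. The sketch below relies on Socket.pack_sockaddr_un raising ArgumentError for paths longer than the platform allows (104 bytes in the error above; the limit varies by OS). The helper name unix_socket_path_fits? is made up for illustration.

```ruby
require "socket"

# Returns true when the OS accepts `path` as a unix socket address.
# Socket.pack_sockaddr_un raises ArgumentError for paths longer than the
# platform's limit, so this needs no hard-coded byte count.
def unix_socket_path_fits?(path)
  Socket.pack_sockaddr_un(path)
  true
rescue ArgumentError
  false
end

puts unix_socket_path_fits?("/tmp/automation_client-example")
```

Temp directories on macOS (/var/folders/...) already use up a good part of the budget, which is why a few extra characters in the name prefix were enough to tip it over.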

jrafanie commented 4 years ago

@miq-bot rm_label wip

miq-bot commented 4 years ago

Checked commits https://github.com/jrafanie/manageiq-automation_engine/compare/6f566837141d60ab11f566a7a22328597e3b3e71~...1d6d6d9de3d0e7714df01db325ae07e22baccb7c with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
1 file checked, 0 offenses detected
Everything looks fine. :cake:

jrafanie commented 4 years ago

@miq-bot add_label jansa/yes?

gmcculloug commented 4 years ago

> We can avoid this entirely by

@jrafanie I was thinking the right answer was going to be switching from verizon. But I guess this approach works as well.

Dropped these changes on an appliance I used yesterday for provisioning Services and VMs. I ran simultaneous service and vm provisioning as well as retirement. Everything looks good.

Thanks.

simaishi commented 4 years ago

Jansa backport details:

$ git log -1
commit 8026eaa398a345b10167619a4dfc138fde624013
Author: Greg McCullough <gmccullo@redhat.com>
Date:   Wed Mar 18 19:47:29 2020 -0400

    Merge pull request #431 from jrafanie/force_client_drb_connection_locally

    Explicitly pass a local URI for our client's DRb.start_service

    (cherry picked from commit 3b71688b4eff4f6ba2741168372cc14dbd541a98)