NebraLtd / hm-diag

Helium Miner Diagnostics
https://nebra.io/hnt
MIT License

Manufacturing broken because of gateway_mfr error #264

Closed kashifpk closed 2 years ago

kashifpk commented 2 years ago

@shawaj @SebastianMaj @marvinmarnold

Manufacturing is using the images sent to them on Dec 2, 2021 (FIRMWARE_VERSION=2021.11.30.1). We still see the error "gateway_mfr test finished with error". This means we're not on-boarding any units until this gets cleared.

This is the screenshot of the output they get:

image

Here is also the full debug log from syslog:

Dec  3 02:02:47 hpt6 2021-12-03 02:02:47,735 INFO  [hpt] Decoded diagnostics report: {'diagnostics_passed': False, 'errors': ['ECC', 'BN', 'LOR', 'OK', 'PK', 'PF'], 'serial_number': '00000000b9c4a6f2', 'ECC': 'gateway_mfr test finished with error, {"result": "fail", "tests": [{"output": "ok", "result": "pass", "test": "serial"}, {"output": "unlocked", "result": "fail", "test": "zone_locked(data)"}, {"output": "unlocked", "result": "fail", "test": "zone_locked(config)"}, {"output": "invalid slot: 0", "result": "fail", "test": "slot_config(0..=15, ecc)"}, {"output": "invalid slot: 3", "result": "fail", "test": "key_config(0..=15, ecc)"}, {"output": "ecc608 error\\n\\nCaused by:\\n    ecc error ExecError", "result": "fail", "test": "miner_key(0)"}, {"output": "ecc608 error\\n\\nCaused by:\\n    ecc error ExecError", "result": "fail", "test": "sign(0)"}, {"output": "ecc608 error\\n\\nCaused by:\\n    ecc error ExecError", "result": "fail", "test": "ecdh(0)"}]}', 'E0': 'F0:4C:D5:5A:52:73', 'eth_mac_address': 'F0:4C:D5:5A:52:73', 'W0': '34:C3:D2:37:E0:72', 'wifi_mac_address': '34:C3:D2:37:E0:72', 'BN': 'Env var BALENA_DEVICE_NAME_AT_INIT not set', 'BALENA_DEVICE_NAME_AT_INIT': 'Env var BALENA_DEVICE_NAME_AT_INIT not set', 'ID': '61e1191213e5ef678b71ec612c4128a0', 'BALENA_DEVICE_UUID': '61e1191213e5ef678b71ec612c4128a0', 'BA': 'HELIUM-OUTDOOR-915', 'BALENA_APP_NAME': 'HELIUM-OUTDOOR-915', 'FR': '915', 'FREQ': '915', 'FW': '2021.11.30.1', 'FIRMWARE_VERSION': '2021.11.30.1', 'VA': 'NEBHNT-OUT1', 'VARIANT': 'NEBHNT-OUT1', 'BT': True, 'LTE': False, 'LOR': False, 'lora': False, 'OK': 'gateway_mfr exited with a non-zero status', 'onboarding_key': 'gateway_mfr exited with a non-zero status', 'PK': 'gateway_mfr exited with a non-zero status', 'public_key': 'gateway_mfr exited with a non-zero status', 'PF': False, 'legacy_pass_fail': False}
Dec  3 02:02:47 hpt6 2021-12-03 02:02:47,779 DEBUG [hpt] Trying to get version information from /version.
Dec  3 02:02:47 hpt6 2021-12-03 02:02:47,805 INFO  [hpt] 🟢  Version Information:#012diagnostics_version: 72f141f#012firmware_version: 2021.11.30.1
Dec  3 02:02:47 hpt6 2021-12-03 02:02:47,820 INFO  [hpt] Processing miner using new hpt logic.
Dec  3 02:02:47 hpt6 2021-12-03 02:02:47,828 ERROR [hpt] Miner FAILED diagnostics because: {'ECC', 'LOR', 'PK', 'OK'}
Dec  3 02:02:47 hpt6 2021-12-03 02:02:47,841 DEBUG [hpt] initFile.txt returned but not all tests passed:#012ECC Error: gateway_mfr test finished with error, {"result": "fail", "tests": [{"output": "ok", "result": "pass", "test": "serial"}, {"output": "unlocked", "result": "fail", "test": "zone_locked(data)"}, {"output": "unlocked", "result": "fail", "test": "zone_locked(config)"}, {"output": "invalid slot: 0", "result": "fail", "test": "slot_config(0..=15, ecc)"}, {"output": "invalid slot: 3", "result": "fail", "test": "key_config(0..=15, ecc)"}, {"output": "ecc608 error\n\nCaused by:\n    ecc error ExecError", "result": "fail", "test": "miner_key(0)"}, {"output": "ecc608 error\n\nCaused by:\n    ecc error ExecError", "result": "fail", "test": "sign(0)"}, {"output": "ecc608 error\n\nCaused by:\n    ecc error ExecError", "result": "fail", "test": "ecdh(0)"}]}#012BN Error: Env var BALENA_DEVICE_NAME_AT_INIT not set#012LOR Error: False#012OK Error: gateway_mfr exited with a non-zero status#012PK Error: gateway_mfr exited with a non-zero status#012PF Error: False
Dec  3 02:02:47 hpt6 2021-12-03 02:02:47,842 WARNI [hpt] , retrying in 10 seconds...
shawaj commented 2 years ago

My vote would be to return to using the hm-gwmfr container until this can be resolved, @marvinmarnold @vpetersson, because the new gateway-mfr-rs has caused nothing but trouble IMO.

shawaj commented 2 years ago

@kashifpk I wonder if the issue could be something to do with this:

https://github.com/NebraLtd/hm-diag/blob/8430efdddc5710ccfb65d7b1d7a2bf3a4d212817/hw_diag/app.py#L22-L25

Or perhaps here:

https://github.com/NebraLtd/hm-pyhelper/blob/8e13f79ea3daba04c250a4d94fc59d6388050e6c/hm_pyhelper/miner_param.py#L94-L109

kashifpk commented 2 years ago

@kashifpk I wonder if the issue could be something to do with this:

@shawaj no idea TBH. I haven't touched the ECC-related stuff yet. I will dive into that once I receive the miners (they are stuck at customs here). One thing that is strange is that on the latest master of hm-pyhelper I have the provision_key function at line 74 and not at line 94.

Comparing https://github.com/NebraLtd/hm-pyhelper/blob/master/hm_pyhelper/miner_param.py with the version that I have (screenshot attached), the code is a bit different too.

Checking why the difference is there in master.

shawaj commented 2 years ago

I'm confused as to what you are saying.

hm-pyhelper master has it at line 94 where I'm looking?

kashifpk commented 2 years ago

Yes, this is crazy. The hm-pyhelper I have has a commit from Nov 9, and doing a git pull origin master says everything is up to date. Anyway, I will just do a fresh clone of the repo.

shawaj commented 2 years ago

@kashifpk looking at the error above, it seems to me as if the test is failing because the ECC hasn't been programmed, but provision_key() should notice that and then provision the key.

I guess this logic is not working correctly for some reason.

kashifpk commented 2 years ago

A fresh clone got me the latest changes, so we have the same code, and I see it now. Maybe we need some way of returning the output of https://github.com/NebraLtd/hm-pyhelper/blob/8e13f79ea3daba04c250a4d94fc59d6388050e6c/hm_pyhelper/miner_param.py#L103 to see why it's actually failing?
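
A minimal sketch of how that output could be surfaced, assuming a subprocess-based wrapper like run_gateway_mfr (the path and names here are illustrative, not the actual hm_pyhelper code):

import logging
import subprocess

LOGGER = logging.getLogger(__name__)

# Assumed location of the binary; the real path depends on the install.
GATEWAY_MFR_PATH = "/opt/python-dependencies/hm_pyhelper/gateway_mfr"

def run_gateway_mfr_verbose(args):
    """Run gateway_mfr and log its output even when it exits non-zero."""
    try:
        result = subprocess.run(
            [GATEWAY_MFR_PATH] + list(args),
            capture_output=True,
            check=True,
        )
    except subprocess.CalledProcessError as e:
        # Surface the real failure reason instead of only the exit status.
        LOGGER.error("gateway_mfr %s exited with %s", args, e.returncode)
        LOGGER.error("stdout: %s", e.stdout)
        LOGGER.error("stderr: %s", e.stderr)
        raise
    LOGGER.info("gateway_mfr %s stdout: %s", args, result.stdout)
    return result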

shawaj commented 2 years ago

To me it looks like initFile.txt is being generated before provision_key even fires (perhaps bypassing it altogether).

The response in that file is what I would expect to see on an ECC that hasn't been provisioned

vpetersson commented 2 years ago

To me it looks like initFile.txt is being generated before provision_key even fires (perhaps bypassing it altogether).

The response in that file is what I would expect to see on an ECC that hasn't been provisioned

No, the provisioning is executed before the app server is even started. That said, we might need to validate that this logic in fact works properly.
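
For reference, the intended ordering is roughly the following. This is a simplified sketch under assumed names (provision_key standing in for the hm_pyhelper call), not the actual hw_diag/app.py code:

import logging
from time import sleep

LOGGER = logging.getLogger(__name__)

def perform_key_provisioning(provision_key, tries=10, delay=10):
    """Retry provisioning until it succeeds, before the app serves anything."""
    for attempt in range(1, tries + 1):
        if provision_key():
            LOGGER.info("ECC provisioning succeeded on attempt %s", attempt)
            return
        LOGGER.warning("Provisioning attempt %s failed, retrying in %ss", attempt, delay)
        sleep(delay)
    raise RuntimeError("ECC could not be provisioned")

def get_app(provision_key, create_flask_app):
    # Provisioning runs first; only then is the WSGI app built, so a
    # persistent provisioning failure would keep the gunicorn worker
    # from booting at all.
    perform_key_provisioning(provision_key)
    return create_flask_app()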

shawaj commented 2 years ago

we might need to validate that this logic in fact works properly

It definitely doesn't. I can tell you that for free 😉

The provisioning works if you run it manually, but it doesn't if you just turn on a device. The same happens on the RockPi. I thought it was RockPi-specific, but clearly not.

kashifpk commented 2 years ago

Update:

On production we have devices with the same firmware and hm-diag version:

{"diagnostics_version":"72f141f","firmware_version":"2021.11.30.1"}

as we have on the manufacturing miners. On testnet the versions are newer, but that is just because of the new PR template merge.

{"diagnostics_version":"68d4984","firmware_version":"2021.11.30.1-1"}

And none of these give any errors. So this should help isolate the issue, because these devices get updated while in manufacturing, before on-boarding, which is when we try to generate keys, etc.?

ilyastrodubtsev commented 2 years ago

@kashifpk I completely agree. We have one of the devices on the testnet, on a newer firmware version than in production. I am trying to test on my miner, but it looks like my flash card has finally died, and I cannot now flash the miner with the penultimate version.

posterzh commented 2 years ago

I think the reason may be that the gateway-mfr executable is built for the target arm-unknown-linux-gnueabihf. The previous target was aarch64-unknown-linux-musl. I will check this in more detail.

posterzh commented 2 years ago

I think the reason may be that the gateway-mfr executable is built for the target arm-unknown-linux-gnueabihf. The previous target was aarch64-unknown-linux-musl. I will check this in more detail.

This is not the case.

shawaj commented 2 years ago

@kashifpk if the key is already provisioned this error won't occur.

All the testnet devices were programmed using the hm-gwmfr container that we used to use, and so they will pass the tests because they have already been correctly provisioned.

The issue is that the new provision_key() method is never firing, or its logic is wrong, and therefore the ECC is never provisioned and fails the tests.

This will not be reproducible on a device that already has a provisioned key (which all testnet devices and all live production devices do).
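
The expected flow is roughly "test, and provision only if the test fails". A sketch of that logic using an assumed run_gateway_mfr helper (one that returns the completed process without raising on a failing test), rather than the real hm_pyhelper implementation:

import json

def provision_key_sketch(run_gateway_mfr):
    """Provision the ECC only when the self-test reports a failure."""
    report = json.loads(run_gateway_mfr(["test"]).stdout)
    if report.get("result") == "pass":
        # Already provisioned and locked: nothing to do.
        return True
    # Unprovisioned chip (zones unlocked, invalid slot config): provision,
    # then re-run the self-test to confirm the fix actually took.
    run_gateway_mfr(["provision"])
    retest = json.loads(run_gateway_mfr(["test"]).stdout)
    return retest.get("result") == "pass"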

kashifpk commented 2 years ago

@shawaj seems like we get this (or at least a similar) error on some of the devices in production. @salmanfarisvp identified some devices and we do see this on those devices too. You can follow the discussion regarding that in Slack.

shawaj commented 2 years ago

As I said above, in my opinion we should revert to the hm-gwmfr container until this can be solved, as this is now holding up production unnecessarily.

kashifpk commented 2 years ago

But in principle yes I agree that this is something related to code that runs when the device is not yet onboarded.

shawaj commented 2 years ago

@shawaj seems like we get this (or at least a similar) error on some of the devices in production. @salmanfarisvp identified some devices and we do see this on those devices too. You can follow the discussion regarding that in Slack.

That's unrelated IMO. Perhaps a failed ECC key or similar.

But also this logic is totally untested and has caused nothing but issues. We had a working way with hm-gwmfr and we broke it for no apparent reason.

Let's revert to that, and we can solve this later.

shawaj commented 2 years ago

@kashifpk do you have a link to the slack conversation you mentioned with @salmanfarisvp ?

vpetersson commented 2 years ago

@shawaj seems like we get this (or at least a similar) error on some of the devices in production. @salmanfarisvp identified some devices and we do see this on those devices too. You can follow the discussion regarding that in Slack.

That's unrelated IMO. Perhaps a failed ECC key or similar.

But also this logic is totally untested and has caused nothing but issues. We had a working way with hm-gwmfr and we broke it for no apparent reason.

Let's revert to that, and we can solve this later.

Reverting this is far from trivial.

kashifpk commented 2 years ago

@kashifpk do you have a link to the slack conversation you mentioned with @salmanfarisvp ?

@shawaj https://nebraltd.slack.com/archives/C024BNQ1Y6T/p1638532099453300

shawaj commented 2 years ago

The only other thing I can think of is: will this logic definitely work:

https://github.com/NebraLtd/hm-pyhelper/blob/8e13f79ea3daba04c250a4d94fc59d6388050e6c/hm_pyhelper/miner_param.py#L94-L109

Compared to what it was before: https://github.com/NebraLtd/hm-pyhelper/blob/407402e9ca43c26e268eb24ea559d08a5b9dc080/hm_pyhelper/miner_param.py#L94-L120

shawaj commented 2 years ago

Reverting this is far from trivial.

Why did we break it in the first place, @vpetersson?

We need to do end-to-end tests of this stuff before winging in changes to manufacturing images.

In any case, it's pretty trivial... we can just revert the diag container to an old version as well as add back in gwmfr.

vpetersson commented 2 years ago

No, because this would break the hotspot production tool. But we just found something that looks promising.

shawaj commented 2 years ago

No, because this would break the hotspot production tool. But we just found something that looks promising.

We can revert that too...

But in any case, this is just more evidence that we need automated end-to-end tests of this stuff before sending new images to Sunsoar. We can't afford downtime like this now, and next year even more so.

vpetersson commented 2 years ago

We can't use the Erlang-based version for the RockPi regardless, so we need to sort this out anyway. Looks like we have found the root cause as well.

shawaj commented 2 years ago

Why can't we use the Erlang version for RockPi?

But in any case, it looks like it's sorted now. Fingers crossed.

shawaj commented 2 years ago

@kashifpk maybe we should add in a log message of "ECC provisioning has run", or, if it skips, "ECC tests passed, skipping provisioning" (in hm-pyhelper).
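
Something along these lines could make the run/skip decision visible in syslog. This is a sketch only; the helper names are placeholders and where exactly it lands in hm-pyhelper is open:

import logging

LOGGER = logging.getLogger(__name__)

def provision_key_with_logging(ecc_tests_pass, provision):
    """Log whether provisioning ran or was skipped, so syslog shows the decision."""
    if ecc_tests_pass():
        LOGGER.info("ECC tests passed, skipping provisioning")
        return True
    ok = provision()
    LOGGER.info("ECC provisioning has run (success=%s)", ok)
    return ok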

shawaj commented 2 years ago

Potentially closed by https://github.com/NebraLtd/helium-miner-software/commit/9b9f232ea17511af8f50a4412c81c39869f59310

kashifpk commented 2 years ago

@kashifpk maybe we should add in a log message of "ECC provisioning has run", or, if it skips, "ECC tests passed, skipping provisioning" (in hm-pyhelper).

Sure, will get this implemented next week.

shawaj commented 2 years ago

Error from completely unprovisioned key:

ECC Error: gateway_mfr test finished with error, 
{
  "result": "fail", 
  "tests": [
    {
      "output": "ok", 
      "result": "pass", 
      "test": "serial"
    }, 
    {
      "output": "unlocked", 
      "result": "fail", 
      "test": "zone_locked(data)"
    }, 
    {
      "output": "unlocked", 
      "result": "fail", 
      "test": "zone_locked(config)"
    }, 
    {
      "output": "invalid slot: 0", 
      "result": "fail", 
      "test": "slot_config(0..=15, ecc)"
    }, 
    {
      "output": "invalid slot: 3", 
      "result": "fail", 
      "test": "key_config(0..=15, ecc)"
    }, 
    {
      "output": "ecc608 error\n\nCaused by:\n    ecc error ExecError", 
      "result": "fail", 
      "test": "miner_key(0)"
    }, 
    {
      "output": "ecc608 error\n\nCaused by:\n    ecc error ExecError", 
      "result": "fail", 
      "test": "sign(0)"
    }, 
    {
      "output": "ecc608 error\n\nCaused by:\n    ecc error ExecError", 
      "result": "fail", 
      "test": "ecdh(0)"
    }
  ]
}

Error from device with errored provisioning (not a compact key):

root@f2bb3b2936d7:/opt/python-dependencies/hm_pyhelper# ./gateway_mfr test
{
  "result": "fail",
  "tests": [
    {
      "output": "ok",
      "result": "pass",
      "test": "serial"
    },
    {
      "output": "ok",
      "result": "pass",
      "test": "zone_locked(data)"
    },
    {
      "output": "ok",
      "result": "pass",
      "test": "zone_locked(config)"
    },
    {
      "output": "ok",
      "result": "pass",
      "test": "slot_config(0..=15, ecc)"
    },
    {
      "output": "ok",
      "result": "pass",
      "test": "key_config(0..=15, ecc)"
    },
    {
      "output": "decode error\n\nCaused by:\n    not a compact key",
      "result": "fail",
      "test": "miner_key(0)"
    },
    {
      "output": "decode error\n\nCaused by:\n    not a compact key",
      "result": "fail",
      "test": "sign(0)"
    },
    {
      "output": "decode error\n\nCaused by:\n    not a compact key",
      "result": "fail",
      "test": "ecdh(0)"
    }
  ]
}

For the first issue, this just means the key has never been provisioned at all, so a standard provisioning should fix it.

For the second issue, it is possible to fix it by moving the device to the hm-gwmfr fleet in balena and running:

/opt/gateway_mfr/bin/gateway_mfr ecc provision_onboard

We should document this somewhere. This can also be used for rekeying a miner for which the person has lost the key.
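
A small sketch of how the two cases could be told apart programmatically from the gateway_mfr test JSON (illustrative only; the binary path is assumed, and the suggested remediations mirror the manual steps above):

import json
import subprocess

GATEWAY_MFR = "/opt/python-dependencies/hm_pyhelper/gateway_mfr"  # assumed path

def suggest_fix():
    """Inspect the ECC self-test output and suggest which remediation applies."""
    out = subprocess.run([GATEWAY_MFR, "test"], capture_output=True).stdout
    report = json.loads(out)
    failures = [t["output"] for t in report["tests"] if t["result"] == "fail"]

    if not failures:
        return "ECC already provisioned, nothing to do"
    if any("not a compact key" in o for o in failures):
        return "errored provisioning: run gateway_mfr ecc provision_onboard from the hm-gwmfr fleet"
    if any("unlocked" in o or "invalid slot" in o for o in failures):
        return "never provisioned: run a standard provisioning (gateway_mfr provision)"
    return "unexpected failure, check the raw output: " + "; ".join(failures)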

shawaj commented 2 years ago

ref https://github.com/helium/gateway-mfr-rs/issues/8

shawaj commented 2 years ago

This error is a timeout error - it can be caused either by an ECC module board that has come loose (on outdoor) or a loose CM3 module / daughterboard - or alternatively (in rare cases) an actual failed ECC:

2021-12-04 01:06:24,714 - [INFO] - hm_pyhelper.miner_param - (miner_param.py).run_gateway_mfr -- /opt/python-dependencies/hm_pyhelper/miner_param.py:(41) - gateway_mfr response stdout: b'{\n  "result": "fail",\n  "tests": [\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "serial"\n    },\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "zone_locked(data)"\n    },\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "zone_locked(config)"\n    },\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "slot_config(0..=15, ecc)"\n    },\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "key_config(0..=15, ecc)"\n    },\n    {\n      "output": "ecc608 error\\n\\nCaused by:\\n    timeout/retry error",\n      "result": "fail",\n      "test": "miner_key(0)"\n    },\n    {\n      "output": "ecc608 error\\n\\nCaused by:\\n    timeout/retry error",\n      "result": "fail",\n      "test": "sign(0)"\n    },\n    {\n      "output": "ecc608 error\\n\\nCaused by:\\n    timeout/retry error",\n      "result": "fail",\n      "test": "ecdh(0)"\n    }\n  ]\n}\n'
 diagnostics  INFO:hm_pyhelper.miner_param:gateway_mfr response stdout: b'{\n  "result": "fail",\n  "tests": [\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "serial"\n    },\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "zone_locked(data)"\n    },\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "zone_locked(config)"\n    },\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "slot_config(0..=15, ecc)"\n    },\n    {\n      "output": "timeout/retry error",\n      "result": "fail",\n      "test": "key_config(0..=15, ecc)"\n    },\n    {\n      "output": "ecc608 error\\n\\nCaused by:\\n    timeout/retry error",\n      "result": "fail",\n      "test": "miner_key(0)"\n    },\n    {\n      "output": "ecc608 error\\n\\nCaused by:\\n    timeout/retry error",\n      "result": "fail",\n      "test": "sign(0)"\n    },\n    {\n      "output": "ecc608 error\\n\\nCaused by:\\n    timeout/retry error",\n      "result": "fail",\n      "test": "ecdh(0)"\n    }\n  ]\n}\n'
 diagnostics  2021-12-04 01:06:24,715 - [INFO] - hm_pyhelper.miner_param - (miner_param.py).run_gateway_mfr -- /opt/python-dependencies/hm_pyhelper/miner_param.py:(43) - gateway_mfr response stderr: b''
 diagnostics  INFO:hm_pyhelper.miner_param:gateway_mfr response stderr: b''
 diagnostics  2021-12-04 01:06:24,751 - [ERROR] - hm_pyhelper.miner_param - (miner_param.py).run_gateway_mfr -- /opt/python-dependencies/hm_pyhelper/miner_param.py:(47) - gateway_mfr exited with a non-zero status
 diagnostics  Traceback (most recent call last):
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/miner_param.py", line 36, in run_gateway_mfr
 diagnostics      run_gateway_mfr_result = subprocess.run(
 diagnostics    File "/usr/local/lib/python3.10/subprocess.py", line 524, in run
 diagnostics      raise CalledProcessError(retcode, process.args,
 diagnostics  subprocess.CalledProcessError: Command '['/opt/python-dependencies/hm_pyhelper/gateway_mfr', 'provision']' returned non-zero exit status 1.
 diagnostics  ERROR:hm_pyhelper.miner_param:gateway_mfr exited with a non-zero status
 diagnostics  Traceback (most recent call last):
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/miner_param.py", line 36, in run_gateway_mfr
 diagnostics      run_gateway_mfr_result = subprocess.run(
 diagnostics    File "/usr/local/lib/python3.10/subprocess.py", line 524, in run
 diagnostics      raise CalledProcessError(retcode, process.args,
 diagnostics  subprocess.CalledProcessError: Command '['/opt/python-dependencies/hm_pyhelper/gateway_mfr', 'provision']' returned non-zero exit status 1.
 diagnostics  [2021-12-04 01:06:24 +0000] [8] [ERROR] Exception in worker process
 diagnostics  Traceback (most recent call last):
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/miner_param.py", line 36, in run_gateway_mfr
 diagnostics      run_gateway_mfr_result = subprocess.run(
 diagnostics    File "/usr/local/lib/python3.10/subprocess.py", line 524, in run
 diagnostics      raise CalledProcessError(retcode, process.args,
 diagnostics  subprocess.CalledProcessError: Command '['/opt/python-dependencies/hm_pyhelper/gateway_mfr', 'provision']' returned non-zero exit status 1.
 diagnostics  
 diagnostics  During handling of the above exception, another exception occurred:
 diagnostics  
 diagnostics  Traceback (most recent call last):
 diagnostics    File "/opt/python-dependencies/gunicorn/arbiter.py", line 589, in spawn_worker
 diagnostics      worker.init_process()
 diagnostics    File "/opt/python-dependencies/gunicorn/workers/base.py", line 134, in init_process
 diagnostics      self.load_wsgi()
 diagnostics    File "/opt/python-dependencies/gunicorn/workers/base.py", line 146, in load_wsgi
 diagnostics      self.wsgi = self.app.wsgi()
 diagnostics    File "/opt/python-dependencies/gunicorn/app/base.py", line 67, in wsgi
 diagnostics      self.callable = self.load()
 diagnostics    File "/opt/python-dependencies/gunicorn/app/wsgiapp.py", line 58, in load
 diagnostics      return self.load_wsgiapp()
 diagnostics    File "/opt/python-dependencies/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
 diagnostics      return util.import_app(self.app_uri)
 diagnostics    File "/opt/python-dependencies/gunicorn/util.py", line 359, in import_app
 diagnostics      mod = importlib.import_module(module)
 diagnostics    File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
 diagnostics      return _bootstrap._gcd_import(name[level:], package, level)
 diagnostics    File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
 diagnostics    File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
 diagnostics    File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
 diagnostics    File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
 diagnostics    File "<frozen importlib._bootstrap_external>", line 883, in exec_module
 diagnostics    File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
 diagnostics    File "/opt/python-dependencies/hw_diag/__init__.py", line 3, in <module>
 diagnostics      wsgi_app = get_app(__name__)
 diagnostics    File "/opt/python-dependencies/hw_diag/app.py", line 30, in get_app
 diagnostics      perform_key_provisioning()
 diagnostics    File "/opt/python-dependencies/decorator.py", line 232, in fun
 diagnostics      return caller(func, *(extras + args), **kw)
 diagnostics    File "/opt/python-dependencies/retry/api.py", line 73, in retry_decorator
 diagnostics      return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
 diagnostics    File "/opt/python-dependencies/retry/api.py", line 33, in __retry_internal
 diagnostics      return f()
 diagnostics    File "/opt/python-dependencies/hw_diag/app.py", line 24, in perform_key_provisioning
 diagnostics      if not provision_key():
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/miner_param.py", line 103, in provision_key
 diagnostics      gateway_mfr_result = run_gateway_mfr(["provision"])
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/lock_singleton.py", line 71, in wrapper_lock_ecc
 diagnostics      raise ex
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/lock_singleton.py", line 60, in wrapper_lock_ecc
 diagnostics      raise ex
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/lock_singleton.py", line 57, in wrapper_lock_ecc
 diagnostics      value = func(*args, **kwargs)
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/miner_param.py", line 48, in run_gateway_mfr
 diagnostics      raise ECCMalfunctionException(err_str).with_traceback(e.__traceback__)
 diagnostics    File "/opt/python-dependencies/hm_pyhelper/miner_param.py", line 36, in run_gateway_mfr
 diagnostics      run_gateway_mfr_result = subprocess.run(
 diagnostics    File "/usr/local/lib/python3.10/subprocess.py", line 524, in run
 diagnostics      raise CalledProcessError(retcode, process.args,
 diagnostics  hm_pyhelper.exceptions.ECCMalfunctionException: gateway_mfr exited with a non-zero status
 diagnostics  [2021-12-04 01:06:24 +0000] [8] [INFO] Worker exiting (pid: 8)
 diagnostics  [2021-12-04 01:06:25 +0000] [1] [INFO] Shutting down: Master
 diagnostics  [2021-12-04 01:06:25 +0000] [1] [INFO] Reason: Worker failed to boot.
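
Since the timeout case points at hardware rather than provisioning, a check like the following sketch (not production code) could flag it separately in diagnostics:

import json

def looks_like_loose_ecc(test_json):
    """True when every failed test reports a timeout/retry error, which in
    practice points at a loose ECC board / CM3 connection or a dead chip."""
    report = json.loads(test_json)
    failures = [t for t in report["tests"] if t["result"] == "fail"]
    return bool(failures) and all(
        "timeout/retry error" in t["output"] for t in failures
    )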
shawaj commented 2 years ago

this does work to rekey a miner fyi @vpetersson :

root@9869a5b:/opt/gateway_mfr# /opt/gateway_mfr/bin/gateway_mfr ecc test
+--------------------+------+
|        name        |result|
+--------------------+------+
|     serial_num     |  ok  |
|{zone_locked,config}|  ok  |
| {zone_locked,data} |  ok  |
|    slot_config     |  ok  |
|     key_config     |  ok  |
|     miner_key      |  ok  |
+--------------------+------+

root@9869a5b:/opt/gateway_mfr# /opt/gateway_mfr/bin/gateway_mfr ecc onboarding
112XcXrknVtvmQ3DMGcKdPqH9f8Yj3GswmJiUSR8YWNVkxfFNnsN
root@9869a5b:/opt/gateway_mfr# /opt/gateway_mfr/bin/gateway_mfr ecc provision_onboard
11m3KCgNMEqdcT2cHQL1qQPC3ZC5d8fJJsXu4h6v2Za8Av4UbHL
root@9869a5b:/opt/gateway_mfr# /opt/gateway_mfr/bin/gateway_mfr ecc test
+--------------------+------+
|        name        |result|
+--------------------+------+
|     serial_num     |  ok  |
|{zone_locked,config}|  ok  |
| {zone_locked,data} |  ok  |
|    slot_config     |  ok  |
|     key_config     |  ok  |
|     miner_key      |  ok  |
+--------------------+------+

root@9869a5b:/opt/gateway_mfr# /opt/gateway_mfr/bin/gateway_mfr ecc onboarding
11m3KCgNMEqdcT2cHQL1qQPC3ZC5d8fJJsXu4h6v2Za8Av4UbHL
root@9869a5b:/opt/gateway_mfr# 

You can also rekey a miner with the new gateway-mfr-rs using gateway_mfr key --generate <slot>.

shawaj commented 2 years ago

Re the "not a compact key" error... It seems this was a legacy issue with gateway-mfr-rs..

https://github.com/helium/gateway-mfr-rs/issues/8#issuecomment-986818080

ECC corruption seems very unlikely, but there definitely were older versions of this CLI that allowed non-compact keys.

kashifpk commented 2 years ago

Closing this issue, as this has now been running in manufacturing for a couple of days without problems.

vpetersson commented 2 years ago

this does work to rekey a miner fyi @vpetersson :

Good find. Ping @marvinmarnold / @robputt

shawaj commented 2 years ago

Will add a doc in the product-management repo so we don't lose this info.