gebn / bmc

A pure Go IPMI v2.0 remote console.
GNU Lesser General Public License v3.0
78 stars 20 forks source link

Unable to connect to a new BMC #62

Closed lukeyeager closed 7 months ago

lukeyeager commented 9 months ago

I suspect something similar to https://github.com/gebn/bmc/issues/56 is happening - I'm trying to connect to a new BMC and am unable to connect using this library (but I can connect with ipmitool). I suspect they started using some less common feature of IPMI which is not yet supported by this library.

The symptom is timeouts, unfortunately.

$ go run ./cmd/describe/... myhost --username ${USER} --password "${PASSWORD}"
2023/12/05 11:35:20 connected to 192.168.XX.YY:623 over IPMI v2.0
2023/12/05 11:35:30 failed to get presence pong capabilities: read udp 192.168.YY.ZZ:57472->192.168.XX.YY:623: i/o timeout
2023/12/05 11:35:30 failed to get channel auth capabilities: context deadline exceeded
2023/12/05 11:35:30 failed to get system GUID: context deadline exceeded
2023/12/05 11:35:30 context deadline exceeded
exit status 1

This works fine: ipmitool -N2 -R1 -I lanplus -L user -U "${USER}" -P "${PASSWORD}" -H myhost -C 17 sdr list.

My custom prometheus exporter times out trying to download the SDR repo:

2023/12/04 21:27:23 myhost: download SDR repo error: context deadline exceeded

The sequence of calls is similar to cmd/describe/: bmc.DialV2(), transport.NewV2Session(), and then bmc.RetrieveSDRRepository which times out.

fialhopm commented 9 months ago

bmc.RetrieveSDRRepository times out because it repeatedly fails to decode the AES-128-CBC confidentiality layer of a subset of the getSDRCmd responses. In those cases, it's returning an error here because the number of confidentiality pad bytes is 16.

Even though the documentation states:

don't reveal the difference between "decryption failed" and "invalid padding" errors

I suspect that this is a true "invalid padding" error because:

  1. The values of the 16 pad bytes pass the check defined in table 13-29 of the spec. In other words, the value of the i-th pad byte is i (including the 16-th byte).
  2. bmc.RetrieveSDRRepository does not return an error if we update AES128CBC.DecodeFromBytes to allow padBytes to be <= 16. Instead, it returns an empty SDRRepository, but that's because the getSDRCmd responses only contain ipmi.LayerTypeSDR and not ipmi.LayerTypeFullSensorRecord.
gebn commented 8 months ago

Thanks for digging into this!

Failing presence pong and channel auth capabilities is fairly fundamental. This is prior to session establishment and so any kind of encryption. It should be possible to do a direct packet diff between ipmitool and this library for those messages to see what's going on.

The RetrieveSDRRepository() issue sounds like a combination of a non-compliant 16-byte pad (max is 15 for AES), and the SDR repo not containing any FSRs, which is the only variant we currently support. Printing the SDR.Type field will show what each record is, and from there a decoder could be written, assuming it's not OEM-defined. ipmitool can probably also provide type information.

Even though the documentation states:

Note the bit before - "care should be taken when displaying this": the library returns the granular error, but it ideally should be generalised before being bubbled up to anyone/anything else. Frankly IPMI security is poor enough that this isn't worth worrying about.

lukeyeager commented 8 months ago

I realized that ipmitool can't do DCMI ping/pong discovery on these nodes, either:

# old BMC, works
$ ipmitool -N2 -R1 -I lanplus -L user -U "${user}" -E -H oldbmc -C 17 dcmi oob_discover -v
Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20
Error response 0xc1 from Get PICMG Properities
Running Get VSO Capabilities my_addr 0x20, transit 0, target 0x20
Invalid completion code received: Invalid command
Discovered IPMB address 0x0
Received IPMI/RMCP response packet: IPMI Supported

# new BMC, fails
$ ipmitool -N2 -R1 -I lanplus -L user -U "${user}" -E -H newbmc -C 17 dcmi oob_discover -v
Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20
Error response 0xc1 from Get PICMG Properities
Running Get VSO Capabilities my_addr 0x20, transit 0, target 0x20
Invalid completion code received: Invalid command
Discovered IPMB address 0x0

So, that was a red herring. We don't need ping/pong to work for our exporter.

lukeyeager commented 8 months ago

I hacked through cmd/describe/ to see which commands which work and which don't.

OK   bmc.Dial
FAIL presencePing
OK   machine.GetChannelAuthenticationCapabilities
FAIL machine.GetSystemGUID
OK   machine.NewSession
OK   sess.GetDeviceID
OK   sess.GetChassisStatus
FAIL bmc.RetrieveSDRRepository
FAIL getDCMICaps
OK   dcmi.GetSensorInfo

When I comment everything out from cmd/describe/ which doesn't work on the new BMC and compare the resulting output between the old and new BMCs, I notice some differences. Here is the sanitized diff:

--- oldbmc.txt  2024-01-03 08:53:38.319268000 -0800
+++ newbmc.txt  2024-01-03 08:53:45.588590000 -0800
@@ -1 +1 @@
-2024/01/03 08:53:35 connected to 192.168.XX.YY:623 over IPMI v2.0
+2024/01/03 08:53:45 connected to 192.168.XX.ZZ:623 over IPMI v2.0
@@ -7 +7 @@
-       Per-message auth:   true
+       Per-message auth:   false
@@ -10 +10 @@
-       Null usernames:     true
+       Null usernames:     false
@@ -14,4 +14,4 @@
-       ID:                 XX
-       Revision:           XX
-       Manufacturer:       XX
-       Product:            XX
+       ID:                 YY
+       Revision:           YY
+       Manufacturer:       YY
+       Product:            YY
@@ -20 +20 @@
-       Firmware (aux):     XX
+       Firmware (aux):     YY
@@ -25 +25 @@
-       Identification:     0(Off)
+       Identification:     3(Unknown)
@@ -30 +30 @@
-2024/01/03 08:53:38 failed to get DCMI sensor info: read udp 192.168.AA.BB:44792->192.168.XX.YY:623: i/o timeout
+DCMI Sensors:

I wonder if that "per-message auth" field is relevant?

fialhopm commented 8 months ago

The RetrieveSDRRepository() issue sounds like a combination of a non-compliant 16-byte pad (max is 15 for AES), and the SDR repo not containing any FSRs, which is the only variant we currently support. Printing the SDR.Type field will show what each record is, and from there a decoder could be written, assuming it's not OEM-defined. ipmitool can probably also provide type information.

Thank you for these pointers! The issue turned out to be threefold:

  1. The 16-byte pad. Would it be reasonable to allow this in AES128CBC.DecodeFromBytes? The equivalent ipmitool code doesn't seem to enforce a max of 15. I also noticed this comment, which claims that the openssl library "adds a whole block of padding if the input data is perfectly aligned".

  2. The use of StringEncodingUnicode by the BMC, which caused FSR decoding to fail.

  3. Receiving a GetSDRRepositoryInfo response after sending a GetSDR request. This occurs when the initial GetSDRRepositoryInfo times out, but succeeds on retry. Shortly after, while iterating through the SDRs, we get the GetSDRRepositoryInfo response again (I determined this by printing the decrypted bytes), which gets parsed into a GetSDRRsp with an invalid next record ID. I solved this by doubling the transport timeout, which prevents the initial GetSDRRepositoryInfo timeout. Would it be ok to make this value configurable? Additionally, would it be reasonable to add a unique sequence number to each request within a V2Session (here) and then assert on it when decoding the response?

I'd be happy to work on a PR for any or all of these changes.

gebn commented 8 months ago

Nice work! A PR would be fantastic.

  1. Let's only allow a 16th pad byte and not accept all the way up to 0xff. AES in CBC mode normally requires a minimum of one pad byte, which will be why OpenSSL adds a full block of padding to already-aligned messages. Unfortunately, IPMI decided to approach AES in its own way, probably under the guise of efficiency, hence the special case.
  2. I'd hoped this would never come up. The spec is ambiguous, so anything we do may start failing silently for others. How has this vendor implemented "unicode" here? Hopefully it's something like UTF-8.
  3. Customisable timeouts and correct use of sequence numbers sounds sensible to me; I'm not sure why I didn't implement the latter from the get-go.
fialhopm commented 8 months ago

The spec is ambiguous, so anything we do may start failing silently for others. How has this vendor implemented "unicode" here? Hopefully it's something like UTF-8.

Apologies for the delay. That makes sense, and it is UTF-8. Strangely, running ipmitool sdr list full does print the sensor ID strings. This seems to be the only place where it inspects bits 7:6 of "ID String Type/Length Code", and it does return a hex dump as noted here. However, this is probably only used to decode FRU Device Locator records. For Full Sensor Records (FSRs), the tool appears to ignore the encoding. Given this, would it be ok to add a UTF-8 decoder for StringEncodingUnicode? And perhaps log a warning?

Separately, I ran into an issue where the BMC returns a malformed SDR that causes RetrieveSDRRepository to fail due to a decoding error. This SDR is not an FSR, so I worked around this by changing walkSDRs to only read the SDR's header at first, and then only request the rest of the SDR if its type is FSR. This allows us to skip the bad SDR by getting its header's next record ID without having to decode the rest. It required implementing ReserveSDRRepositoryCmd. Could I submit a PR for this change too?

gebn commented 8 months ago

ipmi_sdr_read_sensor_value() appears to treat the bytes more like ASCII than UTF-8 (num bytes == num chars). How about we follow OpenIPMI and decode StringEncodingUnicode as StringEncoding8BitAsciiLatin1? Some users of the library might decode lots of these strings, so I'm reluctant to log a warning. I think it's fine to just make the ASCII assumption.

For the walkSDRs() issue, this should be the current behaviour! The current implementation should not attempt to parse the SDR if the RecordType is unrecognised (i.e. anything other than FSR currently), so we should skip non-FSR records as you describe. How does parsing just the header, which is already what we are branching on, solve the problem?

fialhopm commented 8 months ago

How about we follow OpenIPMI and decode StringEncodingUnicode as StringEncoding8BitAsciiLatin1?

That sounds great.

How does parsing just the header, which is already what we are branching on, solve the problem?

Thanks for pointing this out! I was wrong when I said that the error occurred when decoding "the rest" of the SDR. The decoding error happens here. But, if I change GetSDRReq to only request the first 5 bytes, I get a response with valid Get SDR Response and SDR Header layers. I tried this because I noticed that ipmitool handles it similarly: it requests and decodes the header, then requests the next 32 bytes but fails to decode them, then skips to the next SDR (here).

gebn commented 8 months ago

The decoding error happens here.

Strange - the Get SDR response should always contain the next RecordID, even if the requested SDR is too long for the BMC to send in one go. Can I confirm the BMC is returning a normal completion code (0x0), and not 0xca or 0xff?

It sounds like we don't have much choice but to do a two-part read, either always, or in response to this kind of non-compliant behaviour. As much as I dislike the idea of doing it all the time and sending twice as many commands to read an SDR Repo, I think it is the most pragmatic solution. This should not necessitate implementing Reserve SDR Repository, however it may make sense to add at the same time, especially if your BMC's SDRs are long enough to require it.

fialhopm commented 8 months ago

Can I confirm the BMC is returning a normal completion code (0x0), and not 0xca or 0xff?

Good question. I hadn't noticed this before, but it's returning 0xcc ("Invalid data field in Request"). I confirmed that I get the same code using ipmitool.

This should not necessitate implementing Reserve SDR Repository, however it may make sense to add at the same time, especially if your BMC's SDRs are long enough to require it

The SDRs are not long enough to require partial reads, but the reservation ID is needed if we skip the header when requesting the rest of the SDR (i.e., if we set Offset to 0x05 in the second request). Do you think this approach makes sense? It seems that it could come in handy if there's ever a use case for partial reads.

gebn commented 8 months ago

it's returning 0xcc ("Invalid data field in Request")

Assuming you have a support contract, it would be worth asking the vendor what they are expecting, as we are sending a valid request by the v2.0 spec. Does IPMItool give the same if forced to do v1.5?

the reservation ID is needed if we skip the header when requesting the rest of the SDR (i.e., if we set Offset to 0x05 in the second request)

That's valid, my thinking was it's cheaper to re-read the first 5 bytes than do two more RTTs for the reservation commands!

fialhopm commented 8 months ago

it would be worth asking the vendor what they are expecting

That makes sense. I reached out to the vendor.

Does IPMItool give the same if forced to do v1.5?

I haven't been able to get a v1.5 command to work. I tried ipmitool -N2 -R1 -I lan -L user -U "${USER}" -A PASSWORD -P "${PASSWORD}" -H myhost sdr list as well as all other types of authentication (NONE, MD3, MD5, and OEM), but they all fail with:

Authentication type <authentication_type> not supported
Error: Unable to establish LAN session
Error: Unable to establish IPMI v1.5 / RMCP session
lukeyeager commented 7 months ago

Between https://github.com/gebn/bmc/pull/63, https://github.com/gebn/bmc/pull/64, https://github.com/gebn/bmc/pull/65, and https://github.com/gebn/bmc/pull/66, we can now connect to the BMC.