Danielv123 / serverManager

IPMI server manager build for Dell 12th gen servers
56 stars 10 forks

R210 II and iDRAC6 2.90 #2

Open chrismast opened 3 years ago

chrismast commented 3 years ago

Hi, I just tried your iDRAC Server Monitor and somehow cannot get it to work (anymore).

Background: Server: Dell R210 II. iDRAC: v6, firmware version 2.90 (build 04), with Express and Enterprise modules installed.

I had it running with iDRAC firmware version 1.90 a while ago, showing temp and 3x fan speed, whereas after upgrading the iDRAC version to the latest 2.90 firmware it somehow does not show anything anymore. I did not change any iDRAC settings in between.

I can run sudo ipmitool -I lanplus -C 3 -H IP -U USER -P PW sensor and it shows me the different sensor outputs (i.e. temp and 3x fan speeds), executing the command from the same VM I have the serverManager installed.

Any advice on what it could be that stops it from working?

The following is the output of sudo ipmitool -I lanplus -C 3 -H IP -U USER -P PW sdr:

Temp | disabled | ns
Ambient Temp | 31 degrees C | ok
Planar Temp | disabled | ns
CMOS Battery | 0x00 | ok
ROMB Battery | Not Readable | ns
VCORE | 0x00 | ok
0.75 CPU VTT PG | 0x00 | ok
1.8V PG | 0x00 | ok
3.3V PG | 0x00 | ok
PSU PG | 0x00 | ok
5V Riser1 PG | 0x00 | ok
MEM CPU FAIL | 0x00 | ok
VTT CPU FAIL | 0x00 | ok
1.8 PLL CPU PG | 0x00 | ok
1.2 LOM FAIL | 0x00 | ok
1.05V PG | 0x00 | ok
1.2 AUX FAIL | 0x00 | ok
Heatsink Pres | 0x00 | ok
iDRAC6 Ent PRES | 0x00 | ok
USB Cable Pres | 0x00 | ok
Riser1 Pres | 0x00 | ok
FAN 1 RPM | 6720 RPM | ok
FAN 2 RPM | 6720 RPM | ok
FAN 3 RPM | 6720 RPM | ok
PFault Fail Safe | Not Readable | ns
Presence | 0x00 | ok
Status | 0x00 | ok
Status | 0x00 | ok
OS Watchdog | 0x00 | ok
SEL | Not Readable | ns
Intrusion | 0x00 | ok
CPU Temp Interf | Not Readable | ns
iDRAC6 Upgrade | Not Readable | ns
vFlash | 0x00 | ok
DKM Status | 0x00 | ok
ECC Corr Err | Not Readable | ns
ECC Uncorr Err | Not Readable | ns
I/O Channel Chk | Not Readable | ns
PCI Parity Err | Not Readable | ns
PCI System Err | Not Readable | ns
SBE Log Disabled | Not Readable | ns
Logging Disabled | Not Readable | ns
Unknown | Not Readable | ns
CPU Protocol Err | Not Readable | ns
CPU Bus PERR | Not Readable | ns
CPU Init Err | Not Readable | ns
CPU Machine Chk | Not Readable | ns
Memory Spared | Not Readable | ns
Memory Mirrored | Not Readable | ns
Memory RAID | Not Readable | ns
Memory Added | Not Readable | ns
Memory Removed | Not Readable | ns
Memory Cfg Err | Not Readable | ns
Mem Redun Gain | Not Readable | ns
PCIE Fatal Err | Not Readable | ns
Chipset Err | Not Readable | ns
Err Reg Pointer | Not Readable | ns
Mem ECC Warning | Not Readable | ns
Mem CRC Err | Not Readable | ns
USB Over-current | Not Readable | ns
POST Err | Not Readable | ns
Hdwr version err | Not Readable | ns
Mem Overtemp | Not Readable | ns
Mem Fatal SB CRC | Not Readable | ns
Mem Fatal NB CRC | Not Readable | ns
OS Watchdog Time | Not Readable | ns
Non Fatal PCI Er | Not Readable | ns
Fatal IO Error | Not Readable | ns
MSR Info Log | Not Readable | ns
Temp | disabled | ns

Danielv123 commented 3 years ago

I see you use -C 3 while I don't in the tool. I assume that would be the difference? Do you get any output running without -C 3?

--help describes it as

-C ciphersuite Cipher suite to be used by lanplus interface

If that's the cause, I will add an option for it.
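A minimal sketch of how an optional cipher suite setting could be passed through to ipmitool. The helper name `buildIpmiArgs` and the option names (`host`, `user`, `password`, `cipherSuite`) are assumptions for illustration, not the project's actual code or config fields; only the ipmitool flags themselves come from the command quoted above.

```javascript
// Hypothetical helper: builds the argument list for an ipmitool call.
// Option names here are illustrative, not taken from serverManager.
function buildIpmiArgs({ host, user, password, cipherSuite }, command) {
  const args = ["-I", "lanplus"];
  // Only pass -C when a cipher suite was explicitly configured,
  // so existing setups keep ipmitool's default behaviour.
  if (cipherSuite !== undefined && cipherSuite !== null) {
    args.push("-C", String(cipherSuite));
  }
  args.push("-H", host, "-U", user, "-P", password, ...command);
  return args;
}

console.log(
  buildIpmiArgs(
    { host: "IP", user: "USER", password: "PW", cipherSuite: 3 },
    ["sensor"]
  ).join(" ")
);
// prints "-I lanplus -C 3 -H IP -U USER -P PW sensor"
```

Returning an array rather than a single string keeps the arguments safe to hand to `child_process.execFile` without shell quoting.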

chrismast commented 3 years ago

Thanks for your reply Daniel! I just tried running the same command without -C 3 and it gives me the same output, hence both work. The serverMonitor still gives me no data, only shows the Power/Temp/Fan lines but no values below. I also tried to destroy my container, deleted the server.json and recreated it, same result :-(.

Danielv123 commented 3 years ago

Sorry for not following up on this, but I haven't had much time. I'd like to figure out what the issue is, but I can't find a way to reproduce locally. Would it be possible to get access to the docker container so I could add some logging and see what I find?

chrismast commented 3 years ago

Hi Daniel, no worries, it's not super high priority; I'm just curious why it's not working as well. How would I open the container up for your access? At the moment my system is quite locked down and only a few instances are exposed via VPN/reverse proxy. Would it be possible to give me some instructions on how to enable further logging? (I am quite familiar with docker and exec.)

Danielv123 commented 3 years ago

In the code there is a separate file for the iDRAC calls. I suspect they are getting bad data/some error. Could you add logging to those, specifically the one pulling sensor data, and see what it retrieves?

chrismast commented 3 years ago

Would you be able to guide me on how to achieve the above? I am versed in how to use docker and how to get into a docker container and change files, but I am not a full-on programmer :-). By the way, I tried to run the docker image on another LXC with the same result.

Danielv123 commented 3 years ago

Open backend/src/ipmi.js. The code should be quite clear; getSensors is the most interesting function. Use console.log(variable) to log a variable. After saving and restarting the container, it should log every ~30 seconds.
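As a rough illustration of the kind of logging meant here, the sketch below parses `ipmitool ... sensor` style output into rows and logs them. `parseSensorOutput` is a hypothetical stand-in; the real getSensors in backend/src/ipmi.js may be structured differently. The two sample lines are assembled from sensor values reported in this thread, with the column spacing made up.

```javascript
// Hypothetical parser: splits pipe-separated ipmitool sensor output into
// one array of trimmed fields per sensor. Not the project's actual code.
function parseSensorOutput(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => line.split("|").map((field) => field.trim()));
}

// Sample input; values taken from this thread, spacing is illustrative.
const sample = [
  "Ambient Temp | 30.000   | degrees C | ok | na    | 3.000    | 8.000 | 42.000 | 47.000 | na",
  "FAN 1 RPM    | 5400.000 | RPM       | ok | 0.000 | 1800.000 | na    | na     | na     | 30600.000",
].join("\n");

const rows = parseSensorOutput(sample);
// Logging the parsed rows shows exactly what the backend has to work with.
console.log(rows);
```

If the logged rows look complete, the problem is more likely in how the data is stored or displayed than in the ipmitool call itself.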


chrismast commented 3 years ago

Thanks! I think I got it down, below the log output. Looks like it pulls the data but then maybe does not display?

dscada-backend@1.0.0 start /usr/src/app
node src/index.js

[ '/usr/local/bin/node', '/usr/src/app/src/index.js' ]
listening on port 8080
Loaded server data from disk
client is subscribing to timer with interval
(node:25) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'length' of undefined
    at /usr/src/app/src/index.js:123:67
    at Array.map (<anonymous>)
    at updateServers (/usr/src/app/src/index.js:122:18)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async updateServerLoop (/usr/src/app/src/index.js:164:2)
(Use node --trace-warnings ... to show where the warning was created)
(node:25) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:25) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
[
  [ 'Ambient Temp', '30.000', 'degrees C', 'ok', 'na', '3.000', '8.000', '42.000', '47.000', 'na' ],
  [ 'CMOS Battery', '0x0', 'discrete', '0x0080', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'VCORE', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ '0.75 CPU VTT PG', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ '1.8V PG', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ '3.3V PG', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'PSU PG', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ '5V Riser1 PG', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'MEM CPU FAIL', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'VTT CPU FAIL', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ '1.8 PLL CPU PG', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ '1.2 LOM FAIL', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ '1.05V PG', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ '1.2 AUX FAIL', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'Heatsink Pres', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'iDRAC6 Ent PRES', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'USB Cable Pres', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'Riser1 Pres', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'FAN 1 RPM', '5400.000', 'RPM', 'ok', '0.000', '1800.000', 'na', 'na', 'na', '30600.000' ],
  [ 'FAN 2 RPM', '5400.000', 'RPM', 'ok', '0.000', '1800.000', 'na', 'na', 'na', '30600.000' ],
  [ 'FAN 3 RPM', '5280.000', 'RPM', 'ok', '0.000', '1800.000', 'na', 'na', 'na', '30600.000' ],
  [ 'Presence', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'Status', '0x0', 'discrete', '0x8080', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'Status', '0x0', 'discrete', '0x0180', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'OS Watchdog', '0x0', 'discrete', '0x0080', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'Intrusion', '0x0', 'discrete', '0x0080', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'vFlash', '0x0', 'discrete', '0x0080', 'na', 'na', 'na', 'na', 'na', 'na' ],
  [ 'DKM Status', '0x0', 'discrete', '0x0080', 'na', 'na', 'na', 'na', 'na', 'na' ]
]

chrismast commented 3 years ago

Now another strange thing happened: the server monitor suddenly shows temp and fan speeds. Could it be connected to what the server is named? (I.e. before I called it "Dell R210 II", whereas now it is somehow only called "Dell".)

image

Danielv123 commented 3 years ago

Weird, looks like there might be something funny with my config handling. Also, it looks like you are missing a bunch of sensors, right? Anyway, the next place the data is interpreted is in index.js, lines 92–111.

After that you can see the broadcast values in the Chrome console network tab.

The way I pick the values to display is by looking at the “Unit” field. I display “degrees C”, “Watts” and “RPM”.
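The selection by unit described here could look roughly like the sketch below. It assumes each sensor is an array of fields with the unit at index 2, as in the rows logged earlier in this thread; `pickDisplayRows` and `DISPLAY_UNITS` are hypothetical names, and the project's real data shape may differ.

```javascript
// Units that are selected for display, per the description above.
const DISPLAY_UNITS = new Set(["degrees C", "Watts", "RPM"]);

// Assumption: row[2] holds the unit, matching the logged sensor rows.
function pickDisplayRows(sensorRows) {
  return sensorRows.filter((row) => DISPLAY_UNITS.has(row[2]));
}

const logged = [
  ["Ambient Temp", "30.000", "degrees C", "ok"],
  ["CMOS Battery", "0x0", "discrete", "0x0080"],
  ["FAN 1 RPM", "5400.000", "RPM", "ok"],
];

console.log(pickDisplayRows(logged).map((row) => row[0]));
// prints [ 'Ambient Temp', 'FAN 1 RPM' ]
```

Discrete sensors such as the power-good and presence flags carry the unit "discrete" and are therefore dropped by this filter, which matches only temperature and fan values being shown for the R210 II.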


ihatemyisp commented 2 years ago

> Weird, looks like there might be something funny with my config handling. Also, it looks like you are missing a bunch of sensors, right? Anyway, the next place the data is interpreted is in index.js, lines 92–111. After that you can see the broadcast values in the Chrome console network tab. The way I pick the values to display is by looking at the “Unit” field. I display “degrees C”, “Watts” and “RPM”.

I have managed to get this working with my R210II. After what I'd call extensive testing, it came down to the loop timer causing me issues adding/updating servers.

Couple of notes:

If I start from scratch, update the default "R720 main" to my R210II's details, and delete the other default, it will randomly display one cycle worth of information from my R210II but will then hang (it does not show the previous value). If I inspect servers.json, oddly both sensordataRaw and sensorData are populated. Here is the log.

If I start from scratch, remove both default servers and add my R210, it will hang without ever displaying any data. In this test case, inspecting servers.json reveals that the sensordataRaw and sensordata array objects are missing completely. Here is the log and servers.json.

My third time through I updated the default servers to add my 12th gens (620 and 720) and added a new server with my R210 details. While I was adding my R210II, it hung (similar last log entry as below). The servers.json had all three servers, but like scenario 2, my R210 entry was missing the sensordataRaw and sensordata array objects. Initially I just restarted the container and it hung (log). I stopped the container, opened servers.json, added the missing array objects (I left them empty), and restarted the container. Everything is now working as expected.

All of that said, I believe the issue comes down to the fact that the updateServers function timer starts immediately on container start and runs continuously from there. If you go to add/update a server and the timer fires the function, things sometimes get hosed. If you remove all the servers and it fires, things sometimes get hosed. I believe the best solution would be the following:

  1. Remove one of the hardcoded default servers. Only one is really necessary for showing an example.
  2. On first start, check for servers.json prior to starting the timer. If not present, default to showing the admin tab and do not start the timer. If present, continue to step 3.
  3. Do a quick sanity check that sensordata and sensordataRaw array objects exist (can be empty) for each server object prior to starting the timer. If it doesn't pass, display the admin tab (see 3 and 4 below). If everything is good, start the timer and continue.
  4. When the admin tab is clicked, or being displayed, pause the timer. Closing it (re)starts the timer.
  5. When adding/updating a server, add it to servers.json with all required objects, including sensordata and sensordataRaw (empty).
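The sanity check from the list above can be sketched as follows. This is only an illustration under assumptions: the field names `sensordata` and `sensordataRaw` follow the wording in this comment (the exact casing in servers.json may differ), and `ensureSensorArrays` and `isReadyForPolling` are hypothetical helpers, not functions from the project.

```javascript
// Hypothetical: fill in missing sensor arrays so later code can safely
// read .length on them. Returns new objects; the input is not mutated.
function ensureSensorArrays(servers) {
  return servers.map((server) => ({
    ...server,
    sensordata: Array.isArray(server.sensordata) ? server.sensordata : [],
    sensordataRaw: Array.isArray(server.sensordataRaw) ? server.sensordataRaw : [],
  }));
}

// Hypothetical gate to run before starting the update timer.
function isReadyForPolling(servers) {
  return (
    Array.isArray(servers) &&
    servers.length > 0 &&
    servers.every(
      (s) => Array.isArray(s.sensordata) && Array.isArray(s.sensordataRaw)
    )
  );
}

// An entry missing both arrays, as described in the second scenario.
const loaded = [{ name: "R210 II" }];
console.log(isReadyForPolling(loaded)); // prints false
console.log(isReadyForPolling(ensureSensorArrays(loaded))); // prints true
```

Normalising entries when they are added (step 5) and gating the timer on the check (step 3) would both guard against the "Cannot read property 'length' of undefined" error, if a missing array is indeed what triggers it.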

In theory, manually creating servers.json with your servers prior to starting the container the first time should let things work from the start. I haven't tested this, though.

Unfortunately, I don't know enough JS to really make any of these changes. Much like yourself, I also don't have a lot of free time to learn JS and then make the changes. Hopefully you or someone else might be able to, though.

Danielv123 commented 2 years ago

Thanks for the good notes. Narrowing it down to the update logic means it should be possible to fix without finding a reliable way to reproduce. Now I just need to find the time :)

ihatemyisp commented 2 years ago

> Thanks for good notes. Narrowing it down to the update logic means it should be possible to fix without finding a reliable way to reproduce. Now I just need to find the time :)

Thank you for making this. As far as I can tell, it's been running great since I last launched it.

If you need assistance with testing or what not let me know.

ihatemyisp commented 2 years ago

I'll also note that iDRAC6 is extremely slow to report sensor data in comparison to iDRAC7 or later. The timer fires near immediately after it completes in my case with 3 servers (R620, R720, and R210II).

Perhaps resetting/restarting the timer after it completes so that there is a set delay regardless of how long it takes for the updateServers function to complete?
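One way to get a fixed pause regardless of how long an update takes is to wait after each run completes instead of firing on a fixed interval. The sketch below assumes an async update function; `pollLoop`, `sleep`, `updateFn`, `delayMs` and `shouldStop` are hypothetical names and not part of the project.

```javascript
// Resolve after ms milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical loop: run the update, then wait delayMs, then repeat.
// The delay starts only once the update has finished, so a slow iDRAC6
// lengthens the cycle instead of causing back-to-back runs.
async function pollLoop(updateFn, delayMs, shouldStop) {
  while (!shouldStop()) {
    try {
      await updateFn();
    } catch (err) {
      // A failed update is logged and the loop keeps going.
      console.error("update failed:", err.message);
    }
    await sleep(delayMs);
  }
}

// Usage: three fake updates with a 10 ms pause after each one.
let runs = 0;
pollLoop(async () => { runs += 1; }, 10, () => runs >= 3)
  .then(() => console.log("runs:", runs)); // prints "runs: 3"
```

The `shouldStop` callback is also a natural hook for pausing updates while the admin tab is open, as suggested earlier in the thread.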

chrismast commented 2 years ago

In case you guys need further testers, I'm happy to help as well! I run 4x R210 IIs in my homelab, and though I'm not currently using serverManager anymore, I can spin it up easily. I integrated all my servers into openHAB (my guide is in the OH Forum), including temp and fan speed reporting as well as manual and auto fan control.