Cacti / cacti

Cacti ™
http://www.cacti.net
GNU General Public License v2.0

Partial data source updates with Boost #2965

Open eschoeller opened 5 years ago

eschoeller commented 5 years ago

I decided to open a separate issue for this. We may have discussed this in the past, I can't recall.

I get several messages like this from boost:

ERROR: /cacti/cacti-1.2.6-prod/rra/AAA.rrd: expected 2 data source readings (got 1) from 1569174932
ERROR: /cacti/cacti-1.2.6-prod/rra/BBB.rrd: expected 2 data source readings (got 1) from 1569175163
ERROR: /cacti/cacti-1.2.6-prod/rra/CCC.rrd: expected 7 data source readings (got 4) from 1569175887

I believe these are messages from rrdtool, indicating that it could not update a row because one or more columns were missing. The fact that the poller collected only part of the data could be attributed to many things, and I'm always trying to improve the reliability of the data acquisition. But these messages seem to indicate that the entire row is discarded. Is that the case? I am curious what you think about in-filling the missing columns with NaNs. That way at least some of the good data can still be stored in the RRD file. I also understand this could lead to undesirable results; in some cases it may be better to have no data at all than partial data. So perhaps we could have an option that allows NaNs to be filled in, if the user chooses. I'd still throw a warning indicating that it has occurred. And perhaps this is more of a global consideration for Cacti, and not limited to just boost?
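To make the in-filling idea concrete: `rrdtool update` accepts the letter `U` in place of a value to mark it as unknown, so a partial row could be padded before it is written. The following is only a rough sketch of that idea, not how boost or the poller actually builds its updates. The function name `build_update_value`, the data source name `traffic_out`, and the sample reading are hypothetical and made up for illustration.

```python
def build_update_value(timestamp, ds_names, readings):
    """Build an rrdtool update value string, padding gaps with 'U'.

    ds_names: data source names in the order the RRD file defines them
              (hypothetical input; Cacti tracks this internally).
    readings: dict of name -> value for what the poller actually collected.
    Returns the value string and the list of names that were padded.
    """
    fields = [str(timestamp)]
    missing = []
    for name in ds_names:
        value = readings.get(name)
        if value is None:
            # 'U' is rrdtool's marker for an unknown value.
            fields.append("U")
            missing.append(name)
        else:
            fields.append(str(value))
    return ":".join(fields), missing


# Two data sources expected, only one collected.
value, missing = build_update_value(
    1569174932, ["traffic_in", "traffic_out"], {"traffic_in": 1234}
)
print(value)    # 1569174932:1234:U
print(missing)  # ['traffic_out']
```

The returned `missing` list is there so a warning can still be logged whenever a row was padded, which keeps the behavior visible if it were made optional.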

cigamit commented 5 years ago

Tell me a little more about these guys. Is it that you changed the data template, or has the collector just broken and not been able to collect some data?

1) What templates? 2) What data input method?

eschoeller commented 5 years ago

It hits a wide variety of templates: some SNMP gets, some scripts. I still have all of our Linux servers disabled since the boost issue was causing problems. A lot of them seem to come from the RFC1213 template (that's a tricky one, since some data sources have 25 columns). Most recently, though, the issue has shown up in some of the Server Technology scripts, APC InRow units, the Power Logic ION 7650 meter, and whatever creates an RRD like "a_active_XXX.rrd". I can get more specifics.

cigamit commented 5 years ago

You need to do better. Come up with a case and upload the template that was involved. I just want to ensure that there is not a second problem.

cigamit commented 5 years ago

Also, for the data sources involved, can you please perform a Data Debug of them and see if they come back clean?

eschoeller commented 5 years ago

Haven't forgotten about this. Here is a list of all the data source file names that have had this issue over the past several days:

active
apache_busy_workers
average_latency
bankcurrent1
branchcurrent1
branchcurrentutil1
cktcurrent1
cktpower1
cktpowerfactor1
cktva1
colorado_degreesf
elec_dist_eff
elec_dist_loss
errors_in
facility_pue
facility_pue_pdu
facility_pue_ups
facility_watts
groupcooldemand
hdd_free
infeedapparent3
infeedcrest3
infeedfactor3
infeedload3
infeedphasecurrent3
infeedphasevoltage3
infeedpower3
infeedvoltage3
iostat_rrqms
iostat_svctm
iostat_wkbs
ipreasmtimeout
linecurrent1
linecurrentutil1
mechanical_pue_pdu
mechanical_watts
moduleenergy1
modulepeakpower1
modulepf1
modulepower1
pcapacity
phaseactivepower1
phaseapparentpower1
phasecrestfactor1
phasecurrent1
phaseenergy1
phasepeakcurrent1
phasepower1
phasepowerfactor1
phasevoltage1
phasevoltagedev1
powerbb
snmpinnosuchnames
snmpinpkts
tcpretranssegs
temperature
total_inrow_watts
total_it_watts
total_it_watts_pdu
total_it_watts_ups
traffic_in
udpnoports
unitcooldemand
unitfanrun3
unitleavefluid
unitrackinlet3
unitthreshreturn
ups1_efficiency
upsc_eff

You'll recognize some of them. Lots of them are from templates I have written. The next step is to manually cross-reference those names to the data source templates they belong to.
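For anyone wanting to build a similar list, the affected RRD files could be pulled out of the log with something along these lines. This is a minimal sketch that assumes the error lines look exactly like the three samples quoted at the top of this issue; the function name `summarize_partial_updates` is made up, and mapping an RRD path back to a data source name or template is not covered here.

```python
import re
from collections import Counter

# Pattern derived from the sample boost/rrdtool error lines in this issue.
PARTIAL_UPDATE = re.compile(
    r"ERROR: (?P<rrd>\S+\.rrd): expected (?P<expected>\d+) "
    r"data source readings \(got (?P<got>\d+)\) from (?P<ts>\d+)"
)


def summarize_partial_updates(lines):
    """Count partial-update errors per RRD file path."""
    counts = Counter()
    for line in lines:
        match = PARTIAL_UPDATE.search(line)
        if match:
            counts[match.group("rrd")] += 1
    return counts


sample = [
    "ERROR: /cacti/cacti-1.2.6-prod/rra/AAA.rrd: expected 2 data source readings (got 1) from 1569174932",
    "ERROR: /cacti/cacti-1.2.6-prod/rra/CCC.rrd: expected 7 data source readings (got 4) from 1569175887",
    "some unrelated log line",
]

for rrd, hits in summarize_partial_updates(sample).most_common():
    print(rrd, hits)
```

Counting per file makes it easier to see whether a handful of data sources account for most of the errors before cross-referencing them by hand.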

cigamit commented 4 years ago

Man, I forgot what the remaining cases were; I've been away for so long. Will look for your emails.

eschoeller commented 4 years ago

Yeah, me too.

TheWitness commented 1 year ago

I think this one should be resolved now. Let me bump it up in priority. @eschoeller if you are still out there, if you have a test system, upgrade to the latest 1.2.x branch and see if you can reproduce. The poller is now smart enough to work around updates that might be happening at the same time as the poller is running (as of today).

eschoeller commented 1 year ago

Oh, that's interesting. So you think this could have been a race condition of some sort, then? Yes, I am still out here. I don't spend as much time with Cacti as I used to, but we still run it and rely on it very heavily in some parts of our organization. I'll eventually be upgrading. I have a few other open items that are still spewing out errors, but for now the majority of graphs are working, so things are OK :)

TheWitness commented 1 year ago

@eschoeller, we are releasing in December, so it's pretty safe to take the plunge now. There will be more changes, but nothing too drastic at this point.