IBM / spectrum-protect-sppmon

Monitoring and long-term reporting for IBM Spectrum Protect Plus. Provides a data bridge from SPP to InfluxDB and visualization dashboards via Grafana.
Apache License 2.0

Incomplete records in the vmBackupSummary table #9

Closed: baslerj closed this issue 3 years ago

baslerj commented 4 years ago

Far fewer measurements are being recorded in vmBackupSummary than expected: I expect several thousand per day, and the table contains far fewer than that. I suspect this may turn out to be a problem similar to the one discussed in issue #6.

baslerj commented 4 years ago

@nkrs44 I made a change to resolve this in branch _issue9vmbackupSummary. The change moves the name field to tags and will certainly lead to the same questions being explored in #8. There is still a TODO to factor in any changes needed for managing the continuous queries. When I have some more time I can see how you handled this for #7 and try to replicate it.
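
For context on why moving name to tags matters: InfluxDB identifies a point by measurement, tag set, and timestamp; field values do not disambiguate points. A minimal, self-contained illustration using the influxdb Python client (host, database name, and the values below are made up for the example, not sppmon code):

    # Illustration only: how placing 'name' as a tag vs. a field changes what
    # InfluxDB considers a duplicate point. Connection details are assumptions.
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="spp")

    # With 'name' as a tag, the two points belong to different series,
    # so both are kept even though the timestamps are identical.
    client.write_points([
        {"measurement": "vmBackupSummary",
         "tags": {"messageId": "CTGGA2384", "name": "vm_a"},
         "fields": {"transferredBytes": 100}, "time": 1601742781},
        {"measurement": "vmBackupSummary",
         "tags": {"messageId": "CTGGA2384", "name": "vm_b"},
         "fields": {"transferredBytes": 200}, "time": 1601742781},
    ], time_precision="s")

    # With 'name' as a field, measurement + tags + timestamp are identical,
    # so the second write silently overwrites the first.
    client.write_points([
        {"measurement": "vmBackupSummary",
         "tags": {"messageId": "CTGGA2384"},
         "fields": {"transferredBytes": 100, "name": "vm_a"}, "time": 1601742781},
        {"measurement": "vmBackupSummary",
         "tags": {"messageId": "CTGGA2384"},
         "fields": {"transferredBytes": 200, "name": "vm_b"}, "time": 1601742781},
    ], time_precision="s")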

NielsKorschinsky commented 4 years ago

I have to check a bit more of the code; I'm not sure the fix is applicable. If I remember right, vmBackupSummary is aggregated and actually shouldn't exist once per VM. I will give you feedback once I have checked.

In any case, name doesn't look like it would actually be a good identifier. I've noticed multiple times that names are identical across multiple VMs and other external sources. Maybe we can find an ID somewhere, or maybe edit the API request to stop "ignoring"/"trashing" it.

NielsKorschinsky commented 4 years ago

If vmBackupSummary is dropping "duplicate" points, we might have a much bigger issue, since we probably have a lot of "duplicate" points within the job-log parsing as well. To be honest, I can't debug this issue right now without access. Either we wait until 01.11.20, when I have access again, or I can describe what I would look for.

Basically, we currently save two message IDs into the vmBackupSummary table: CTGGA0071 and CTGGA2384. The easiest way to check for errors would be to use InfluxDBStudio and compare the number of results within the last 13 days (we have a retention policy of 14 days, so 13 keeps us safely inside the cut-off). Check the lines returned by a query on the Job-Logs table that selects only those two IDs with the time filter, and then check all results within the last 13 days for the vmBackupSummary table.

If the results are only off by about one day's worth of entries, we do not have an issue. If they aren't, this is quite a big issue and requires a good workaround.
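
As a rough sketch of that comparison (the job-log measurement name and database name below are assumptions; adjust them to the actual sppmon schema):

    # Hypothetical check: compare how many job-log messages with the two
    # summary message IDs exist versus how many rows reached vmBackupSummary.
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="spp")

    joblog_rows = client.query(
        "SELECT COUNT(*) FROM \"JOBLOGS\" "
        "WHERE (\"messageId\" = 'CTGGA0071' OR \"messageId\" = 'CTGGA2384') "
        "AND time > now() - 13d"
    )
    summary_rows = client.query(
        "SELECT COUNT(*) FROM \"vmBackupSummary\" WHERE time > now() - 13d"
    )

    # If the two counts differ by much more than roughly one day's worth of
    # entries, points are being dropped on insert.
    print(joblog_rows)
    print(summary_rows)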

NielsKorschinsky commented 4 years ago

About the name issue: we won't be able to find any other ID, since we parse it. Depending on the results of the test we might have to add an aggregated index, but I really hope we do not have to.

baslerj commented 4 years ago

Following is a simple example taken from debug output where there were three messages to process in the insert queue for vmBackupSummary. In this case, the third record was dropped by Influx. All of the tag values were identical for the last two records. The timestamp values were unique at millisecond precision, but became identical once the precision was dropped to seconds.

2020-10-03 10:00:00,715:[PID 34072]:DEBUG:influx_client.insert_dicts_to_buffer> tags: {'proxy': 'spppod2-vm14.storage.tucson.ibm.com', 'vsnaps': 'spppod2-vm14.storage.tucson.ibm.com', 'type': 'Incremental', 'transportType': '[nbd]', 'status': 'complete', 'messageId': 'CTGGA2384'} values: {'transferredBytes': 168500000, 'throughputBytes/s': 19450000, 'queueTimeSec': 0, 'protectedVMDKs': 2, 'TotalVMDKs': 2, 'name': 'spppod1-vm11_sppmondev'} timestamp: 1601742781848
2020-10-03 10:00:00,716:[PID 34072]:DEBUG:influx_client.insert_dicts_to_buffer> tags: {'proxy': 'spppod2-vm14.storage.tucson.ibm.com', 'vsnaps': 'spppod2-vm14.storage.tucson.ibm.com', 'type': 'Base', 'transportType': '[nbd]', 'status': 'complete', 'messageId': 'CTGGA2384'} values: {'transferredBytes': 32290000000, 'throughputBytes/s': 79920000, 'queueTimeSec': 0, 'protectedVMDKs': 1, 'TotalVMDKs': 1, 'name': 'pod2wingrp12_108'} timestamp: 1601742781851
2020-10-03 10:00:00,716:[PID 34072]:DEBUG:influx_client.insert_dicts_to_buffer> tags: {'proxy': 'spppod2-vm14.storage.tucson.ibm.com', 'vsnaps': 'spppod2-vm14.storage.tucson.ibm.com', 'type': 'Base', 'transportType': '[nbd]', 'status': 'complete', 'messageId': 'CTGGA2384'} values: {'transferredBytes': 32280000000, 'throughputBytes/s': 80300000, 'queueTimeSec': 0, 'protectedVMDKs': 1, 'TotalVMDKs': 1, 'name': 'pod2wingrp12_107'} timestamp: 1601742781854
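
A quick way to see the collision: the last two records above differ only in the millisecond part of the timestamp, so they map to the same point once truncated to seconds.

    # The two millisecond timestamps from the debug output above.
    ts_base_108 = 1601742781851
    ts_base_107 = 1601742781854

    # Distinct at millisecond precision ...
    assert ts_base_108 != ts_base_107

    # ... but identical once truncated to seconds, so InfluxDB sees the same
    # measurement, tag set and timestamp and overwrites the earlier point.
    assert ts_base_108 // 1000 == ts_base_107 // 1000 == 1601742781
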
baslerj commented 4 years ago

I have pushed another potential fix for this issue to the branch issue9_vmbackupSummary. I reverted the change in definitions.py that moved name from fields to tags. I also made a change to jobs.py to adjust the timestamp entries as needed to ensure they are unique at second precision. This adjustment is only made when the table is vmBackupSummary, for which perfect accuracy of the timestamps is not essential. Testing demonstrates that the vmBackupSummary table is now populated with a complete set of records.
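
The core idea, as a minimal sketch (illustrative only, not the code in the branch):

    # Hypothetical helper: walk the records in time order and bump any
    # second-precision timestamp that would collide with the previous one.
    # Exact timestamps are not essential for vmBackupSummary, so a drift of
    # a few seconds is acceptable.
    def make_second_timestamps_unique(records):
        last_ts = None
        for record in records:          # records: dicts with an epoch-seconds 'time'
            ts = record["time"]
            if last_ts is not None and ts <= last_ts:
                ts = last_ts + 1        # move forward one second to stay unique
            record["time"] = ts
            last_ts = ts
        return records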

NielsKorschinsky commented 4 years ago

I really like your idea and the changes; it's a nice and easy way to fix this issue. In general, I would like to re-check why we are using second precision instead of millisecond precision. I forget why we had to downgrade it, but I'm sure there was an issue and a reason behind it. Maybe @dwendler can help me remember.

Anyway, I fine-tuned the changes, avoiding a few possible bugs while keeping your idea and code:

  1. I've sorted the log_list to make sure it isn't partially unsorted when comparing with <=. In the worst case this is just unnecessary and won't do any harm; in the best case it prevents large timestamp shifts if the order of the timestamps switches around a lot.
  2. Added a minor None-check for the timestamp.
  3. Used an existing method to automatically convert any precision to second precision. I've also kept the precision of the original value after the change, to avoid further bugs due to different precisions within the same list of data (see the sketch below).

I've also added a minor change regarding #11, removing an unnecessary check for the influx_client and renaming the to_epoch_secs method to make it clearer what it does.
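
For illustration, a conversion of an epoch timestamp of unknown precision down to whole seconds might look roughly like this (the name to_epoch_secs is used only because it is mentioned above; the real helper has been renamed and may be implemented differently):

    # Hypothetical sketch: normalise an epoch timestamp of unknown precision
    # (s, ms, us, ns) down to whole seconds by comparing its digit count
    # against the 10 digits of a current epoch-seconds value.
    def to_epoch_secs(timestamp: int) -> int:
        excess_digits = max(len(str(abs(timestamp))) - 10, 0)
        return timestamp // (10 ** excess_digits)

    assert to_epoch_secs(1601742781854) == 1601742781   # ms -> s
    assert to_epoch_secs(1601742781) == 1601742781      # already seconds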

baslerj commented 3 years ago

The most recent changes were not running correctly in my environment. I made two further updates. 1) The call added to sorted() resulted in the following exception. I was able to resolve this by using a different key for the sort.

  File "c:\Users\JasonBasler\source\repos\IBM\spectrum-protect-sppmon\python\sppmonMethods\jobs.py", line 235, in <lambda>
    sorted_log_iterator = sorted(supported_log_iterator, key=lambda item: item['supported_log_iterator'])
KeyError: 'supported_log_iterator'

2) The following block of code was not working correctly. I have simplified it, and it is now running as expected, although I may have lost something that was intended.

                if(cur_sec_timestamp <= max_sec_timestamp):
                    digits = max((int) (cur_timestamp / max_sec_timestamp), 1)
                    max_sec_timestamp += 1 # increase by 1 second
                    cur_timestamp = max_sec_timestamp * digits

NielsKorschinsky commented 3 years ago

I just checked those changes. The first one was a typo, sorry about that.

Second point:

My intention was to write code which stays safe if we choose to change our timestamp precision from seconds back to milliseconds. I thought this change might become necessary due to the multiple issues we are having because of second precision. I'd like to have a short talk or some feedback about this point: should I produce code which is safe for a precision change, or rather ignore that and simplify the code?

Your changes currently truncate the ms-precision timestamps down to seconds and save them as seconds alongside other unmodified ms-precision timestamps. My old code would have saved the truncated timestamp back in the original format (ms). This would prevent any further issues with mixed precisions within the same set of data.

Note: Currently this precision mixing would only occur inside the Python code; all data is saved with second precision in the database.
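
A minimal sketch of that precision-preserving variant, in contrast to the seconds-only bump (names and the digit-count heuristic are illustrative, not the actual sppmon code):

    # Hypothetical: compare at second precision, but write the adjusted value
    # back in whatever precision the timestamp originally had (e.g. ms).
    def bump_if_duplicate(cur_timestamp, max_sec_timestamp):
        # 1 for second-precision input, 1000 for millisecond-precision input.
        factor = 10 ** max(len(str(abs(cur_timestamp))) - 10, 0)
        cur_sec = cur_timestamp // factor
        if cur_sec <= max_sec_timestamp:
            max_sec_timestamp += 1                       # reserve the next free second
            cur_timestamp = max_sec_timestamp * factor   # restore original precision
        else:
            max_sec_timestamp = cur_sec
        return cur_timestamp, max_sec_timestamp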

Could you describe what was not working as intended? I just reverted the code and debugged it. I discovered a minor error, but I am not sure whether it would have broken things.