almenscorner / IntuneCD

Tool to backup, update and document configurations in Intune
MIT License

[BUG] inconsistent assignment information #163

Closed Wob76 closed 6 months ago

Wob76 commented 7 months ago

Describe the bug: I am seeing inconsistent values between backups related to assignment information.

"App Configurations" and "Applications" are at times not returning any "assignments" information

"Compliance Policies/Policies" is returning a inconsistent values for "assignments/deviceAndAppManagementAssignmentFilterId", one run it will return the GUID and on the new run it will return the String Value.

No errors are produced from the backup run itself.

To Reproduce: Run 2 backups and compare changes.

I am seeing the behaviour both when running in a DevOps Pipeline and when running locally in CLI (using the same Application Credentials in both)

Local backups seem more consistent and will only return no information/GUID every 3rd or 4th run; DevOps fails closer to 1 run in 2 and almost alternates between runs.

Expected behavior: I am expecting a consistent result so that real changes are not lost among changes that don't actually exist.

Run type (please complete the following information):

Using the following options: --mode=1 --output=json --path="C:\Temp\Intune-Backup" --exclude CompliancePartner --append-id
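For the reproduction step above, a small throwaway script can diff two JSON backup runs file by file and surface the flapping values. This is a hypothetical helper written for this report, not part of IntuneCD, and it assumes both runs used --output=json:

```python
import json
from pathlib import Path

def load_json_files(root):
    """Map relative path -> parsed JSON for every .json file under root."""
    root = Path(root)
    return {
        str(p.relative_to(root)): json.loads(p.read_text(encoding="utf-8"))
        for p in root.rglob("*.json")
    }

def diff_backups(run_a, run_b):
    """Return (changed, only_in_one): files whose content differs between
    two backup runs, and files that exist in only one of them."""
    a, b = load_json_files(run_a), load_json_files(run_b)
    changed = sorted(rel for rel in a.keys() & b.keys() if a[rel] != b[rel])
    only_in_one = sorted(a.keys() ^ b.keys())
    return changed, only_in_one
```

Two back-to-back runs with no real Intune changes should produce empty lists; any file that keeps showing up as changed is a candidate for the assignment flapping described above.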

Please let me know if I can supply any further details or complete any more tests.

almenscorner commented 7 months ago

Interesting, I haven't seen this in my env, which has been running for a year. Are all other properties in the backup there? Are the configuration counts in the backup correct? Can you see any 504s or anything else in the backup stream?

Is it possible to share some screenshots of the commit history showing the "change" that was made and of the actual configuration in Intune?

Wob76 commented 7 months ago

Hi,

Thanks for the response. I was running this against a small development environment during early testing, where I didn't see the issue, so it may be due to something in our production environment.

To give an idea of scale: the backups are ~22 MB (in JSON format), containing 2,183 files and 42 folders, and the as-built documentation in Markdown is ~222 MB.

The backup does appear to be complete; at least skimming through, the counts look good.

The output from running the command doesn't have any errors. Is there a way to increase debugging? I did look but couldn't find one.

Here are some screenshots (with mild redaction), noting I will see the opposite change every run or two.

Example of deviceAndAppManagementAssignmentFilterId changing from GUID to name in an application (I see the same thing on Device Configurations): [screenshot]

Example of the assignments reappearing for an application: [screenshot]

almenscorner commented 7 months ago

My setup is quite big as well, with ~83 MB and thousands of files. There is currently no way to increase the output, unfortunately; however, time-outs and throttling are managed in the calls.

I'm trying to think of a way to figure out what might be going on, as I haven't seen it myself or heard anyone else say that the tool behaves like this.

Wob76 commented 7 months ago

Thanks, good to hear it shouldn't be the size at least. Let me know if I can do anything further at my end to help diagnose.

Wob76 commented 7 months ago

An update from my end,

I currently have my pipeline scheduled to run daily at 1am, and for the last 3 days at least it has run without issues.

Today, with help from a colleague whose Python skills are less rusty than mine, we traced all the missing data back to the batch_request function and, with some extra debugging, noticed we at times get no data returned from it. The function also doesn't produce any error output for anything != 200, so we added a little code to see what was happening.

[screenshot]

It would appear that, at least while testing, we are hitting a throttle limit; we see the following error:

Too many requests from Identifier:{TENANT_ID}+{APP_ID} under category:throttle.intune.ags.deviceintent.tenant.app. Please try again later.

I redacted the IDs, but they match my tenant and the app I am using. I can't quite find what throttle limit we are hitting, but it might be nice to add some handling for it if possible.

We also see some errors;

"Message": "DeviceAndAppManagementAssignmentFilter Id cannot be an empty guid. - Operation ID (for customer support): 00000000-0000-0000-0000-000000000000

We seem to have a good chunk of items triggering this error; it may help with the throttle limits if these are removed from the list before calling the batch?
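Skipping the empty filter GUIDs before batching could look roughly like this. The property path matches how Graph represents assignment targets, but the helper itself is hypothetical, not IntuneCD's actual code:

```python
EMPTY_GUID = "00000000-0000-0000-0000-000000000000"

def collect_filter_ids(assignments):
    """Gather unique assignment filter IDs worth resolving, dropping
    missing and empty-GUID filters so no pointless Graph calls are made."""
    ids = set()
    for assignment in assignments:
        fid = assignment.get("target", {}).get(
            "deviceAndAppManagementAssignmentFilterId"
        )
        if fid and fid != EMPTY_GUID:
            ids.add(fid)
    return sorted(ids)
```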

Hopefully this helps. At least our scheduled job seems OK for the moment, but it would be nice to know if we can avoid hitting this limit (or increase it for the app if we can).

Thanks

almenscorner commented 7 months ago

Thank you very much for the debugging effort, this helped a lot!

Looking into this: when batching, throttling was not managed on individual requests in the batch. I have released a new beta version that includes throttle management in batching and also skips the empty GUIDs for filters to remove unnecessary calls.

Latest beta release notes: https://github.com/almenscorner/IntuneCD/releases/tag/v2.0.9-beta.4

Please install the latest beta and do some tests to see if the throttled requests are being handled: pip3 install IntuneCD==2.0.9b4
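For context on why throttling slipped through when batching: Graph's $batch endpoint answers 200 for the envelope even when individual requests inside it are throttled with 429, so each inner response has to be inspected separately. A minimal sketch of that check (hypothetical helper, not the actual IntuneCD implementation):

```python
def split_batch_responses(batch_response):
    """Split a Graph $batch reply into non-throttled responses and the IDs
    of throttled requests to retry, plus a suggested wait in seconds."""
    succeeded, retry_ids, wait = [], [], 0
    for resp in batch_response.get("responses", []):
        if resp.get("status") == 429:
            retry_ids.append(resp["id"])
            # The inner response may carry its own Retry-After header
            retry_after = resp.get("headers", {}).get("Retry-After")
            if retry_after:
                wait = max(wait, int(retry_after))
        else:
            # non-throttled responses (including errors) go back to the caller
            succeeded.append(resp)
    return succeeded, retry_ids, wait
```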

Wob76 commented 7 months ago

Wow, thanks for the prompt turnaround.

I have run 4 backups from my workstation and my colleague ran 2, and all have returned the expected results in assignments, so it's looking great.

Do you have a timeline for when the 2.0.9 build will be released?

We did see some errors using Python 3.8.10, which stopped the run; updating to Python 3.12.2 resolved them. I searched the repo to see what the Python requirements are but couldn't see them called out; might be worth a mention, or a requirement check in pip? [screenshot]
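On the missing Python requirement: the pip-level fix is declaring python_requires in the packaging metadata so that old interpreters refuse to install the package, and a runtime guard is also cheap. The minimum version below is an assumption for illustration only; the actual requirement isn't documented:

```python
import sys

def python_version_ok(version=None, minimum=(3, 9)):
    """Return True if the (major, minor) version meets the assumed minimum."""
    return tuple(version or sys.version_info[:2]) >= minimum

# Example runtime guard at tool start-up:
# if not python_version_ok():
#     raise SystemExit("IntuneCD needs a newer Python interpreter")
```

The equivalent packaging declaration would be something like python_requires=">=3.9" in setup.cfg or pyproject.toml, which makes pip itself reject the install on older interpreters.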

I have another request; happy to open a feature request if you think it worthy. We are looking to use a pipeline (as per this blog) as a configuration capture/change monitor. As per the blog's suggestion I am excluding "CompliancePartner" and the output is good, but it would be nice to have an exclude option for "usedLicenseCount", as this changes regularly and dropping it would reduce some noise in the backups.

Thanks

Wob76 commented 7 months ago

I just attempted the beta in my pipeline but it generated the same error as above. The pipeline uses the ubuntu-latest image, which appears to include Python 3.10.

I updated the pipeline to use Python 3.12 but it still errors.


  File "/opt/hostedtoolcache/Python/3.12.1/x64/lib/python3.12/site-packages/IntuneCD/run_backup.py", line 278, in start
    run_backup(
  File "/opt/hostedtoolcache/Python/3.12.1/x64/lib/python3.12/site-packages/IntuneCD/run_backup.py", line 225, in run_backup
    backup_intune(results, path, output, exclude, token, prefix, append_id, args)
  File "/opt/hostedtoolcache/Python/3.12.1/x64/lib/python3.12/site-packages/IntuneCD/backup_intune.py", line 40, in backup_intune
    results.append(savebackup(path, output, exclude, token, prefix, append_id))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.1/x64/lib/python3.12/site-packages/IntuneCD/backup/Intune/backup_compliance.py", line 37, in savebackup
    assignment_responses = batch_assignment(
                           ^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.1/x64/lib/python3.12/site-packages/IntuneCD/intunecdlib/graph_batch.py", line 251, in batch_assignment
    get_filter_names(responses, filter_ids, token)
  File "/opt/hostedtoolcache/Python/3.12.1/x64/lib/python3.12/site-packages/IntuneCD/intunecdlib/graph_batch.py", line 155, in get_filter_names
    filter_responses = batch_request(
                       ^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.1/x64/lib/python3.12/site-packages/IntuneCD/intunecdlib/graph_batch.py", line 104, in batch_request
    request = makeapirequestPost(
              ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.1/x64/lib/python3.12/site-packages/IntuneCD/intunecdlib/graph_request.py", line 205, in makeapirequestPost
    raise requests.exceptions.HTTPError(
requests.exceptions.HTTPError: Request failed with 400 - {"error":{"code":"BadRequest","message":"Invalid batch payload format.","innerError":{"date":"2024-02-13T03:22:44","request-id":"d94cc424-a57f-4ae3-8620-1cb7486782aa","client-request-id":"d94cc424-a57f-4ae3-8620-1cb7486782aa"}}}

##[error]Bash exited with code '1'.
##[error]Bash wrote one or more lines to the standard error stream.
almenscorner commented 7 months ago

There is a fix for the 400 bad request in the dev branch; I didn't have time to create a new release yesterday. I will do that as soon as possible!

almenscorner commented 7 months ago

As for the "usedLicenseCount", this is for VPP apps if I'm not mistaken?

almenscorner commented 7 months ago

2.0.9, which will most likely become 2.1.0, is set to be released this week :)

Wob76 commented 7 months ago

Thanks for the prompt turnaround, you are fast.

Yes, that would line up; I only see it in the iOS and macOS applications.

[screenshot]

Some of the other data still looks useful, but the only value I can see in "usedLicenseCount" is ensuring it stays below "totalLicenseCount", and we don't plan to monitor that as part of this flow at the moment, so it would be nice to exclude it. I could run with VPP excluded and see what else I miss?

almenscorner commented 7 months ago

The batch retry issue should now be fixed in v2.1.0-beta1, along with an option to exclude the license count by including -e VPPusedLicenseCount when running the backup.

Try this version and let me know how it goes: pip3 install IntuneCD==2.1.0b1

Wob76 commented 7 months ago

This is looking fixed to me. I ran b1 locally and in my pipeline without issue; -e VPPusedLicenseCount is working as expected.

I noticed the release of 2.1.0 and checked that my main pipeline was using it, added -e VPPusedLicenseCount, and again it's working as expected.

I will report back if I see any recurrences of my assignment issues with the throttling in place.

Wob76 commented 7 months ago

@almenscorner

Looks like this needs a revisit. I have completed about 6 or 7 local backups that have all run without issue, but my pipeline is still seeing incorrect assignment information; even on my scheduled run overnight, GUIDs were returned.

The pipeline seems to run faster than my local runs, so maybe it is hitting the throttle due to speed. If I am reading the code right, you only have the 1 retry? More than 1 retry, or a longer delay, may help?

I'll try and get a run in today with some debugging, but it is harder now as I am not seeing the same behaviour on local runs.

Thanks,

Wob76 commented 7 months ago

I've done a little debugging at my end. As a note, when running locally a backup takes ~10-12 min and my pipeline finishes in half that time, so I feel like speed is the issue for me.

I am able to force the error on my local runs by having 2 backups run at the same time.

I added some debugging output to the batch_request function.

The primary trigger for retries looks to be the application batches, and it seems to do two batches. I have also seen a small retry batch in the compliance policy section, but this may be due to it running right after the applications?

I added some lines to output a count of the retry_pool at the start of the "if retry_pool" block, and the same count again just before the return.

I noticed you add to the retry_pool rather than starting a new pool in handle_responses, so the count increases. I did some math and assume that if the pool grows after the retry, some requests failed the 2nd pass; I've called those failed.

Here is some output from the application run:

Retry Pool: 646
Wating 0 seconds
----
Failed Pool: 662
Failed Count: 16
----
Retry Pool: 1564
Wating 20 seconds
----
Failed Pool: 1564
Failed Count: 0

Comparing my output, I see 16 files that did not contain an 'assignments' key, so it matches up with the failures.

I noticed on the first run the wait remains at 0, so I assume the 'Retry-After' header doesn't exist in those responses. I tried bumping this to 5 and then 20, and found that, at least with that batch, it improves the fail count. So I would suggest bumping the initial/default from 0 to 20, or making it a command line option?

This alone did help improve results, but it was still flaky, so the retry section may need something more robust, more attempts perhaps. I would also suggest adding an error output if any failed attempts remain at the end of the batch?
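A more robust retry along the lines suggested, with several attempts, a default wait when Retry-After is missing, and an error report for anything still failing, could be sketched like this (all names here are hypothetical, not IntuneCD's actual batch_request):

```python
import time

def batch_with_retries(send_batch, requests, max_retries=5, default_wait=20):
    """Retry throttled requests, falling back to default_wait seconds
    when the service sends no Retry-After hint.

    send_batch(pending) must return (ok_responses, still_throttled, wait_hint).
    """
    responses, pending, wait = [], list(requests), 0
    for attempt in range(max_retries + 1):
        if not pending:
            break
        if attempt:
            # use the service's hint when present, otherwise the default
            time.sleep(wait or default_wait)
        ok, pending, wait = send_batch(pending)
        responses.extend(ok)
    if pending:
        print(f"ERROR: {len(pending)} requests still throttled "
              f"after {max_retries} retries")
    return responses
```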

Here is another example of poor results on the 2nd batch in the applications:

Retry Pool: 705
Wating 20 seconds
----
Failed Pool: 705
Failed Count: 0
----
Retry Pool: 2040
Wating 20 seconds
----
Failed Pool: 3960
Failed Count: 1920
Wob76 commented 7 months ago

FYI, I just submitted a pull request with my attempt at a fix. Feel free to pick it apart, as I haven't touched Python in quite some time and didn't dig much outside the graph_batch.py file. Now I'd best stop doing after-hours work ;)

almenscorner commented 7 months ago

I have just pushed a new beta with some fixes for this; check the releases and please try the new beta.

almenscorner commented 7 months ago

Verbose output will include info from the "batch_request" function, like:

[Thu Feb 15 11:32:54 2024] - [batch_request] - Retrying failed requests, retry pool count: 249
[Thu Feb 15 11:33:26 2024] - [batch_request] - Retry count: 1
[Thu Feb 15 11:33:26 2024] - [batch_request] - Retrying failed requests, retry pool count: 4
[Thu Feb 15 11:33:47 2024] - [batch_request] - Retry count: 2
[Thu Feb 15 11:33:53 2024] - [batch_request] - Failed requests after 2 retries: 0
Wob76 commented 7 months ago

I've just completed a run in my pipeline using 2.1.2b2 and it is looking good. You can see that the retry pool count doesn't start moving for about 3-4 retries in the bigger second batch; I found the same with my fix. I would suggest bumping the retries to around 10, or maybe adding some padding to the pause.

2024-02-16T00:03:51.9753596Z [Thu Feb 15 23:59:37 2024] - [batch_request] - Retrying failed requests, retry pool count: 22
2024-02-16T00:03:51.9753897Z [Thu Feb 15 23:59:37 2024] - [batch_request] - No wait time in headers, sleeping for 20 seconds...
2024-02-16T00:03:51.9754136Z [Thu Feb 15 23:59:57 2024] - [batch_request] - Retry count: 1
2024-02-16T00:03:51.9754389Z [Thu Feb 15 23:59:57 2024] - [batch_request] - Failed requests after 1 retries: 0
2024-02-16T00:03:51.9754673Z [Fri Feb 16 00:02:03 2024] - [batch_request] - Retrying failed requests, retry pool count: 1231
2024-02-16T00:03:51.9754907Z [Fri Feb 16 00:02:30 2024] - [batch_request] - Retry count: 1
2024-02-16T00:03:51.9755176Z [Fri Feb 16 00:02:30 2024] - [batch_request] - Retrying failed requests, retry pool count: 1231
2024-02-16T00:03:51.9755410Z [Fri Feb 16 00:02:55 2024] - [batch_request] - Retry count: 2
2024-02-16T00:03:51.9755680Z [Fri Feb 16 00:02:55 2024] - [batch_request] - Retrying failed requests, retry pool count: 1231
2024-02-16T00:03:51.9755911Z [Fri Feb 16 00:03:24 2024] - [batch_request] - Retry count: 3
2024-02-16T00:03:51.9756324Z [Fri Feb 16 00:03:24 2024] - [batch_request] - Retrying failed requests, retry pool count: 711
2024-02-16T00:03:51.9756560Z [Fri Feb 16 00:03:51 2024] - [batch_request] - Retry count: 4
2024-02-16T00:03:51.9756814Z [Fri Feb 16 00:03:51 2024] - [batch_request] - Failed requests after 4 retries: 0
2024-02-16T00:03:51.9757025Z Backing up Application: ......

2024-02-16T00:05:11.9828136Z [Fri Feb 16 00:04:03 2024] - [batch_request] - Retrying failed requests, retry pool count: 154
2024-02-16T00:05:11.9828650Z [Fri Feb 16 00:04:25 2024] - [batch_request] - Retry count: 1
2024-02-16T00:05:11.9829119Z [Fri Feb 16 00:04:25 2024] - [batch_request] - Failed requests after 1 retries: 0
2024-02-16T00:05:11.9829548Z Backing up compliance policy: ...
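The "padding to the pause" could be an exponential backoff with a little jitter, so that runs hammering the same tenant don't retry in lockstep. A sketch under those assumptions, not what IntuneCD actually implements:

```python
import random

def backoff_delay(retry_count, base=20, cap=120, jitter=5):
    """Double the wait on each retry, cap it, and add random jitter
    so concurrent pipelines de-synchronize their retries."""
    return min(cap, base * (2 ** retry_count)) + random.uniform(0, jitter)
```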
almenscorner commented 7 months ago

2.1.2b3 has been released with some additional tweaks; check the release notes for details.

Wob76 commented 6 months ago

I updated to 2.1.2b3 on Friday night and the backups over the weekend look good; the additional backoff works well and all 3 backups finished within 2 retry loops. Happy to call this closed, looking forward to the stable release :)