FSLogix / Invoke-FslShrinkDisk

This script will shrink an FSLogix disk to its minimum possible size.
MIT License

Not suitable for FSLogix CCD #45

antonvdl opened this issue 3 years ago

antonvdl commented 3 years ago

The shrink script works great; however, when using CCD we see some issues. The script shrinks the VHD on the primary location and everything seems fine. However, FSLogix is not aware of any changes in the VHD, since the meta file is not updated.

So FSLogix thinks that the VHDs on the primary and secondary locations are equal and continues to write changes to both locations. This results in corruption of the VHD on the secondary location.

Running the script on both the primary and secondary locations will not resolve this issue, since the shrink operation gives a different result on every run (at the block level).

I think the best solution here would be to update the VHD.Meta file so FSLogix knows about the update.

lordjeb commented 3 years ago

There is an additional wrinkle on this one. There may be unwritten data in a cache location on a local machine that still has to be flushed to the storage locations. Any changes on the storage locations would cause problems, and might result in lost cache data.

Additionally, updating the meta file would not be enough, because the vhd file itself has changed; the meta file only keeps some data about what was flushed and when. In this case, a full resync of the updated vhd to the secondary location is going to be necessary.

The only other possibility I could think of would be to mount the vhd file through a machine running CCD and do the shrink operation there, allowing CCD to do its thing and sync all the data to multiple storage locations. Note that this would also handle potential locking issues that would not be handled otherwise.

antonvdl commented 3 years ago

In my understanding:

As long as the vhd is not locked, there should be no local cache data, right?

(In our case we delete the local cache on logoff, but I believe this is also true for other environments)

For the meta file: as soon as FSLogix sees that there is a difference between the meta files, the newest one takes precedence.

So the vhd in the secondary location will be automatically replaced by a copy of the vhd on the primary location.

I can’t find much information about this meta file, and I can’t read it. What I have seen is that the file only gets updated from a user session.

When I edit a vhd with frx, the meta file remains unchanged (and my changes won’t be synced to the second location).

JimMoyle commented 3 years ago

You can safely shrink both locations separately; the FSLogix agent doesn't recognise this as a change to sync over to the other side, as nothing inside the VHD has changed. I'm adding an enhancement to not run when there is an RW disk present, as that does have bad consequences, so for now only run during a maintenance window.

@lordjeb I've tested the above previously and it works fine; let me know if you want to have a chat about it, though.

lordjeb commented 3 years ago

@JimMoyle that's awesome, glad to know it works!

antonvdl commented 3 years ago

Hi @JimMoyle,

We are not using RW disks, but we do see issues after shrinking. The issues occur when the user starts to write new data after the shrink. New data is synced correctly, but for old data the file table is incorrect at the secondary location.

After some writes and deletes, the secondary location becomes corrupted.

JimMoyle commented 3 years ago

@antonvdl That's interesting; I'll test some more, but I have not heard of any other instances of this. How many disks did this happen to at the time?

antonvdl commented 3 years ago

Hi Jim,

I can easily reproduce this with a fresh VHD.

Executing the same steps without the shrink operation does not cause corruption.

I think FSLogix syncs the file table of the VHDs, but the file table is different because of the optimization.

Another example that shows this behaviour (copying file tables) is manually changing the primary VHD.

It looks like the FSLogix agent synchronizes the VHDs at the block level and not at the file level. Because we run a process outside the FSLogix environment, the block structure changes and the secondary VHD becomes corrupted. If the process ran inside the FSLogix environment, the changes would be identical on both VHDs.

When FSLogix knows that there is a difference between the VHDs (based on the meta file), it will discard the oldest VHD and replace it with the newest one.

lordjeb commented 3 years ago

@antonvdl Your takeaway is correct. The FSLogix CCD feature synchronizes changes at the block level. It uses the meta file (the format of which is proprietary) to determine which sequenced changes have been committed to a storage location, so it knows which of multiple disks is the most up to date and which to use as the source of truth if others are less up to date.

I think some valid workarounds would be: shrink the disk on location 1 and delete the disk on location 2, or shrink location 1 and copy the shrunk vhd to location 2. Both of these replicate what the FSLogix agent will do anyway once it determines one of your vhd copies is out of date (a full copy of the vhd), so it seems like a minimal-downside change.
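A minimal sketch of those two workarounds, assuming hypothetical UNC paths and the -Path parameter of Invoke-FslShrinkDisk.ps1:

```powershell
# Hypothetical paths; substitute your own CCD storage locations.
$primary   = '\\filer1\Profiles\jdoe\Profile_jdoe.vhdx'
$secondary = '\\filer2\Profiles\jdoe\Profile_jdoe.vhdx'

# Shrink the VHD on the primary location only.
.\Invoke-FslShrinkDisk.ps1 -Path $primary

# Workaround 1: delete the secondary copy so the agent performs a full
# copy from the primary once it notices the secondary is gone.
Remove-Item -Path $secondary -Force

# Workaround 2 (instead of 1): overwrite the secondary with the shrunk VHD.
# Copy-Item -Path $primary -Destination $secondary -Force
```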

antonvdl commented 3 years ago

@lordjeb Thanks for your reply. I haven’t seen the FSLogix agent notice an out-of-date vhd when the vhd was altered outside a user session.

Do you know when this will happen? Maybe I did not wait long enough.

lordjeb commented 3 years ago

@antonvdl During a login, FSLogix will detect that a storage location is out of date based on the .meta file. It looks at information inside that file to know which changes have been flushed to which storage locations, so it won't detect this based on the timestamps on the vhd file or anything like that.

antonvdl commented 3 years ago

@lordjeb Yes, I observed this behaviour. But when we alter the VHD outside the FSLogix agent's scope (as with this script), the FSLogix agent will not detect the changes.

So we would need to apply the workaround of either deleting the disk on location 2 or syncing location 1 to location 2 after a shrink operation.

antonvdl commented 3 years ago

> @antonvdl That's interesting; I'll test some more, but I have not heard of any other instances of this. How many disks did this happen to at the time?

Hi @JimMoyle, have you been able to replicate the issue? I think this is a widespread issue where people are unaware of corrupted profiles in the second location. As long as you keep using the primary location, you won't notice the issue.

StevenM79 commented 3 years ago

@JimMoyle @antonvdl I can confirm that disk maintenance is not suitable for Cloud Cache disks, as it will eventually corrupt the disk at the secondary location.

Also, if disk optimization is performed at the secondary location, the optimizations are lost when the disk at the primary location is written to. FSLogix does not see that the disk at the secondary location has been modified and therefore will not resync from the primary location.

How can we do correct disk maintenance to reclaim white space? Without it we have approx. 200 TB of data; disk maintenance would reduce this to only 48 TB. But because of the disk corruption this doesn't seem to be a valid approach.

@antonvdl how did you resolve this?

antonvdl commented 3 years ago

@StevenM79 Unfortunately we didn't find a good solution for this issue within FSLogix. On our storage platform we can use dedup and compression, so we win back a bit of the white space.

Another solution would be to write a script around the shrink script that deletes the files on the secondary site (a sketch follows below). At the next login the secondary site will be recovered. However, this approach has two issues:

Maybe you could remove only the metadata file on the secondary site to resolve the first issue, but I didn't test that scenario.
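A minimal sketch of such a wrapper, assuming hypothetical share paths and the -Path and -Recurse parameters of Invoke-FslShrinkDisk.ps1:

```powershell
# Hypothetical shares; adjust to your own CCD locations.
$primaryShare   = '\\filer1\Profiles'
$secondaryShare = '\\filer2\Profiles'

# Shrink every disk found under the primary share.
.\Invoke-FslShrinkDisk.ps1 -Path $primaryShare -Recurse

# Delete the matching files on the secondary share so they are rebuilt
# from the primary at the user's next login.
Get-ChildItem -Path $primaryShare -Recurse -Filter '*.vhd*' | ForEach-Object {
    $twin = $_.FullName.Replace($primaryShare, $secondaryShare)
    if (Test-Path -Path $twin) { Remove-Item -Path $twin -Force }
}
```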

StevenM79 commented 3 years ago

@antonvdl Thanks for the update. I'm testing the removal of the metadata files at the moment, will post the results here.

We found that dedup on storage wins back some space; however, optimizing the disks would win back even more, as we found out. Another option could be to copy the disk to the other storage location after a shrink, but this generates a lot of network traffic and disk I/O :(

StevenM79 commented 3 years ago

@antonvdl I have tested removing the meta file on all CCD locations after optimizing the disk on the secondary CCD location. Unfortunately FSLogix does not detect that the disks on the two CCD locations are different. So, same problem unfortunately.

Guess we will have to go the dedup-on-storage-level route. In our case we will need a whopping 400 TB per CCD location minimum. What are the dedup ratios in your experience?

antonvdl commented 3 years ago

@StevenM79 What happens if you optimize the disk on the primary CCD location and then remove the meta file on the secondary?

The current dedup ratio is 3:1

StevenM79 commented 3 years ago

@antonvdl Removing the meta file on the secondary CCD seems to trigger a resync from primary to secondary at logon. I will test this some more, including the scenario where the primary is unavailable and the secondary has no meta file, etc. The downside is a lot of extra resync traffic during the logon storm in the morning. Not sure if I'm happy about that.

StevenM79 commented 3 years ago

@antonvdl The CCD location where the meta file is removed will be seen as out of date at user logon. This causes a resync from the CCD location which still has a meta file. In case of an outage before a resync was triggered, the system will connect to the CCD location where the meta file has been removed and create a new meta file. The user will work without problems. At the next logon this will be the CCD location with the newer meta file, causing a sync.

So it works correctly if you remove the meta file on the CCD location where you did not perform disk maintenance (sketch below). However, this means that a lot of disks will be resynced at user logon in the morning.
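A hedged sketch of that sequence, with hypothetical paths; it assumes the .meta file sits next to the VHD with a .meta suffix, which should be verified in your environment:

```powershell
# Location 1 gets the maintenance; location 2 gets invalidated.
$maintained = '\\filer1\Profiles\jdoe\Profile_jdoe.vhdx'
$other      = '\\filer2\Profiles\jdoe\Profile_jdoe.vhdx'

# 1. Shrink the disk on the maintained location.
.\Invoke-FslShrinkDisk.ps1 -Path $maintained

# 2. Remove the .meta file on the OTHER location; at the next logon CCD
#    sees that copy as out of date and resyncs it from the maintained one.
Remove-Item -Path ($other + '.meta') -Force
```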

I think this is unacceptable in a large environment, so I decided to further investigate the dedup-on-storage-level scenario. It still gives me headaches, as we have a share limit of 256 TB. With 2 storage locations, 20000 disks, and expected growth to 20 GB per disk, I need a lot of shares and disk space.

Working in the cloud with O365 requires a lot of local storage for the FSLogix caching solution; I have a hard time explaining this to our management. They are under the illusion that they don't need local resources when working with cloud solutions...

antonvdl commented 3 years ago

@StevenM79 Good to hear that this scenario works. The resync the next morning is indeed a big issue. You could edit the script to limit the number of VHDs that get shrunk in the same window, as sketched below. That should limit the effect the next morning, but it makes everything more complex again.
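One hedged way to do that, assuming the disks live under a single share and feeding single files to the script's -Path parameter:

```powershell
# Shrink at most 200 disks per maintenance window, oldest-modified first,
# so the resync load is spread across several mornings.
# The share path and the 200-disk cap are illustrative values.
$maxPerWindow = 200
Get-ChildItem -Path '\\filer1\Profiles' -Recurse -Filter '*.vhdx' |
    Sort-Object -Property LastWriteTime |
    Select-Object -First $maxPerWindow |
    ForEach-Object { .\Invoke-FslShrinkDisk.ps1 -Path $_.FullName }
```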

vikrant003 commented 2 years ago

Hi, does anyone know if the shrink script is suitable for CC? Do you see much profile corruption with or without running the shrink? CC has become a PITA for some time. I didn't see any comments from Jim on this thread since August last year, so I'm wondering if it was really addressed. One final thing: the experience you share is very helpful. Thanks much.

StevenM79 commented 2 years ago

@vikrant003 I decided not to use the shrink script with CC. The main reason is that it corrupts the disks at the location where you did not run the shrink script. This can be solved by removing the META file at the location where the shrink operation was not performed, which causes a full file copy from the other location. When you have a lot of disks, that is a significant data transfer. Instead of shrinking the disks we decided to rely on storage deduplication, which is part of the Dell PowerStore storage solution we use. This gives us nearly the same storage gains as the shrink script.

antonvdl commented 2 years ago

@vikrant003 Exactly the same here. Not using Dell PowerStore but a different vendor who also offers deduplication.

vikrant003 commented 2 years ago

@StevenM79 and @antonvdl, thank you for your feedback.

mav147 commented 1 year ago

We believe this is the source of the problems we're seeing too: corruption in FSLogix. Due to our setup (trying to be highly available), users can switch between primary and secondary sites each time they log into AVD. Does anyone know if the newer "shrink on logoff" feature in FSLogix causes the same issues as the shrink script?

lordjeb commented 1 year ago

@mav147 The new feature in FSLogix should not have the same issues. It works within the FSLogix agent with the disks mounted by CCD. So, as it makes optimization changes, these updates are written out to all providers, and the .meta files are updated so that they can be kept in sync.

mav147 commented 1 year ago

> @mav147 The new feature in FSLogix should not have the same issues. It works within the FSLogix agent with the disks mounted by CCD. So, as it makes optimization changes, these updates are written out to all providers, and the .meta files are updated so that they can be kept in sync.

That makes sense, thanks. We'll give it a try :)

msft-jasonparker commented 1 year ago

> @mav147 The new feature in FSLogix should not have the same issues. It works within the FSLogix agent with the disks mounted by CCD. So, as it makes optimization changes, these updates are written out to all providers, and the .meta files are updated so that they can be kept in sync.
>
> That makes sense, thanks. We'll give it a try :)

Keep in mind that your sign-out times will be significantly increased. In order to compact the VHD, we bring the entire contents local from the storage provider. We then evaluate the VHD to determine if we can compact / save space. If we are able to compact and save space, we perform the operation and then the VHD must be uploaded to ALL storage providers.

All of these actions are part of the user sign-out operation.
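For anyone weighing that trade-off, enabling the built-in compaction comes down to a registry setting. A minimal sketch, assuming the VHDCompactDisk value documented for FSLogix 2210 and later; verify the value name against your installed version:

```powershell
# Enable FSLogix's built-in VHD Disk Compaction at sign-out.
# Assumption: value name VHDCompactDisk under the Profiles key,
# as documented for FSLogix 2210+.
Set-ItemProperty -Path 'HKLM:\SOFTWARE\FSLogix\Profiles' `
    -Name 'VHDCompactDisk' -Value 1 -Type DWord
```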