btrfs / btrfs-todo


Preferred metadata design #19

Open josefbacik opened 3 years ago

josefbacik commented 3 years ago

We need to agree on how this system will work before we start writing code again. The main goal is to provide users the ability to specify a disk (or set of disks) to dedicate for metadata. There are a few policy questions that need to be answered here. Please discuss in new comments, and as we come to a consensus I will update this main entry with the conclusions of our discussions.

Policy questions

Userspace implementation

Kernel implementation

josefbacik commented 3 years ago

For me

Zygo commented 3 years ago

Policy questions

Userspace implementation

Kernel implementation

josefbacik commented 3 years ago

OK, so you envision something more generic than simply "allocate metadata on metadata-preferred devices": classifying devices on a spectrum and using that spectrum to infer policy.

I actually like that idea better, because it removes the need for a separate flag to indicate what the policy is; we simply tag devices with their policy and carry on. From the user's point of view it would map out roughly like this:

# mkfs
mkfs.btrfs --data-preferred /dev/sdb,/dev/sdc --metadata-preferred /dev/sdd,/dev/sde --metadata-only /dev/nvme0n1

# on the fly
echo 'metadata-only' > /sys/fs/btrfs/<uuid>/allocation-preferences/sdc/preference

cat /sys/fs/btrfs/<uuid>/allocation-preferences/sdc/preference
none [metadata-only] metadata-preferred data-only data-preferred

# unmounted
btrfs device preference --set 'metadata-only' /dev/nvme0n1

# mounted
btrfs device preference --set 'metadata-only' /dev/nvme0n1 /path/to/fs

For this, the on-disk implementation would simply be a new item in the device tree per device. Implementation-wise I would do something like

I think that's it code-wise.

Zygo commented 3 years ago

Sounds good. One question: What does "no preference" mean? metadata-only, metadata-preferred, data-preferred, and data-only imply an ordering, so what is "none" and where does it fit in the order?

Or is "none" an alias to one of the others and the default value for new devices, e.g.

# default filesystem with no preferences
cat /sys/fs/btrfs/<uuid>/allocation-preferences/default/preference
metadata-only metadata-preferred data-only [data-preferred]

# set our fastest devices to metadata-only, all other devices to data-only
echo metadata-only | tee /sys/fs/btrfs/<uuid>/allocation-preferences/nvme*/preference
echo data-only | tee /sys/fs/btrfs/<uuid>/allocation-preferences/sd*/preference

# change what 'none' means, so metadata doesn't leak onto new devices with default preferences:
echo data-only > /sys/fs/btrfs/<uuid>/allocation-preferences/default/preference

and if so, wouldn't it be better to call it "default" instead of "none"?

kreijack commented 3 years ago

Policy questions

ENOSPC

Do we fail when we run out of chunks on the metadata disks?

IMHO no: lower performance is far better than a metadata -ENOSPC. However, we should have a logging mechanism to warn the user about this situation (like the one that warns the user when there are different profiles in the same filesystem).

Do we allow the use of non-metadata disks once the metadata disk is full?

As above, yes. However, there are some corner cases that have to be addressed. Suppose we have:

Normal case: data spans sd[cde], metadata spans sd[ab]. What should happen if sdd is full? 1) return -ENOSPC, or 2) should data span sd[abce], or only 3 disks? (The data could have even more space.)

Do we allow the user to specify the behavior in this case?

I think no. However, I am open to changing my mind if there is a specific use case.

Device Replace/Removal

Does the preferred flag follow the device that's being replaced?

In the general case no, I don't see any obvious behavior; we could be replacing a faster disk with a slower one. However, btrfs device replace/remove should warn the user about the possible risks.

What do you do if you remove the only preferred device in the case that we hard ENOSPC if there are no metadata disks with free space?

I think that the "preferred" is just an hint. I prefer a better btrfs-progs that warn the user about this situations (preferred disks full)

Userspace implementation

sysfs interface for viewing current status of all elements.

Definitely yes

sysfs interface at least for setting any policy related settings.

I agree; the only limit is that it is difficult to implement an "atomic change of multiple values" through sysfs. I don't know whether that is needed here, however.

A btrfs command for setting a preferred disk. The ability to set this at mkfs time without the file system mounted.

It does make sense. Anyway, my suggestion is still to allow a mount option to set a "standard" behavior in an emergency situation. The idea would be that mount options are "transitory", while the sysfs settings are "permanent".

Kernel implementation

The device settings need to be persistent (ie which device is preferred for the metadata).

Agree

The policy settings must also be persistent.

Agree

How to store this is still an open question.

The xattr was a fascinating idea. Unfortunately it suffers from two problems:

The other way is the one used to store the default subvolume. The only difference is to use an extensible (and versioned) structure to hold several parameters (even unrelated ones). A "dirty" flag marks the structure to be committed in the next transaction (see btrfs_feature_attr_store()).

Zygo commented 3 years ago

This is one area where Goffredo and I disagree. I have use cases where there absolutely must not be data or metadata on a non-preferred device type. In my examples above "metadata-only" and "data-only" get used more often than "metadata-preferred" or "data-preferred."

In my test cases I never use metadata-preferred or data-preferred at all. I could live with just metadata-only and data-only, but I know others can't, so I included metadata-preferred and data-preferred in all variations of my proposal.

On a server with 250GB of NVME and 50TB of spinning disk, it's a 95% performance hit to put metadata on a non-preferred device, and a 0.5% space gain to use preferred devices for data. That tradeoff is insane; we never want that to happen, and we'd rather have it just not be possible. We're adults, we can buy NVME devices big enough for our metadata requirements.

josefbacik commented 3 years ago

I'm with Zygo here, having first-hand had teams tell me they'd rather have X fall over hard than slowly degrade. I think it's valuable to have the metadata-preferred/data-preferred model for people who want the performance boost plus safety, but it's equally valuable for Zygo and other people who would rather it be fast, and plan appropriately, than have it slowly break.

kreijack commented 3 years ago

OK, let me summarize the algorithm. We have 5 disk classes:

The above ordering is for metadata; for data, the ordering is reversed. The disks are ordered first by class and then by available space. After the sorting, the trailing "*_ONLY" disks (the ones dedicated to the other chunk type) are excluded from consideration by the allocator.

The allocator takes the disks from the first group. If these are not enough (ndevs < devs_min), it extends the list of disks using the second group. If these are still not enough, it extends the list up to the 3rd group, and if necessary up to the 4th group.
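As a toy illustration of that ordering for a metadata chunk (the device names, free-space numbers, and class ranks below are made up for the example; the ranks are not the on-disk constants):

# rank 0 = METADATA_ONLY, 1 = PREFERRED_METADATA, 2 = no hint,
# 3 = PREFERRED_DATA, 4 = DATA_ONLY; sort by rank, then by free space
# descending, and drop the trailing DATA_ONLY devices
printf '%s\n' \
  'nvme0n1 0 200000000000' \
  'sda 1 500000000000' \
  'sdb 2 900000000000' \
  'sdc 3 800000000000' \
  'sdd 4 700000000000' |
  awk '$2 < 4' | sort -k2,2n -k3,3nr
# the allocator would take nvme0n1 first and only extend into sda, sdb, sdc
# while ndevs < devs_min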

To simplify I would like to know your opinion about grouping the different levels:

Zygo commented 3 years ago

The "disk with no hints" case is a problem. If we have split the filesystem into disks that only have data or metadata, but we add a new disk, btrfs might put the wrong type of chunk on it.

There's a few solutions:

  1. Require that one of the following must be true:
    • all devices must be unspecified, metadata-preferred, or data-preferred. "unspecified" is an alias for "data-preferred".
    • if any device has an "only" preference, devices that are "unspecified" get no chunks at all. A side-effect of this is that we can't go directly from "all devices unspecified" to "all devices data-only and metadata-only"; we have to transition through data-preferred and metadata-preferred until the numbers are high enough for RAID profiles, then flip them to data-only and metadata-only in a second pass.
  2. Make "unspecified" an alias for one of the other 4 types, that can be set at run time (this is my earlier comment https://github.com/btrfs/btrfs-todo/issues/19#issuecomment-767108980). For multi-device data-only and metadata-only setups, we follow these steps:
    • set "unspecified" as an alias for data-preferred, so every device is initially data-preferred.
    • set data-only and metadata-only preferences on each device.
    • set "unspecified" as an alias for data-only to prevent metadata from leaking onto slow devices.
  3. Get rid of "unspecified", make the initial state of every device "data-preferred". This can lead to the wrong chunk type on some devices, but data chunks are far faster to relocate with balance, and it makes the user manual much shorter.

My preference is option #3. Even though it can put data on an SSD sometimes, it is much simpler for users.

kreijack commented 3 years ago

Get rid of "unspecified", make the initial state of every device "data-preferred". This can lead to the wrong chunk type on some devices, but data chunks are far faster to relocate with balance, and it makes the user manual much shorter.

I think that this is the "least surprise" option; I like this

Zygo commented 3 years ago

To simplify I would like to know your opinion about grouping the different levels:

group, in the ordering, the devices tagged by METADATA_ONLY and PREFERRED_METADATA

I can't think of cases where you'd have both -preferred and -only preferences set on devices in a filesystem without also wanting the ordering between them. Conversely, if you don't want the ordering, you also don't need two of the preference levels.

We could dig out my chunk_order variant proposal from the mailing list. That allows specifying device allocation order explicitly, and would allow user control over grouping. For that matter, there's no reason why we couldn't do allocation ordering as a separate feature proposal. In that case, we would use the m-o, m-p, d-p, d-o ordering by default, but if the user gives a different explicit ordering, we use that order instead; the device preferences then only affect which chunk types are allowed on which devices.

MarkRose commented 1 year ago

Is there any effort underway to implement this? I may be interested in contributing

hauuau commented 1 year ago

Is there any news about this feature? The patch series from last year doesn't apply cleanly to recent kernels anymore.

@MarkRose, there is a patch series by @kreijack on the linux-btrfs mailing list. Try searching its archives for "allocation_hint".

kakra commented 1 year ago

Here's a patchset I'm trying to keep updated for LTS versions: https://github.com/kakra/linux/pull/26

Forza-tng commented 11 months ago

This is a very interesting topic. As a user and sysadmin, I see many benefits in the multi-level approach.

As a home user, I'd say I would like the preferred option, because I am cost-conscious and want to avoid ENOSPC for as long as possible. However, as a sysadmin I have other resources and priorities, so the *-only options would make sense.

In a way I wonder if the current train of thought could be expanded in a more tiered-storage direction, where we can have classes of data too?

NVME = metadata-only
SSD = data-tier1-preferred
HDD = data-tier2-preferred
iSCSI = data-tier3-preferred
...

This would mean we'd have to classify data somehow. Perhaps with btrfs property set. Then defrag could be used to move data across tiers.
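Purely as a sketch of that idea (the "tier" property below is hypothetical and does not exist in btrfs property today; the tier names are the ones from the list above):

# hypothetical: tag a directory or subvolume with a data tier, then rewrite it
# so the allocator could place the extents according to that tier
btrfs property set /mnt/archive tier data-tier3-preferred
btrfs filesystem defragment -r /mnt/archive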

The biggest value-add is the original metadata preference idea. But, depending on design choices, maybe it would allow for development of tiered allocation later on?

Zygo commented 11 months ago

I wonder if the current train of thought could be expanded in a more tiered storage way where we can have classes of data too?

The current patches sort devices by preference, and there are only 2 preference types plus a "consider other preference types" bit. We could easily add more preference type bits and then specify the order for filling devices, or create multiple preference lists explicitly and sort each one differently. These would require a different representation in the metadata (potentially one with multiple items if, e.g., there are a lot of tiers).

The hard part is that with one monolithic "data" type, the patch only has to touch the chunk allocator to decide where to put new chunks. If there are multiple "data" types and a data classification scheme, then we have to start adding hooks in space_info to decide which existing chunks we use to store new data. That has downstream impact on reservations (the "do I have enough space for X" function now has a "what kind of data is X" parameter) and probably lots of other things.

Zygo commented 11 months ago

In the years since this stuff was posted I did find another use case: allocation preference none-only, meaning never allocate anything on the device. Useful for those times when you replace some devices in an array with larger devices, and need to remove multiple old small devices at once.

btrfs device remove would rewrite the entire array for each removed device if striped profiles are in use (raid0, raid56, or raid10). btrfs fi resize could be used to resize all the small devices to zero 1 GiB at a time, but it doesn't prevent btrfs from putting data on devices with holes in unallocated space that aren't at the end of the device. Allocation preference none-only allows a balance or a single device delete to drain data from all the devices we want to be empty, at once, with zero repeated relocation work.

e.g. we set devices 1-6 to none-only; then device remove won't keep trying to fill up devices 2-6 while we're deleting device 1, or devices 3-6 while we're deleting device 2, etc.
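For illustration, the shrink workflow could look roughly like this, assuming the hypothetical sysfs layout sketched earlier in this thread (the actual patches may expose the setting elsewhere):

# mark every device we want to get rid of as none-only, so no new data or
# metadata chunks are allocated on any of them
for dev in sdb sdc sdd sde sdf sdg; do
    echo 'none-only' > /sys/fs/btrfs/<uuid>/allocation-preferences/$dev/preference
done

# each remove now only relocates chunks off the device being deleted instead of
# re-filling the other soon-to-be-removed devices
btrfs device remove /dev/sdb /mnt
btrfs device remove /dev/sdc /mnt
# ... and so on, or run a single balance to drain them all at once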

I guess none-preferred would make sense to complete the option table, but I don't know what it would be used for. On the other hand, I didn't know what none-only would be used for until I found myself using it.

Tuxist commented 11 months ago

space_info

Isn't it easier if I set musage to 100 percent on a device, so that space_info says there is no space for data?

hauuau commented 11 months ago

Isn't it easier if I set musage to 100 percent on a device

If I understand it correctly "musage" is just a filter on the balance process which selects only block groups with space usage below that percentage for reallocation (balance). Device is another filter there which selects only block groups on that device. Those are not settings which affect allocations in any way beyond forcing blocks which matched those filters to go through the allocation process again during that balance run.

To actually force the allocation process to select a particular device, you need to apply the patches mentioned above and set allocation hints per device.
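To make the distinction concrete (standard balance filter syntax, no extra patches required):

# select the metadata block groups currently on devid 2 and rewrite them;
# the relocated chunks may land on whatever devices the allocator picks
btrfs balance start -mdevid=2 /mnt

# likewise, the usage filter only selects which data block groups get rewritten,
# it does not steer where the new chunks go
btrfs balance start -dusage=10 /mnt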

kakra commented 10 months ago

If I understand it correctly "musage" [...] Device is another filter...

Yes, you do. They are really filters only to select which chunks to consider. If this selects a chunk which has stripes on different devices, it would still work on both devices even if you filtered for just one device.

Allocation strategy is a completely different thing, not influenced by these filters.

Izzette commented 5 months ago

Why not implement a system assigning each device two integer priority values, one for metadata and one for data? These integer values would dictate the order of storage priority, ensuring devices with higher numerical values are utilized first, and lower values are used as fallback options when space runs out. An administrator would set their desired metadata device with a high metadata priority, and a low data priority. This structure not only facilitates the creation of a dedicated metadata device to prevent premature space exhaustion but also supports a variety of scenarios, including:

This prioritization system could also benefit future developments in cache device technology, enabling the cohabitation of cache data and metadata on a single device. It would also refine tiered storage strategies with mixed-speed disks, prioritizing specific data on faster devices while caching other data on slower ones.

kakra commented 5 months ago

@Izzette This idea has some flaws:

  1. You cannot prevent some type of data from ever going to a selected member, it would just overflow to the next lower priority.
  2. Implementing tiering in this way is incomplete because for full tiering support, we also need btrfs trees tracking how "hot" data is and migrate it back and forth (hot data should migrate to the faster tiers, cold data should migrate to the slower tiers), either automatically in the background or via a user-space daemon. Trying to force/mis-use this solution for tiering is just wrong.
  3. The current approach is much more deterministic and avoids challenges with raid1 setups.

I think tiering should really be approached by a completely different implementation. This preferred allocation design has its limited set of applications and it works for that. Tiering is not part of this set and cannot be solved with allocation strategies.

Currently, there's already a function in the kernel where the kernel prefers reading data from stripes on non-rotating disks. A future implementation should probably track latency instead to decide which stripe to read, similar to how mdraid does it. But this is completely different from allocation strategies and doesn't belong in allocation strategy design. We should not try to implement tiering as part of an allocation strategy; it's a very different kind of beast. Stripe selection for preferred reading of data is also not part of allocation strategy.

Adding a reserve device is already possible with these patches (but I agree, in a limited way only), and preventing allocation during migration is also possible (using the "none" preference).

I'm currently using these patches, which implement the idea here, to keep metadata on fast devices only, and to prevent content data from going to these devices. High latency of metadata is the biggest contributor to slow btrfs performance. While this sounds like "tiering", it really isn't, although in case of ENOSPC it will overflow into the other type of device. This needs manual cleanup, so it's not tiering. To get an effect similar to tiering, I instead run my data-preferred partitions through bcache. It's not perfect because, in theory, we would only need to accelerate one stripe and bcache doesn't know about that, but it works well enough for now.
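For reference, a setup along those lines might be assembled roughly like this (device names are made up; make-bcache is from bcache-tools and is unrelated to the btrfs patches):

# the fast NVMe partition becomes the bcache cache set, the HDDs become backing devices
make-bcache -C /dev/nvme0n1p2
make-bcache -B /dev/sda
make-bcache -B /dev/sdb
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo <cset-uuid> > /sys/block/bcache1/bcache/attach

# btrfs then uses the cached HDDs via /dev/bcache*, plus the SSDs (sdc, sdd);
# the allocation-hint patches pin metadata to the SSDs afterwards
mkfs.btrfs -m raid1 -d single /dev/sdc /dev/sdd /dev/bcache0 /dev/bcache1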

Just to spin up some ideas but it should probably be discussed in another place:

Currently, the linked patch + bcache mostly implement the above idea except it doesn't know about stripes and does double-caching if different processes read from different stripes, which is kind of inefficient for both performance and storage requirements.

Forza-tng commented 5 months ago

@kakra

we also need btrfs trees tracking how "hot" data is and migrate it back and forth

How about simply having a user space daemon doing this work and submit migration calls, maybe via a modified balance interface? We do not need btrfs handling this kind of thing internally, IMHO.

kakra commented 5 months ago

@Forza-tng

How about simply having a user space daemon doing this work and submit migration calls, maybe via a modified balance interface? We do not need btrfs handling this kind of thing internally, IMHO.

This should work, and I actually agree that a user-space daemon should handle this (the kernel should not run such jobs). But how does such a daemon know what is hot or cold data? atimes probably don't work. And maybe it should work at a finer-grained level than just "this big file". So, IMHO, we need btrfs to record some usage data and provide an interface for user-space to query it, and user-space decides what is hot and what is cold. E.g. a sophisticated user-space daemon could leverage more sources than just usage data from the file system: it could also look at boot file traces, keep a database in fast storage, or look at memory maps of running processes, similar to how the preload daemon does it, and predict which files should currently be hot. IOW, a user-space daemon has many more opportunities. But we need some usage data from the file system; simply tracing all file system calls is probably not the way to go.

Izzette commented 5 months ago

You cannot prevent some type of data from ever going to a selected member, it would just overflow to the next lower priority.

In my opinion, this is a very easy limitation to overcome. Priorities of 0 or lower could be used for exclusion, preventing allocation to the device for that data type altogether.

Implementing tiering in this way is incomplete ...

I fully agree, but I can see how it could potentially be used as part of a complete tiering solution, especially for write-back caching.

This needs more thinking: if we had more than two classes of devices (e.g. NVMe+SSD+HDD instead of just fast/slow), the above idea of "priorities" comes in handy here.

In fact, this is exactly the case I am encountering / interested in.

My proposal is just an imperfect idea, but having "metadata-only", "data-only", etc. feels limiting; it tells the user how to use the tool instead of giving users the tools they need to fit their use case.

kakra commented 5 months ago

You cannot prevent some type of data from ever going to a selected member, it would just overflow to the next lower priority.

In my opinion, this is a very easy limitation to overcome. Priorities of 0 or lower could be used for exclusion, preventing allocation to the device for that data type altogether.

Yes, this would work. But then, it's essentially what we already have: the list of integer values behind the symbols is actually a priority list with exclusion bits. But I think your idea of splitting it into two integers makes sense and is more flexible, and I like the "0 as exclusion" idea. OTOH, there are not many free bits in the metadata to store such information, thus it has been designed the way it currently is. Everything more complex will need additional btrfs trees to store such data, as far as I understand it.

In fact, this is exactly the case I am encountering / interested in.

Me too, because it will make things much more dynamic. But currently the combination with bcache works for me. E.g., you could use 2x SSD for metadata, mdraid1 NVMe for bcache, and the remaining HDDs will be backends for bcache.

My proposal is just an imperfect idea, but having "metadata-only", "data-only", etc. feels limiting; it tells the user how to use the tool instead of giving users the tools they need to fit their use case.

I still think we should consider a different solution and not try to force metadata allocation hints into a half-baked tiering solution. Yes, you can currently use them as a similar-behaving solution, with some limitations, and your priority idea could make that more flexible. But in the end, it won't become a tiering solution that just dynamically works after your disks fill up: because data sticks to its initial assignment, performance would suddenly drop or start to jitter, since some extents are on fast devices and some on slow. This will happen quite quickly due to CoW behavior.

Allocation hints are meant for a different scenario, and within those limits you can trick them into acting as a tiering solution, even a dynamic one, if we put bcache (or lvmcache) into the mix. Don't get me wrong: I'd prefer a btrfs-native solution, but allocation hinting is not going to work for this because it only affects initial data placement.

In that sense, allocation hinting/priorities is not the right tool to give users tiering. We need a different tool. We don't want a fork with some aluminum foil attached to dig a hole; it would just wear out fast and break.

Forza-tng commented 5 months ago

As far as the metadata patches go, I think they are pretty good as they are. More users would benefit if we could release them officially, maybe with some fallback logic to handle corner cases. The patches themselves seem very solid.

If we want to place data into different tiers, then some changes are likely needed: at least some metadata that can be set (on the extent, inode, or subvolume?) to classify data. The allocator then needs to use these properties during extent allocation.

Another possibility is to make balance accept a placement hint. Balance can already use vrange/drange as a source argument, so providing a target hint to the allocator could be enough to let user-space handle the logic. This would be suitable for moving cold data to a lower tier, or mostly-read data to an upper tier.

kakra commented 5 months ago

Another possibility is to make balance accept a placement hint. Balance can already use vrange/drange as a source argument, so providing a target hint to the allocator could be enough to let user-space handle the logic. This would be suitable for moving cold data to a lower tier, or mostly-read data to an upper tier.

This would be perfect. But still, how do we know what hot, cold, or read-mostly data is?

I still don't think that tiering should be part of allocator hinting, but if btrfs recorded some basic usage stats and offered an interface for a target hint, we would be almost there with a user-space daemon.

Of course, we could use allocator hinting to put new data on a fast tier first, then let user-space migrate it to slower storage over time. But to do that correctly, we need usage stats. But I really have no idea how to implement that in an efficient way because it probably needs to write to the file system itself to persist it in some sort of database or linked structure. Maybe we could borrow some ideas from what the kernel does for RAM with multi-gen LRU or idle-page tracking.

Also, I think metadata should be allowed to migrate to slower storage, too, e.g. if you have lots of metadata hanging around in old snapshots you usually do not touch. This can free up valuable space for tiered caching on the faster drives.

MarkRose commented 5 months ago

There is also the situation of files for which low latency is desired but which are accessed infrequently, such as the contents of ~/.cache/thumbnails. Having the ability to set allocation hints on a directory could enable that. It may be desirable to place those files on the fastest storage, since they are small and don't consume much space.

Or perhaps the user plays games, which are sensitive to loading latency, and wants to prevent the files from being stored on spinning rust, but doesn't want their fastest storage consumed by them (think NVMe/Optane + SSD + HDD). .local/share/steam could be hinted to store on the SSD tier. Similarly, ~/Videos could be hinted to be stored on HDD.

It may also be convenient to place all small files on the fastest tier, similar to how ZFS has the special_small_blocks tunable for datasets. For instance, on the machine I'm typing this on, I have about 6.5 GB of files under 128 KB, which includes a lot of source code.

Maybe intelligence could be added to place new files at different tiers based on "magic tests", similar to how file(1) works. For instance, .gz files and other compressed archives could be stored on spinning rust, since they are not likely to be latency-sensitive.

But I really have no idea how to implement that in an efficient way because it probably needs to write to the file system itself to persist it in some sort of database or linked structure.

Could that be implemented with access count and access time? Update the access count whenever the access time is updated, with a max value (maybe 255). If it's been x amount of time since the last access, reset the count to 0, where x could be tunable. User space could examine the access time and count and decide to move the data.

kakra commented 5 months ago

Having the ability to set allocation hints on a directory could enable that.

This is one example of why allocation hints should not be used for this. It would create a plethora of different options and tunables the file system needs to handle. Everything you describe could easily be handled by a user-space daemon which migrates the data. If anything, allocation hints should only handle the initial placement of such data. And maybe it would be enough to handle that per subvolume. But then again: at some point the fast tier will fill up. We really need some measure to find data which can be migrated to slower storage.

But you mention a lot of great ideas, we should keep that in mind for a user-space daemon.

Update the access count whenever the access time is updated, with a max value (maybe 255). If it's been x amount of time since the last access, reset the count to 0, where x could be tunable.

I think that on a CoW file system we should try to keep metadata updates as low as possible. Often, atime updates are completely disabled, or at least set to relatime, which updates those times at most once per 24h. So this would not make a very valuable statistic. Also, it would probably slow down reading and increase read latency, because every read would involve a potentially expensive write, the opposite of what we want to achieve.

For written data, a user-space daemon could possibly inspect transaction IDs and look for changed files, similar to how bees tracks newly written data. But this won't help at all for reads.

Maybe read events could be placed in an in-memory ring buffer, so a user-space daemon could read them and build a useful statistic from them. And if we lose events due to overrun, it really doesn't matter too much. I think it's sufficient to implement read tiering in a best-effort manner; it doesn't need to track everything perfectly.

Forza-tng commented 1 month ago

Monitoring reads and writes could potentially be done using inotify or fs/fanotify. I have also seen projects using eBPF to monitor filesystem changes. It is probably better to have a user-space daemon managing the monitoring of access patterns than something hardcoded in the kernel.
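As a trivial illustration of the user-space side (inotify-tools shown here; fanotify or eBPF would scale better, as noted):

# log read/write events under the mount point so a daemon or cron job can
# build access statistics and decide what to migrate
inotifywait -m -r -e access,modify --format '%w%f' /mnt >> /var/log/btrfs-access.log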

What are the effects of adding additional modes? We currently have:

0,BTRFS_DEV_ALLOCATION_PREFERRED_DATA
1,BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
2,BTRFS_DEV_ALLOCATION_METADATA_ONLY
3,BTRFS_DEV_ALLOCATION_DATA_ONLY
4,BTRFS_DEV_ALLOCATION_PREFERRED_NONE

For tiering we might need something like:

BTRFS_DEV_ALLOCATION_DATA_PREFERRED_TIER0
BTRFS_DEV_ALLOCATION_DATA_PREFERRED_TIER1
BTRFS_DEV_ALLOCATION_DATA_PREFERRED_TIER2
BTRFS_DEV_ALLOCATION_DATA_PREFERRED_TIER3
...
BTRFS_DEV_ALLOCATION_DATA_ONLY_TIER3

The problem with extending the current patch set with additional tiers is, I am guessing, that we only have one type of DATA chunk. Perhaps it would be possible to create some kind of chunk metadata, though this would not allow for hot/cold data placement according to tiers. Maybe balance could be used to move data into chunks of different tiers? Additionally, how should raid profiles be handled?

In the end, for my personal use cases, the current solution works very well, and I think a lot of users would benefit if we could mainline it as it is. I know there are issues with, for example, free-space calculation, and sanity checking is possibly needed so that users don't lock themselves out by using bad combinations, etc., but maybe it is enough to solve some of them?

kakra commented 1 month ago

4,BTRFS_DEV_ALLOCATION_PREFERRED_NONE

I'm not sure if this is an official one (as far as the patch set of the original author is "official"). Do you refer to my latest patch in https://github.com/kakra/linux/pull/31?

I think we could easily add tier1,2,3 for data. It would use data disks in the preferred order then. To make it actually useful, the balance API needs an option to hint for one of the tiers. And then we'd need some user-space daemon to watch for new data, observe read frequency, and decide whether it should do a balance with a tier hint.

This would actually implement some sort of write cache, because new data would go to the fastest tier first and later be demoted to slower tiers. I wonder if we could use generation/transaction IDs to demote old data to slower tiers when the faster tier fills up above some threshold, unless it was recently read.

Forza-tng commented 1 month ago

4,BTRFS_DEV_ALLOCATION_PREFERRED_NONE

I'm not sure if this is an official one (as far as the patch set of the original author is "official"). Do you refer to my latest patch in kakra/linux#31?

Yes, I think prefer none seems logical to have. It is that patch set that I have been using for some time now.

kakra commented 1 month ago

Yes, I think prefer none seems logical to have. It is your patch set that I have been using for some time now.

Yeah, I added that because I had a use case for it: a disk started to fail in a raid1 btrfs, but I currently do not have a spare disk I want to or can use. It has already migrated about 30% to the non-failing disks just through normal use of the system. So it seems to work. :-)

Forza-tng commented 1 month ago

Regarding the original topic, are we any closer to a consensus? Perhaps this should be brought to the mailing list?