btrfs / btrfs-todo


Preferred metadata design #19

Open josefbacik opened 3 years ago

josefbacik commented 3 years ago

We need to agree on how this system will work before we start writing code again. The main goal is to provide users the ability to specify a disk (or set of disks) to dedicate for metadata. There are a few policy questions that need to be answered here. Please discuss in new comments, and as we come to a consensus I will update this main entry with the conclusions of our discussions.

Policy questions

Userspace implementation

Kernel implementation

josefbacik commented 3 years ago

For me

Zygo commented 3 years ago

Policy questions

Userspace implementation

Kernel implementation

josefbacik commented 3 years ago

OK, so you envision something more generic than simply "allocate metadata on metadata-preferred devices": classifying devices on a spectrum and using that spectrum to infer policy.

I actually like that idea better, because it removes the need for a separate flag to indicate what the policy is; we simply tag devices with their policy and carry on. From the user's point of view it would map out roughly like this:

# mkfs
mkfs.btrfs --data-preferred /dev/sdb,/dev/sdc --metadata-preferred /dev/sdd,/dev/sde --metadata-only /dev/nvme0n1

# on the fly
echo 'metadata-only' > /sys/fs/btrfs/<uuid>/allocation-preferences/sdc/preference

cat /sys/fs/btrfs/<uuid>/allocation-preferences/sdc/preference
none [metadata-only] metadata-preferred data-only data-preferred

# unmounted
btrfs device preference --set 'metadata-only' /dev/nvme0n1

# mounted
btrfs device preference --set 'metadata-only' /dev/nvme0n1 /path/to/fs

For this, the on-disk implementation would simply be a new item in the device tree per device. Implementation-wise I would do something like

I think that's it code-wise.

Zygo commented 3 years ago

Sounds good. One question: What does "no preference" mean? metadata-only, metadata-preferred, data-preferred, and data-only imply an ordering, so what is "none" and where does it fit in the order?

Or is "none" an alias to one of the others and the default value for new devices, e.g.

# default filesystem with no preferences
cat /sys/fs/btrfs/<uuid>/allocation-preferences/default/preference
metadata-only metadata-preferred data-only [data-preferred]

# set our fastest devices to metadata-only, all other devices to data-only
echo metadata-only | tee /sys/fs/btrfs/<uuid>/allocation-preferences/nvme*/preference
echo data-only | tee /sys/fs/btrfs/<uuid>/allocation-preferences/sd*/preference

# change what 'none' means, so metadata doesn't leak onto new devices with default preferences:
echo data-only > /sys/fs/btrfs/<uuid>/allocation-preferences/default/preference

and if so, wouldn't it be better to call it "default" instead of "none"?

kreijack commented 3 years ago

Policy questions

ENOSPC

Do we fail when we run out of chunks on the metadata disks?

IMHO no: lower performance is far better than a metadata -ENOSPC. However, we should have a logging mechanism to warn the user about this situation (like the one that warns the user when there are different profiles in the same filesystem).

Do we allow the use of non-metadata disks once the metadata disk is full?

As above, yes. However, there are some corner cases that have to be addressed. Suppose we have:

Normal case: data spans sd[cde], metadata spans sd[ab]. What should happen if sdd is full? 1) return -ENOSPC, or 2) should data span sd[abce], or only 3 disks? (The data could have even more space.)

Do we allow the user to specify the behavior in this case?

I think no. However, I am open to changing my mind if there is a specific use case.

Device Replace/Removal

Does the preferred flag follow the device that's being replaced?

In the general case no, I don't see any obvious behavior; we could be replacing a faster disk with a slower one. However, btrfs device replace/remove should warn the user about the possible risks.

What do you do if you remove the only preferred device in the case that we hard ENOSPC if there are no metadata disks with free space?

I think that the "preferred" is just an hint. I prefer a better btrfs-progs that warn the user about this situations (preferred disks full)

Userspace implementation

sysfs interface for viewing current status of all elements.

Definitely yes

sysfs interface at least for setting any policy related settings.

I agree; the only limit is that it is difficult to implement an "atomic change of multiple values" through sysfs. I don't know whether that is needed here, however.

A btrfs command for setting a preferred disk. The ability to set this at mkfs time without the file system mounted.

It does make sense. Anyway, my suggestion is still to allow a mount option to set a "standard" behavior in an emergency situation. The idea would be that mount options are "transitory", while the sysfs settings are "permanent".

Kernel implementation

The device settings need to be persistent (ie which device is preferred for the metadata).

Agree

The policy settings must also be persistent.

Agree

How to store this is still an open question.

The xattr was a fascinating idea. Unfortunately it suffers from two problems:

The other way is the one used to store the default subvolume. The only difference is to use an extensible (and versioned) structure to hold several parameters (even unrelated ones). A "dirty" flag marks the structure to be committed in the next transaction (see btrfs_feature_attr_store()).

Zygo commented 3 years ago

This is one area where Goffredo and I disagree. I have use cases where there absolutely must not be data or metadata on a non-preferred device type. In my examples above "metadata-only" and "data-only" get used more often than "metadata-preferred" or "data-preferred."

In my test cases I never use metadata-preferred or data-preferred at all. I could live with just metadata-only and data-only, but I know others can't, so I included metadata-preferred and data-preferred in all variations of my proposal.

On a server with 250GB of NVME and 50TB of spinning disk, it's a 95% performance hit to put metadata on a non-preferred device, and a 0.5% space gain to use preferred devices for data. That tradeoff is insane; we never want that to happen, and we'd rather have it just not be possible. We're adults, we can buy NVME devices big enough for our metadata requirements.

josefbacik commented 3 years ago

I'm with Zygo here, having first-hand had teams tell me they'd rather have X fall over hard than slowly degrade. I think it's valuable to have the metadata-preferred/data-preferred model for people who want the performance boost plus safety, but it's equally valuable for Zygo and other people who would rather it be fast, and plan appropriately, than have it slowly break.

kreijack commented 3 years ago

OK, let me summarize the algorithm. We have 5 disk classes:

The above ordering is for metadata; for data, the ordering is reversed. The disks are ordered first by class and then by available space. After the sorting, the trailing "*_ONLY" disks (the ones dedicated to the other chunk type) are excluded from consideration by the allocator.

The allocator takes the disks from the first group. If these are not enough (ndevs < devs_min), it extends the list of disks using the second group. If these are still not enough, it extends the list up to the 3rd group, and if necessary up to the 4th group.
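As a toy illustration of that ordering for a metadata chunk (the device names, free-space numbers, and class ranks below are made up for the example; the ranks are not the on-disk constants):

# rank 0 = METADATA_ONLY, 1 = PREFERRED_METADATA, 2 = no hint,
# 3 = PREFERRED_DATA, 4 = DATA_ONLY; sort by rank, then by free space
# descending, and drop the trailing DATA_ONLY devices
printf '%s\n' \
  'nvme0n1 0 200000000000' \
  'sda 1 500000000000' \
  'sdb 2 900000000000' \
  'sdc 3 800000000000' \
  'sdd 4 700000000000' |
  awk '$2 < 4' | sort -k2,2n -k3,3nr
# the allocator would take nvme0n1 first and only extend into sda, sdb, sdc
# while ndevs < devs_min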

To simplify I would like to know your opinion about grouping the different levels:

Zygo commented 3 years ago

The "disk with no hints" case is a problem. If we have split the filesystem into disks that only have data or metadata, but we add a new disk, btrfs might put the wrong type of chunk on it.

There's a few solutions:

  1. Require that one of the following must be true:
    • all devices must be unspecified, metadata-preferred, or data-preferred. "unspecified" is an alias for "data-preferred".
    • if any device has an "only" preference, devices that are "unspecified" get no chunks at all. A side-effect of this is that we can't go directly from "all devices unspecified" to "all devices data-only and metadata-only"; we have to transition through data-preferred and metadata-preferred until the numbers are high enough for RAID profiles, then flip them to data-only and metadata-only in a second pass.
  2. Make "unspecified" an alias for one of the other 4 types, that can be set at run time (this is my earlier comment https://github.com/btrfs/btrfs-todo/issues/19#issuecomment-767108980). For multi-device data-only and metadata-only setups, we follow these steps:
    • set "unspecified" as an alias for data-preferred, so every device is initially data-preferred.
    • set data-only and metadata-only preferences on each device.
    • set "unspecified" as an alias for data-only to prevent metadata from leaking onto slow devices.
  3. Get rid of "unspecified", make the initial state of every device "data-preferred". This can lead to the wrong chunk type on some devices, but data chunks are far faster to relocate with balance, and it makes the user manual much shorter.

My preference is option #3. Even though it can put data on an SSD sometimes, it is much simpler for users.

kreijack commented 3 years ago

Get rid of "unspecified", make the initial state of every device "data-preferred". This can lead to the wrong chunk type on some devices, but data chunks are far faster to relocate with balance, and it makes the user manual much shorter.

I think that this is the "least surprise" option; I like this

Zygo commented 3 years ago

To simplify I would like to know your opinion about grouping the different levels:

group, in the ordering, the devices tagged by METADATA_ONLY and PREFERRED_METADATA

I can't think of cases where you'd have both -preferred and -only preferences set on devices in a filesystem without also wanting the ordering between them. Conversely, if you don't want the ordering, you also don't need two of the preference levels.

We could dig out my chunk_order variant proposal from the mailing list. That allows specifying device allocation order explicitly, and would allow user control over grouping. For that matter, there's no reason why we couldn't do allocation ordering as a separate feature proposal. In that case, we would use the m-o, m-p, d-p, d-o ordering by default, but if the user gives a different explicit ordering, we use that order instead; the device preferences then only affect which chunk types are allowed on which devices.

MarkRose commented 1 year ago

Is there any effort underway to implement this? I may be interested in contributing

hauuau commented 1 year ago

Is there any news about this feature? The patch series from last year doesn't apply cleanly to recent kernels anymore.

@MarkRose, there is a patch series by @kreijack on the linux-btrfs mailing list. Try searching its archives for "allocation_hint".

kakra commented 1 year ago

Here's a patchset I'm trying to keep updated for LTS versions: https://github.com/kakra/linux/pull/26

Forza-tng commented 11 months ago

This is a very interesting topic. As a user and sysadmin, I see many benefits in the multi-level approach.

As a home user, I'd say I would like the preferred option, because I am cost-conscious and want to avoid ENOSPC for as long as possible. However, as a sysadmin I have other resources and priorities, so the *-only options would make sense.

In a way I wonder if the current train of thought could be expanded in a more tiered-storage direction, where we can have classes of data too?

NVME = metadata-only
SSD = data-tier1-preferred
HDD = data-tier2-preferred
iSCSI = data-tier3-preferred
...

This would mean we'd have to classify data somehow. Perhaps with btrfs property set. Then defrag could be used to move data across tiers.
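Purely as a sketch of that idea (the "tier" property below is hypothetical and does not exist in btrfs property today; the tier names are the ones from the list above):

# hypothetical: tag a directory or subvolume with a data tier, then rewrite it
# so the allocator could place the extents according to that tier
btrfs property set /mnt/archive tier data-tier3-preferred
btrfs filesystem defragment -r /mnt/archive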

The biggest value-add is the original metadata preference idea. But, depending on design choices, maybe it would allow for development of tiered allocation later on?

Zygo commented 11 months ago

I wonder if the current train of thought could be expanded in a more tiered storage way where we can have classes of data too?

The current patches sort devices by preference, and there are only 2 preference types plus a "consider other preference types" bit. We could easily add more preference type bits and then specify the order for filling devices, or create multiple preference lists explicitly and sort each one differently. These would require a different representation in the metadata (potentially one with multiple items if, e.g., there are a lot of tiers).

The hard part is that with one monolithic "data" type, the patch only has to touch the chunk allocator to decide where to put new chunks. If there are multiple "data" types and a data classification scheme, then we have to start adding hooks in space_info to decide which existing chunks we use to store new data. That has downstream impact on reservations (the "do I have enough space for X" function now has a "what kind of data is X" parameter) and probably lots of other things.

Zygo commented 11 months ago

In the years since this stuff was posted I did find another use case: allocation preference none-only, meaning never allocate anything on the device. Useful for those times when you replace some devices in an array with larger devices, and need to remove multiple old small devices at once.

btrfs device remove would rewrite the entire array for each removed device if striped profiles are in use (raid0, raid56, or raid10). btrfs fi resize could be used to resize all the small devices to zero 1 GiB at a time, but it doesn't prevent btrfs from putting data on devices with holes in unallocated space that aren't at the end of the device. Allocation preference none-only allows a balance or a single device delete to drain data from all the devices we want to be empty, at once, with zero repeated relocation work.

e.g. we set devices 1-6 to none-only; then device remove won't keep trying to fill up devices 2-6 while we're deleting device 1, or devices 3-6 while we're deleting device 2, etc.
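For illustration, the shrink workflow could look roughly like this, assuming the hypothetical sysfs layout sketched earlier in this thread (the actual patches may expose the setting elsewhere):

# mark every device we want to get rid of as none-only, so no new data or
# metadata chunks are allocated on any of them
for dev in sdb sdc sdd sde sdf sdg; do
    echo 'none-only' > /sys/fs/btrfs/<uuid>/allocation-preferences/$dev/preference
done

# each remove now only relocates chunks off the device being deleted instead of
# re-filling the other soon-to-be-removed devices
btrfs device remove /dev/sdb /mnt
btrfs device remove /dev/sdc /mnt
# ... and so on, or run a single balance to drain them all at once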

I guess none-preferred would make sense to complete the option table, but I don't know what it would be used for. On the other hand, I didn't know what none-only would be used for until I found myself using it.

Tuxist commented 11 months ago

space_info

Isn't it easier if I set musage to 100 percent on a device, so that space_info says there is no space for data?

hauuau commented 11 months ago

Isn't it easier if I set musage to 100 percent on a device

If I understand it correctly "musage" is just a filter on the balance process which selects only block groups with space usage below that percentage for reallocation (balance). Device is another filter there which selects only block groups on that device. Those are not settings which affect allocations in any way beyond forcing blocks which matched those filters to go through the allocation process again during that balance run.

To actually force the allocation process to select a particular device, you need to apply the patches mentioned above and set allocation hints per device.
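To make the distinction concrete (standard balance filter syntax, no extra patches required):

# select the metadata block groups currently on devid 2 and rewrite them;
# the relocated chunks may land on whatever devices the allocator picks
btrfs balance start -mdevid=2 /mnt

# likewise, the usage filter only selects which data block groups get rewritten,
# it does not steer where the new chunks go
btrfs balance start -dusage=10 /mnt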

kakra commented 10 months ago

If I understand it correctly "musage" [...] Device is another filter...

Yes, you do. They are really filters only to select which chunks to consider. If this selects a chunk which has stripes on different devices, it would still work on both devices even if you filtered for just one device.

Allocation strategy is a completely different thing, not influenced by these filters.

Izzette commented 5 months ago

Why not implement a system assigning each device two integer priority values, one for metadata and one for data? These integer values would dictate the order of storage priority, ensuring devices with higher numerical values are utilized first, and lower values are used as fallback options when space runs out. An administrator would set their desired metadata device with a high metadata priority, and a low data priority. This structure not only facilitates the creation of a dedicated metadata device to prevent premature space exhaustion but also supports a variety of scenarios, including:

This prioritization system could also benefit future developments in cache device technology, enabling the cohabitation of cache data and metadata on a single device. It would also refine tiered storage strategies with mixed-speed disks, prioritizing specific data on faster devices while caching other data on slower ones.

kakra commented 5 months ago

@Izzette This idea has some flaws:

  1. You cannot prevent some type of data from ever going to a selected member, it would just overflow to the next lower priority.
  2. Implementing tiering in this way is incomplete because for full tiering support, we also need btrfs trees tracking how "hot" data is and migrate it back and forth (hot data should migrate to the faster tiers, cold data should migrate to the slower tiers), either automatically in the background or via a user-space daemon. Trying to force/mis-use this solution for tiering is just wrong.
  3. The current approach is much more deterministic and avoids challenges with raid1 setups.

I think tiering should really be approached by a completely different implementation. This preferred allocation design has its limited set of applications and it works for that. Tiering is not part of this set and cannot be solved with allocation strategies.

Currently, there's already a function in the kernel where the kernel prefers reading data from stripes on non-rotating disks. A future implementation should probably track latency instead to decide which stripe to read, similar to how mdraid does it. But this is completely different from allocation strategies and doesn't belong in allocation strategy design. We should not try to implement tiering as part of an allocation strategy; it's a very different kind of beast. Stripe selection for preferred reading of data is also not part of allocation strategy.

Adding a reserve device is already possible with these patches (but I agree, in a limited way only), and preventing allocation during migration is also possible (using the "none" preference).

I'm currently using these patches, which implement the idea here, to keep metadata on fast devices only, and to prevent content data from going to these devices. High latency of metadata is the biggest contributor to slow btrfs performance. While this sounds like "tiering", it really isn't, although in case of ENOSPC it will overflow into the other type of device. This needs manual cleanup, so it's not tiering. To get an effect similar to tiering, I instead run my data-preferred partitions through bcache. It's not perfect because, in theory, we would only need to accelerate one stripe and bcache doesn't know about that, but it works well enough for now.
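For reference, a setup along those lines might be assembled roughly like this (device names are made up; make-bcache is from bcache-tools and is unrelated to the btrfs patches):

# the fast NVMe partition becomes the bcache cache set, the HDDs become backing devices
make-bcache -C /dev/nvme0n1p2
make-bcache -B /dev/sda
make-bcache -B /dev/sdb
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo <cset-uuid> > /sys/block/bcache1/bcache/attach

# btrfs then uses the cached HDDs via /dev/bcache*, plus the SSDs (sdc, sdd);
# the allocation-hint patches pin metadata to the SSDs afterwards
mkfs.btrfs -m raid1 -d single /dev/sdc /dev/sdd /dev/bcache0 /dev/bcache1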

Just to spin up some ideas but it should probably be discussed in another place:

Currently, the linked patch + bcache mostly implement the above idea except it doesn't know about stripes and does double-caching if different processes read from different stripes, which is kind of inefficient for both performance and storage requirements.

Forza-tng commented 5 months ago

@kakra

we also need btrfs trees tracking how "hot" data is and migrate it back and forth

How about simply having a user space daemon doing this work and submit migration calls, maybe via a modified balance interface? We do not need btrfs handling this kind of thing internally, IMHO.

kakra commented 5 months ago

@Forza-tng

How about simply having a user space daemon doing this work and submit migration calls, maybe via a modified balance interface? We do not need btrfs handling this kind of thing internally, IMHO.

This should work, and I actually agree that a user-space daemon should handle this (the kernel should not run such jobs). But how does such a daemon know what is hot or cold data? atimes probably don't work. And maybe it should work at a finer-grained level than just "this big file". So, IMHO, we need btrfs to record some usage data and provide an interface for user-space to query it, and user-space decides what is hot and what is cold. E.g. a sophisticated user-space daemon could leverage more sources than just usage data from the file system: it could also look at boot file traces, keep a database in fast storage, or look at memory maps of running processes, similar to how the preload daemon does it, and predict which files should currently be hot. IOW, a user-space daemon has many more opportunities. But we need some usage data from the file system; simply tracing all file system calls is probably not the way to go.

Izzette commented 5 months ago

You cannot prevent some type of data from ever going to a selected member, it would just overflow to the next lower priority.

In my opinion, this is a very easy limitation to overcome. Priorities of 0 or lower could be used for exclusion, preventing allocation to the device for that data type altogether.

Implementing tiering in this way is incomplete ...

I fully agree, but I can see how it could potentially be used as part of a complete tiering solution, especially for write-back caching.

This needs more thinking: if we had more than two classes of devices (e.g. NVMe+SSD+HDD instead of just fast/slow), the above idea of "priorities" comes in handy here.

In fact, this is exactly the case I am encountering / interested in.

My proposal is just an imperfect idea, but having "metadata-only", "data-only", etc. feels limiting; it tells the user how to use the tool instead of giving users the tools they need to fit their use case.

kakra commented 5 months ago

You cannot prevent some type of data from ever going to a selected member, it would just overflow to the next lower priority.

In my opinion, this is a very easy limitation to overcome. Priorities of 0 or lower could be used for exclusion, preventing allocation to the device for that data type altogether.

Yes, this would work. But then, it's essentially what we already have: the list of integer values behind the symbols is actually a priority list with exclusion bits. But I think your idea of splitting it into two integers makes sense and is more flexible, and I like the "0 as exclusion" idea. OTOH, there are not many free bits in the metadata to store such information, thus it has been designed the way it currently is. Everything more complex will need additional btrfs trees to store such data, as far as I understand it.

In fact, this is exactly the case I am encountering / interested in.

Me too, because it will make things much more dynamic. But currently the combination with bcache works for me. E.g., you could use 2x SSD for metadata, mdraid1 NVMe for bcache, and the remaining HDDs will be backends for bcache.

My proposal is just an imperfect idea, but having "metadata-only", "data-only", etc. feels limiting; it tells the user how to use the tool instead of giving users the tools they need to fit their use case.

I still think we should consider a different solution and not try to force metadata allocation hints into a half-baked tiering solution. Yes, you can currently use them as a similar-behaving solution, with some limitations, and your priority idea could make that more flexible. But in the end, it won't become a tiering solution that just dynamically works after your disks fill up: because data sticks to its initial assignment, performance would suddenly drop or start to jitter, since some extents are on fast devices and some on slow. This will happen quite quickly due to CoW behavior.

Allocation hints are meant for a different scenario, and within those limits you can trick them into acting as a tiering solution, even a dynamic one, if we put bcache (or lvmcache) into the mix. Don't get me wrong: I'd prefer a btrfs-native solution, but allocation hinting is not going to work for this because it only affects initial data placement.

In that sense, allocation hinting/priorities is not the right tool to give users tiering. We need a different tool. We don't want a fork with some aluminum foil attached to dig a hole; it would just wear out fast and break.

Forza-tng commented 5 months ago

As far as the metadata patches go, I think they are pretty good as they are. More users would benefit if we could release them officially, maybe with some fallback logic to handle corner cases. The patches themselves seem very solid.

If we want to place data into different tiers, then some changes are likely needed: at least some metadata that can be set (on the extent, inode, or subvolume?) to classify data. The allocator then needs to use these properties during extent allocation.

Another possibility is to make balance accept a placement hint. Balance can already use vrange/drange as a source argument, so providing a target hint to the allocator could be enough to let user-space handle the logic. This would be suitable for moving cold data to a lower tier, or mostly-read data to an upper tier.

kakra commented 5 months ago

Another possibility is to make balance accept a placement hint. Balance can already use vrange/drange as a source argument, so providing a target hint to the allocator could be enough to let user-space handle the logic. This would be suitable for moving cold data to a lower tier, or mostly-read data to an upper tier.

This would be perfect. But still, how do we know what hot, cold, or read-mostly data is?

I still don't think that tiering should be part of allocator hinting, but if btrfs recorded some basic usage stats and offered an interface for a target hint, we would be almost there with a user-space daemon.

Of course, we could use allocator hinting to put new data on a fast tier first, then let user-space migrate it to slower storage over time. But to do that correctly, we need usage stats. But I really have no idea how to implement that in an efficient way because it probably needs to write to the file system itself to persist it in some sort of database or linked structure. Maybe we could borrow some ideas from what the kernel does for RAM with multi-gen LRU or idle-page tracking.

Also, I think metadata should be allowed to migrate to slower storage, too, e.g. if you have lots of metadata hanging around in old snapshots you usually do not touch. This can free up valuable space for tiered caching on the faster drives.

MarkRose commented 5 months ago

There is also the situation of files for which low latency is desired but which are accessed infrequently, such as the contents of ~/.cache/thumbnails. Having the ability to set allocation hints on a directory could enable that. It may be desirable to place those files on the fastest storage, since they are small and don't consume much space.

Or perhaps the user plays games, which are sensitive to loading latency, and wants to prevent the files from being stored on spinning rust, but doesn't want their fastest storage consumed by them (think NVMe/Optane + SSD + HDD). .local/share/steam could be hinted to store on the SSD tier. Similarly, ~/Videos could be hinted to be stored on HDD.

It may also be convenient to place all small files on the fastest tier, similar to how ZFS has the special_small_blocks tunable for datasets. For instance, on the machine I'm typing this on, I have about 6.5 GB of files under 128 KB, which includes a lot of source code.

Maybe intelligence could be added to place new files at different tiers based on "magic tests", similar to how file(1) works. For instance, .gz files and other compressed archives could be stored on spinning rust, since they are not likely to be latency-sensitive.

But I really have no idea how to implement that in an efficient way because it probably needs to write to the file system itself to persist it in some sort of database or linked structure.

Could that be implemented with access count and access time? Update the access count whenever the access time is updated, with a max value (maybe 255). If it's been x amount of time since the last access, reset the count to 0, where x could be tunable. User space could examine the access time and count and decide to move the data.

kakra commented 5 months ago

Having the ability to set allocation hints on a directory could enable that.

This is one example of why allocation hints should not be used for this. It would create a plethora of different options and tunables the file system needs to handle. Everything you describe could easily be handled by a user-space daemon which migrates the data. If anything, allocation hints should only handle the initial placement of such data. And maybe it would be enough to handle that per subvolume. But then again: at some point the fast tier will fill up. We really need some measure to find data which can be migrated to slower storage.

But you mention a lot of great ideas, we should keep that in mind for a user-space daemon.

Update the access count whenever the access time is updated, with a max value (maybe 255). If it's been x amount of time since the last access, reset the count to 0, where x could be tunable.

I think that on a CoW file system we should try to keep metadata updates as low as possible. Often, atime updates are completely disabled, or at least set to relatime, which updates those times at most once per 24h. So this would not make a very valuable statistic. Also, it would probably slow down reading and increase read latency, because every read would involve a potentially expensive write, the opposite of what we want to achieve.

For written data, a user-space daemon could possibly inspect transaction IDs and look for changed files, similar to how bees tracks newly written data. But this won't help at all for reads.

Maybe read events could be placed in an in-memory ring buffer, so a user-space daemon could read them and build a useful statistic from them. And if we lose events due to overrun, it really doesn't matter too much. I think it's sufficient to implement read tiering in a best-effort manner; it doesn't need to track everything perfectly.

Forza-tng commented 1 month ago

Monitoring reads and writes could potentially be done using inotify or fs/fanotify. I have also seen projects using eBPF to monitor filesystem changes. It is probably better to have a user-space daemon managing the monitoring of access patterns than something hardcoded in the kernel.
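As a trivial illustration of the user-space side (inotify-tools shown here; fanotify or eBPF would scale better, as noted):

# log read/write events under the mount point so a daemon or cron job can
# build access statistics and decide what to migrate
inotifywait -m -r -e access,modify --format '%w%f' /mnt >> /var/log/btrfs-access.log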

What are the effects of adding additional modes? We currently have:

0,BTRFS_DEV_ALLOCATION_PREFERRED_DATA
1,BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
2,BTRFS_DEV_ALLOCATION_METADATA_ONLY
3,BTRFS_DEV_ALLOCATION_DATA_ONLY
4,BTRFS_DEV_ALLOCATION_PREFERRED_NONE

For tiering we might need something like:

BTRFS_DEV_ALLOCATION_DATA_PREFERRED_TIER0
BTRFS_DEV_ALLOCATION_DATA_PREFERRED_TIER1
BTRFS_DEV_ALLOCATION_DATA_PREFERRED_TIER2
BTRFS_DEV_ALLOCATION_DATA_PREFERRED_TIER3
...
BTRFS_DEV_ALLOCATION_DATA_ONLY_TIER3

The problem with extending the current patch set with additional tiers is, I am guessing, that we only have one type of DATA chunk. Perhaps it would be possible to create some kind of chunk metadata, though this would not allow for hot/cold data placement according to tiers. Maybe balance could be used to move data into chunks of different tiers? Additionally, how should raid profiles be handled?

In the end, for my personal use cases, the current solution works very well, and I think a lot of users would benefit if we could mainline it as it is. I know there are issues with, for example, free-space calculation, and sanity checking is possibly needed so that users don't lock themselves out by using bad combinations, etc., but maybe it is enough to solve some of them?

kakra commented 1 month ago

4,BTRFS_DEV_ALLOCATION_PREFERRED_NONE

I'm not sure if this is an official one (as far as the patch set of the original author is "official"). Do you refer to my latest patch in https://github.com/kakra/linux/pull/31?

I think we could easily add tier1,2,3 for data. It would use data disks in the preferred order then. To make it actually useful, the balance API needs an option to hint for one of the tiers. And then we'd need some user-space daemon to watch for new data, observe read frequency, and decide whether it should do a balance with a tier hint.

This would actually implement some sort of write cache, because new data would go to the fastest tier first and later be demoted to slower tiers. I wonder if we could use generation/transaction IDs to demote old data to slower tiers when the faster tier fills up above some threshold, unless it was recently read.

Forza-tng commented 1 month ago

4,BTRFS_DEV_ALLOCATION_PREFERRED_NONE

I'm not sure if this is an official one (as far as the patch set of the original author is "official"). Do you refer to my latest patch in kakra/linux#31?

Yes, I think prefer none seems logical to have. It is that patch set that I have been using for some time now.

kakra commented 1 month ago

Yes, I think prefer none seems logical to have. It is your patch set that I have been using for some time now.

Yeah, I added that because I had a use case for it: a disk started to fail in a raid1 btrfs, but I currently do not have a spare disk I want to or can use. It has already migrated about 30% to the non-failing disks just through normal use of the system. So it seems to work. :-)

Forza-tng commented 1 month ago

Regarding the original topic, are we any closer to a consensus? Perhaps this should be brought to the mailing list?