apple / swift-algorithms

Commonly used sequence and collection algorithms for Swift
Apache License 2.0
5.92k stars 440 forks source link

Add `paddingStart(with:toCount:)` and `paddingEnd(with:toCount:)` operations to `RangeReplaceableCollection`. #157

Closed Sameesunkaria closed 2 years ago

Sameesunkaria commented 3 years ago

Description

Introduces "padding" methods on RangeReplaceableCollection to expand the collection's length to the desired length by repeating a padding element at the start or at the end. A padding operation is often performed on strings or data to achieve a fixed length output.

This change adds two new operations, paddingStart and paddingEnd and the corresponding mutating variants padStart and padEnd. Apple's Foundation has a similar padding API padding(toLength:withPad:startingAt:) but, it is only available on NSStrings. paddingStart and paddingEnd operate on any Swift RangeReplaceableCollection. There also seems to be some interest for adding a padding method for arrays on the Swift forums, the two newly added methods should address those demands.

A couple of example use-cases are shown below:

let paddedNumber = "96".paddingStart(with: "0", toCount: 4)
// paddedNumber == "0096"

let data = Data([0xde, 0xad, 0xbe, 0xef])
let paddedData = data.paddingEnd(with: 0, toCount: 16)
// paddedData contains 16 bytes

Detailed Design

Two new operations, paddingStart and paddingEnd have been introduced to RangeReplaceableCollection.

extension RangeReplaceableCollection {
    public func paddingStart(
        with element: Element,
        toCount paddedCount: Int
    ) -> Self

    public func paddingEnd(
        with element: Element,
        toCount paddedCount: Int
    ) -> Self
}

Each of the two operations also have a mutating varient.

extension RangeReplaceableCollection {
    public mutating func padStart(
        with element: Element,
        toCount paddedCount: Int)

    public mutating func padEnd(
        with element: Element,
        toCount paddedCount: Int)
}

Padding a collection expands the collection to the desired length by repeating the padding element at the start or the end. In case the collection's length is greater than the paddedCount, the collection is preserved.

Documentation Plan

The documentation is WIP both in code (documentation comments) and the markdown guide. I am looking for additional feedback from the community whether this change belongs in swift-algorithms and if the API or implementation needs to be changed or improved. Complete documentation will be added once the API is finalized.

Test Plan

Unit tests have been added for the newly introduced collection types.

Source Impact

This is a purely additive change.

Checklist

timvermeulen commented 3 years ago

Thanks for your contribution, @Sameesunkaria! This looks excellent, and it's a great fit for this package.

With regard to the implementation strategy, I think the right approach is to make this a wrapper around a Chain2 instance. It's totally fine (and even encouraged!) to have a dependency on other functionality from this package. We already do that in several places, a good example is how FiniteCycle is a thin wrapper around a Product2 instance.

  • An eager implementation on RangeReplaceableCollection.

Mutating variants on RangeReplaceableColleciton would be great to have for situations where you're working with an Array / Data / String instance and want to keep using it, rather than a custom wrapper type that doesn't have the same capabilities. It might even make sense to add non-mutating variants on RangeReplaceableCollection as well, although that's not as obvious to me since the wrapper type will have better performance. This package provides both mutating and non-mutating ways of trimming a collection, it's worth checking that out if you haven't.

For symmetry with the trimmingPrefix / trimmingSuffix methods it could make sense to name these paddingPrefix / paddingSuffix (and mutating versions padPrefix / padSuffix). What do you think about that?

Sameesunkaria commented 3 years ago

It's totally fine (and even encouraged!) to have a dependency on other functionality from this package.

Thanks for pointing that out. Having a wrapper around the existing collections would significantly simplify the implementation.

It might even make sense to add non-mutating variants on RangeReplaceableCollection as well, although that's not as obvious to me since the wrapper type will have better performance.

In an ideal world String(str.lazyPaddingPrefix(with: "0", toCount: 10)) and str.eagerPaddingPrefix(with: "0", toCount: 10) should perform the same right?


I was also inspired by some of your other PRs that are open currently. It may be beneficial to take another sequence for the padding elements:

extension Collection {
    func paddedPrefix<Other: Sequence>(
        with other: Other,
        toCount paddingLength: Int
    ) -> PaddedPrefix<Self, Other> where Other.Element == Element
}

What do you think about it?

There is the obvious question about where should the other sequence start. One might not want the Other sequence to start at the beginning based on the count of the collection; similar to Apple's padding(toLength:withPad:startingAt:).

timvermeulen commented 3 years ago

It might even make sense to add non-mutating variants on RangeReplaceableCollection as well, although that's not as obvious to me since the wrapper type will have better performance.

In an ideal world String(str.lazyPaddingPrefix(with: "0", toCount: 10)) and str.eagerPaddingPrefix(with: "0", toCount: 10) should perform the same right?

A padding method that constructs a new string is always going to be more expensive by necessity than a thin wrapper around a Chain2 instance, because the former needs to allocate a new string and add every character to it while the latter is essentially a no-op.

I was also inspired by some of your other PRs that are open currently. It may be beneficial to take another sequence for the padding elements:

extension Collection {
    func paddedPrefix<Other: Sequence>(
        with other: Other,
        toCount paddingLength: Int
    ) -> PaddedPrefix<Self, Other> where Other.Element == Element
}

What do you think about it?

There is the obvious question about where should the other sequence start. One might not want the Other sequence to start at the beginning based on the count of the collection; similar to Apple's padding(toLength:withPad:startingAt:).

Great point, the Foundation API indeed lets you do more than only pad a string with a single character. We don't need to necessarily add all of the same functionality as that method, though. It's perfectly fine to only allow padding using a single element, and wait and see if people actually want to use multiple elements for padding. For what it's worth, from my limited research it seems common for other programming languages to only supply the single element version out of the box.

timvermeulen commented 3 years ago

We've discussed this internally and we think the lazy wrapper around a Chain2 probably has limited usefulness compared to versions that operate on RangeReplaceableCollections. The most obvious ones are

extension RangeReplaceableCollection {
  func paddingPrefix(with element: Element, toCount paddingLength: Int) -> Self
  func paddingSuffix(with element: Element, toCount paddingLength: Int) -> Self

  mutating func padPrefix(with element: Element, toCount paddingLength: Int)
  mutating func padSuffix(with element: Element, toCount paddingLength: Int)
}

We should probably first find some real-world use cases for a lazy wrapper that is available to all collections, before adding it as well.

timvermeulen commented 3 years ago

And maybe prefix/suffix is not the right terminology to use here because these methods don't operate on the prefix the same way trimmingPrefix does, so how about we go with paddingStart/paddingEnd instead?

Sameesunkaria commented 3 years ago

And maybe prefix/suffix is not the right terminology to use here because these methods don't operate on the prefix the same way trimmingPrefix does, so how about we go with paddingStart/paddingEnd instead?

Good point.

Sameesunkaria commented 3 years ago

I'll benchmark a few implementations for padding on RangeReplaceableCollection and update the PR with the one that performs the best. I'll post the results here.

Sameesunkaria commented 3 years ago

I performed some performance tests for the following three implementations:

extension RangeReplaceableCollection {
  func pad1(with element: Element, toCount paddingCount: Int) -> Self {
    guard count < paddingCount else { return self }
    return Self.init(repeating: element, count: paddingCount - count) + self
  }

  func pad2(with element: Element, toCount paddingCount: Int) -> Self {
    guard count < paddingCount else { return self }
    var paddedCollection = Self.init(repeating: element, count: paddingCount)
    let paddingEndIndex = paddedCollection.index(
      paddedCollection.startIndex,
      offsetBy:paddingCount - count)
    paddedCollection.replaceSubrange(paddedCollection.startIndex..<paddingEndIndex, with: self)
    return paddedCollection
  }

  func pad3(with element: Element, toCount paddingCount: Int) -> Self {
    guard count < paddingCount else { return self }
    var paddedCollection = Self()
    paddedCollection.reserveCapacity(paddingCount)
    paddedCollection.append(contentsOf: repeatElement(element, count: paddingCount - count))
    paddedCollection.append(contentsOf: self)
    return paddedCollection
  }
}

The time taken to pad to twice the length of the original collection was considered for this benchmark. I performed 5 consecutive runs.

example using swift-collections-benchmark:

benchmark.addSimple(
  title: "pad1",
  input: [Int].self
) { input in
  blackHole(input.pad1(with: 0, toCount: input.count * 2))
}

The results are interesting....

chartint

pad1, which I assumed would performed the worse was actually the fastest for inputs from approx. size 512 to 64k. pad2 and pad3 perform similarly.

This leaves me with no clearly correct approach...

What do you think about it @timvermeulen ? Could there be something wrong with the testing strategy?

Sameesunkaria commented 3 years ago

When I add a 4th implementation to benchmark, pad3 ends up being equivalent or faster than pad1. I have repeated this a few times and got similar results.

The four implementations:

extension RangeReplaceableCollection {
  func pad1(with element: Element, toCount paddingCount: Int) -> Self {
    guard count < paddingCount else { return self }
    return Self.init(repeating: element, count: paddingCount - count) + self
  }

  func pad2(with element: Element, toCount paddingCount: Int) -> Self {
    guard count < paddingCount else { return self }
    var paddedCollection = Self.init(repeating: element, count: paddingCount)
    let paddingEndIndex = paddedCollection.index(
      paddedCollection.startIndex,
      offsetBy:paddingCount - count)
    paddedCollection.replaceSubrange(paddedCollection.startIndex..<paddingEndIndex, with: self)
    return paddedCollection
  }

  func pad3(with element: Element, toCount paddingCount: Int) -> Self {
    guard count < paddingCount else { return self }
    var paddedCollection = Self()
    paddedCollection.reserveCapacity(paddingCount)
    paddedCollection.append(contentsOf: repeatElement(element, count: paddingCount - count))
    paddedCollection.append(contentsOf: self)
    return paddedCollection
  }

  func pad4(with element: Element, toCount paddingCount: Int) -> Self {
    guard count < paddingCount else { return self }
    var paddedCollection = self
    paddedCollection.insert(contentsOf: repeatElement(element, count: paddingCount - count), at: paddedCollection.startIndex)
    return paddedCollection
  }
}

The data was recorded over 5 runs.

This is how benchmark was performed: ```swift benchmark.addSimple( title: "pad1", input: [Int].self ) { input in blackHole(input.pad1(with: 0, toCount: input.count * 2)) } benchmark.addSimple( title: "pad2", input: [Int].self ) { input in blackHole(input.pad2(with: 0, toCount: input.count * 2)) } benchmark.addSimple( title: "pad3", input: [Int].self ) { input in blackHole(input.pad3(with: 0, toCount: input.count * 2)) } benchmark.addSimple( title: "pad4", input: [Int].self ) { input in blackHole(input.pad4(with: 0, toCount: input.count * 2)) } ```

pad3 seems to be the best performer from these results.

chart

I can't explain why the results change when another case is added. Could it be related to some additional compiler optimizations or speculative execution?

timvermeulen commented 3 years ago

I can't explain why the results change when another case is added. Could it be related to some additional compiler optimizations or speculative execution?

I'm not completely sure, perhaps @lorentey could comment on this. But let's go with option 3, with the slight modification of computing count only once as it can be an expensive operation.

pad1, which I assumed would performed the worse was actually the fastest for inputs from approx. size 512 to 64k.

I expect most padding operations to be done on collections much smaller than 512 elements, so pad3 shows strong results regardless of which graph we look at, and its implementation is very straightfoward as well.

Sameesunkaria commented 3 years ago

Pushed the changes for padding on RangeReplaceableCollection. Looking for feedback! 🙌

Sameesunkaria commented 3 years ago

Wow, looks like I am horrible at proof reading text 😅 Thanks for the review ✌️

timvermeulen commented 3 years ago

@swift-ci Please test

Sameesunkaria commented 3 years ago

Looks like I forgot to update the description in Readme.

timvermeulen commented 3 years ago

@swift-ci Please test

Sameesunkaria commented 3 years ago

@LucianoPAlmeida @timvermeulen I have updated the implementation for padStart and padEnd.

lorentey commented 3 years ago

I can't explain why the results change when another case is added. Could it be related to some additional compiler optimizations or speculative execution?

I'm not completely sure, perhaps @lorentey could comment on this. But let's go with option 3, with the slight modification of computing count only once as it can be an expensive operation.

Benchmarking is tricky!

My (unverified, largely uninformed) hypothesis is that adding new code can change the layout of the generated binary in a way that affects the effectiveness of the instruction cache in unrelated parts. (The swift benchmark suite (and Utils/run-benchmarks.sh in Swift Collections) uses -align-module-to-page-size to try mitigating this, but it operates on modules, so it's unlikely to have much effect in this case. Moving each task to a separate module wouldn't be very practical.)

(Take this with a grain of salt though -- I have no idea how instruction caches work, and I should probably read up on it.) :-)

LucianoPAlmeida commented 3 years ago

@LucianoPAlmeida @timvermeulen I have updated the implementation for padStart and padEnd.

Looks great @Sameesunkaria :)

LucianoPAlmeida commented 3 years ago

@swift-ci Please test

Sameesunkaria commented 3 years ago

@timvermeulen What are the next steps here?

Sameesunkaria commented 3 years ago

Hey folks! Just checking up. How can I get this PR moving along?

timvermeulen commented 3 years ago

@Sameesunkaria Sorry for the delays, we wanted to get 1.0 out the door first. Looks like this branch needs to be updated, and then let's land this!

Sameesunkaria commented 3 years ago

I have merged in main so its easier to see the diff. But finally I would like to squash everything into a commit if allowed.

LucianoPAlmeida commented 3 years ago

@swift-ci Please test

timvermeulen commented 2 years ago

@xwu Thanks for bringing this up! You're absolutely right that these methods don't necessarily do what they say they do when strings are involved, and we should document this given that these will primarily be used with strings.

That said, passing a combining character to this method, or starting out with a string that starts or ends with a combining character that the padding character combines with, is unusual enough that I don't think it's necessary to change the behavior or signature of any of these methods. Checking the count of the collection at the end of the operation invokes a fairly severe performance penalty that is unnecessary for just about all use cases, and it's unclear what the best way to recover from a count mismatch would be.

xwu commented 2 years ago

Yeah, I guess I'm just voicing that, personally, I feel leery of providing 'pad' methods that are clearly geared towards strings (even with the name) that don't--and as you say, can't practically--do what they say they do. Perhaps if they were not named and documented in a way that invites such use it would be less glaring of an issue, and then it's not the behavior that needs modifying.

But, recalling that the objective of this project (as stated) is to vend canonical implementations that "improve the correctness and performance" of user code... In no small part that comes from the canonical implementation handling the "unusual enough" corner cases that as the end user you're likely to have thought of when rolling your own, whether that's a correctness issue or a performance cliff.

Sameesunkaria commented 2 years ago

For what it's worth; using the existing implementation, if I try to padEnd Cafe to the length of 10 using an acute accent (\u{0301}), the final count of the string remains unchanged at 4.

All the combining characters I tried showed the same behavior. So, I can't really see a way to ever "pad" a string to the desired length in Swift when the padding character is free to be any Unicode character.

xwu commented 2 years ago

@Sameesunkaria For what it's worth, I'm not overly excited about the case where you're using an isolated combining character like the acute accent, which is (I believe) considered a "degenerate" case that explicitly shouldn't have special handling. If that were all, then we would be right to make no attempt to address the problem, because that's what Unicode directs us to do regarding "degenerate" cases.

But not all cases where strings don't pad the way you'd expect involve "degenerate" cases: this is something that occurs even when concatenating two sequences of well-formed Unicode grapheme clusters. For example (as we've just been discussing on the forums about character classes), regional indicators are standalone grapheme clusters because they're classified as symbols. So "🇧" is a well-formed Unicode string with character count 1. However, if you pad a string with it, you're only going to get half as much padding as you'd expect, because two of these back-to-back encode the flag for Barbados: "🇧🇧".

(Ironically, I'm on a Windows machine at the moment, and there's no emoji symbol for that flag in the system font, so I see two B's but the caret correctly skips them both when I advance with the right arrow.)

Sameesunkaria commented 2 years ago

This cannot be determined by induction because some characters will combine only once or a finite number of times and others infinitely.

Based on this comment, if for any well-formed string we cannot generate a string of the desired length with the specified well-formed padding character, then would it even be possible to write a padding function in Swift?

If not, we may consider renaming the method to better describe its actual behavior.

Sameesunkaria commented 2 years ago

So, folks where do we go from here?

Based on the discussions we had above, I see three possible options. I have gathered some comments here.

  1. Keep the same implementation (possibly changing the name) and document the behavior for strings. However, I agree with @xwu that we must be careful while providing such APIs.

    [...] I feel leery of providing 'pad' methods that are clearly geared towards strings (even with the name) that don't--and as you say, can't practically--do what they say they do. Perhaps if they were not named and documented in a way that invites such use it would be less glaring of an issue, and then it's not the behavior that needs modifying.   But, recalling that the objective of this project (as stated) is to vend canonical implementations that " improve the correctness and performance](https://swift.org/blog/swift-algorithms/)" of user code... In no small part that comes from the canonical implementation handling the "unusual enough" corner cases [...]   https://github.com/apple/swift-algorithms/pull/157#issuecomment-948062591

  2. Find a way to add additional elements if the desired count is not reached (may be a StringProtocol specific implementation). This approach also has its caveats; the most prominent one being that we don't know if it's possible to implement. Performance being another issue. Finally if we decide to go for a best-effort approach where we try to get close to the final count, what would qualify as best-effort?

    [...] consideration needs to be given as to the desired behavior if the supplied padding is a single character that combines entirely with the next or previous character such that the desired count can never be reached because adding an infinite number of the combining character still leaves the string with the same character count. This cannot be determined by induction because some characters will combine only once or a finite number of times and others infinitely.   https://github.com/apple/swift-algorithms/pull/157#pullrequestreview-752167700

[...] Checking the count of the collection at the end of the operation invokes a fairly severe performance penalty that is unnecessary for just about all use cases, and it's unclear what the best way to recover from a count mismatch would be.   https://github.com/apple/swift-algorithms/pull/157#issuecomment-947610660

  1. Acknowledge defeat. A padding API is often exclusively used with strings. IMO dismissing the behavior of this API for strings goes agains the objective of this project (also pointed out by @xwu). Since Swift String's conformance to Collection is implemented on the character view, we may never be able to provide a correct string padding API.

What do you think about it @timvermeulen @xwu?

xwu commented 2 years ago

I don't make the calls here, but personally I think if we can find a non-leading name for this feature (in the legal sense of "leading" as in suggestive of something--here, suggestive of being appropriate for strings), and so definitely not using the term "pad," and we document very well and point out how well-formed strings (not just you examples using degenerate cases) can have unusual behavior, then it could work out.

A question to consider is, how often to folks want to pad non-strings? If the answer is comparatively much more rarely than for strings, then the alternative of vending a proper string-specific implementation--and I'd imagine in the standard library rather than here (since it would require investigating whether there are grapheme-breaking APIs we need so that this can be done without iterative guess-and-check and potential infinite loops) would be the way to go.