Humanizr / Humanizer

Humanizer meets all your .NET needs for manipulating and displaying strings, enums, dates, times, timespans, numbers and quantities
Other
8.68k stars 964 forks source link

Add truncator to truncate words after (before) specified amount of characters. #136

Open hazzik opened 10 years ago

hazzik commented 10 years ago

The idea is to leave last word unbreaked.

"Text with more characters than truncate length"
    .Truncate(12, "...", Truncator.NumberOfCharacters /* Name TBD */);

=> "Text with more..." or =>"Text with..."

I'm not sure about the name of strategy. Also I'm not sure what to do with long words.

MehdiK commented 10 years ago

Not a bad idea. As you said this could get a bit tricky on long words; e.g. enthusiastically is 16 chars long!

Here is some crazy long names for the strategies: Truncator.NoMoreWordsAfterNumberOfCharacters or Truncator.NoMoreWordsBeforeNumberOfCharacters?

dtp263 commented 10 years ago

Im close to finishing this but just want to clarify. "Hello Beautiful World".Truncate(8, "...", NoMoreWordsAfterNumberOfCharacters) ; Will truncate to --> "Hello Beautiful..."

"Hello Beautiful World".Truncate(8, "...", NoMoreWordsBeforeNumberOfCharacters) ; Will truncate to --> "Hello..."

MehdiK commented 10 years ago

Correct.

Still unsure about the names! They're a bit too long and awkward.

justin-edwards commented 10 years ago

How about MaximumLengthWithoutWordBreak(for truncating at the word before) and MaximumLengthCompleteCurrentWord(for truncating after the word that the maximum lands in)? Still long but a little less awkward.

MehdiK commented 10 years ago

Thanks @justin-edwards. I like this better although the word maximum in the name is a bit confusing at the first sight.

alextercete commented 10 years ago

I've just picked this one up. I'm thinking about changing the names, though. What do you think about DiscardIncompleteWord and FinishIncompleteWord?

alextercete commented 10 years ago

Also, it's not clear to me if this should work with a fixed number of characters or a fixed length. I'm thinking both would make sense. That being said, I'd like to propose that truncating strategy can be combined with incomplete word handling, as such:

"Text with more characters than truncate length"
    .Truncate(12, Truncator.FixedLength.DiscardingIncompleteWord());

=> "Text with…"

"Text with more characters than truncate length"
    .Truncate(12, Truncator.FixedLength.FinishingIncompleteWord());

=> "Text with more…"

DiscardingIncompleteWord and FinishingIncompleteWord would be extension methods that could be applied to both Truncator.FixedLength and Truncator.FixedNumberOfCharacters (but only to these two). The idea is to return a truncator Wrapper (which is, of course, ITruncator), that would finish or discard the word (to either left or right) based on the inner truncator result.

How does that sound like?

mexx commented 10 years ago

@alextercete Cool you're going to push this forward. I like the names DiscardIncompleteWord and FinishIncompleteWord.

DiscardingIncompleteWord and FinishingIncompleteWord would be extension methods that could be applied to both Truncator.FixedLength and Truncator.FixedNumberOfCharacters (but only to these two)

That sounds weird, how would the library user know about this limitation?

I'm currently unsure about the concrete use case for this functionality and can't conclude whether incomplete word handling is an option for a strategy or a strategy itself. @hazzik Can you elaborate the use case you had in mind?

alextercete commented 10 years ago

That sounds weird, how would the library user know about this limitation?

I was thinking about adding an interface, that is a specialization of ITruncator (something like ICharacterWiseTruncator), which both FixedNumberOfCharactersTruncator and FixedLengthTruncator would implement. Then the extension methods would only apply to the specialization:

public static ITruncator DiscardingIncompleteWord(this ICharacterWiseTruncator innerTruncator)
{
    // (...)
}

public static ITruncator FinishingIncompleteWord(this ICharacterWiseTruncator innerTruncator)
{
    // (...)
}

By changing the return type of Truncator.FixedLength and Truncator.FixedNumberOfCharacters to ICharacterWiseTruncator, the library user would only be able to use the extension methods for these truncators.

MehdiK commented 10 years ago

Thanks for picking this up @alextercete. I really like the proposed names.

I don't think these should be extension of existing strategies. There are currently three strategies: FixedLength, FixedNumberOfCharacters and FixedNumberOfWords. This functionality obviously doesn't apply to FixedNumberOfWords at all and the other two are more or less the same except that one includes the truncation character and one doesn't, and that doesn't matter when you want to finish or leave out the last word. So I think these should be two new strategies.

One thing I might also add here is that the implementation should be able to truncate from left and right (depending on the provided TruncateFrom parameter). Godspeed :)

alextercete commented 10 years ago

Ok, now I see the source of my confusion. I thought the value for FixedLength was the expected size of the the final string, while the value for FixedNumberOfCharacters was the expected number of non-whitespace characters in the final string.

Thanks for the reminder on left and right truncation. I'll keep that in mind.

I'll put together a Pull Request in the next few days to address this.

hazzik commented 10 years ago

Sorry, I'm little bit late, and haven't read all the discussion yet. But here some of my thoughts: As long as you add DiscardingIncompleteWord() or FinishingIncompleteWord() to Truncator.FixedLength the strategy become not a Fixed Length, but variable length, so I'm strongly disagree with this proposal.

alextercete commented 10 years ago

We ended up deciding against the extension methods. But good point, anyway.

snluu commented 9 years ago

Hi @alextercete! Are you still working on this? Anything I can do to help push this forward?

alextercete commented 9 years ago

I started working on this, but eventually stopped. Looking into it again I remember why I couldn't decide on what approach to take. Here's why:

There are, currently, three truncation strategies available:

  1. Fixed length: The resulting string is going to have exactly the number of characters that was provided as a parameter:

    "abc def ghi".Truncate(8, "...", Truncator.FixedLength); // "abc d..."
  2. Fixed number of characters: The resulting string is going to have exactly the number of non-whitespace characters that was provided as a parameter:

    "ab cd ef gh ij".Truncate(8, "...", Truncator.FixedNumberOfCharacters); // "ab cd e..."
  3. Fixed number of words: The resulting string is going to have exactly the number of words that was provided as a parameter:

    "ab cd ef gh ij".Truncate(3, "...", Truncator.FixedNumberOfWords); // "ab cd ef..."

Word completion would only make sense for 1 and 2. As far as I understand it, this is the expected behavior:

"abc def ghi".Truncate(8, "...", Truncator.FixedLength); // "abc d..."
"abc def ghi".Truncate(8, "...", Truncator.FixedLengthDiscardingIncompleteWord); // "abc..."
"abc def ghi".Truncate(8, "...", Truncator.FixedLengthFinishingIncompleteWord); // "abc def..."

"ab cd ef gh ij".Truncate(8, "...", Truncator.FixedNumberOfCharacters); // "ab cd e..."
"ab cd ef gh ij".Truncate(8, "...", Truncator.FixedNumberOfCharactersDiscardingIncompleteWord); // "ab cd..."
"ab cd ef gh ij".Truncate(8, "...", Truncator.FixedNumberOfCharactersFinishingIncompleteWord); // "ab cd ef..."

Not that I like the giant method names, but I think these are more of an addition to the existing truncation strategies (those that deal with characters as opposed to words) than they are new strategies themselves.

I actually think that all of the truncation API could be modified to allow combining the truncation options. For example:

"abc def ghi".Truncate.With("...").ToLength(8); // "abc d..."
"abc def ghi".Truncate.RightToLeft.With("...").ToLength(8); // "...f ghi"
"abc def ghi".Truncate.With("...").DiscardingIncompleteWord.ToLength(8); // "abc..."

"abc def ghi".Truncate.With("...").ToNumberOfCharacters(8); // "ab cd e..."

"ab cd ef gh ij".Truncate.With("...").ToNumberOfWords(3); // "ab cd ef..."

I believe this change would lead to more reusable and easy-to-test code.

snluu commented 9 years ago

Do you think it would be simpler to do something like this?

"abc def ghi".TruncateByWords(5 /* desiredChars */, Truncator.CompleteWord); // "abc def"
"abc def ghi".TruncateByWords(5 /* desiredChars */, Truncator.CompleteWord).With("..."); // "abc def..."
"abc def ghi".TruncateByWords(5 /* desiredChars */, Truncator.DiscardIncompleteWord); // "abc"
"abc def ghi".TruncateByWords(5 /* desiredChars */, Truncator.DiscardIncompleteWord).With("..."); // "abc..."

// vs.
"abc def ghi".Truncate(5 /* maxChars */); // "abc d"
"abc def ghi".Truncate(5 /* maxChars */).With("..."); // "abc d..."
"abc def ghi".Truncate(8 /* maxChars */); // "abc def" (the whitespace is trimmed)
"abc".Truncate(5 /* maxChars */); // "abc"
"abc".Truncate(5 /* maxChars */).With("..."); // "abc"

TruncateByWords understands the notion of words. Note that Truncate takes in a maxChars, implying the output could be less than that (happens when trimming white spaces). Otherwise it'd just be syntactic surgar for SubString.

Note this interesting use case. This is how one can ensure the display string won't overflow:

string sentence= "Pneumonoultramicroscopicsilicovolcanoconiosis is a very long word";
string display = sentence.TruncateByWords(10 /* desiredChars*/, Truncator.DiscardIncompleteWord);
// display == string.Empty;
if (string.IsNullOrWhiteSpace(display))
{
    display = sentence.Truncate(maxChars: 10).With("..."); // "Pneumonoul..."
}
MehdiK commented 9 years ago

Thanks for reviving this thread. There's some prior (abandoned) work on this at https://github.com/MehdiK/Humanizer/pull/210. That might be an easier starting point!

Bartmax commented 8 years ago

I just made one myself before visiting this thread.

Maybe it's useful to someone:

public class WordsInLengthTruncator : ITruncator
    {
        public string Truncate(string value, int length, string truncationString, TruncateFrom truncateFrom = TruncateFrom.Right)
        {
            if (value == null)
                return null;

            if (value.Length == 0)
                return value;

            if (value.Length < length) return value;

            if (truncationString.Length > length) throw new ArgumentOutOfRangeException("truncationString cannot be longer than the max length");

            var words = value.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

            int currentLength = 0;
            words = truncateFrom == TruncateFrom.Right ? words : words.Reverse().ToArray();
            foreach (var word in words)
            {
                if (currentLength + word.Length + truncationString.Length > length) break;
                currentLength += (word.Length +1);// +1 to add space character for length computation.
            }
            if (currentLength == 0) currentLength = length; // this is a special case where first word is longer than length, it will break it on characters.
            else currentLength = truncateFrom == TruncateFrom.Right
                    ? currentLength - 1  // -1 removes the extra space character on Right direction.
                    : currentLength + 1; // +1 removes the extra space character on Left direction.

            return truncateFrom == TruncateFrom.Right
                ? value.Substring(0, currentLength) + truncationString
                : truncationString + value.Substring(value.Length - currentLength);

        }
    }
hazzik commented 8 years ago

@Bartmax thanks, but your currentLength calculation algorithm is very unintuitive and non performant.

  1. value.Split((char[])null, StringSplitOptions.RemoveEmptyEntries); - searches whole string for whitespace characters and then splits a string.
  2. then you count words.

The more optimal and easy to understand solutions is following:

//This only for `TruncateFrom.Left`
int currentLength = length;
for (; index < value.Length; index++)
{
    if (char.IsWhiteSpace(value[index]))
    {
        break;
    }
}
Bartmax commented 8 years ago

@hazzik well.... to be fair, I just bought that from Humanizr own truncator for consistency.

https://github.com/Humanizr/Humanizer/blob/dev/src/Humanizer/Truncation/FixedNumberOfWordsTruncator.cs#L19

I see how then you do a foreach. I find myself the foreach more unintuitive but that may be just me. can't say anything about performant, specially without data. Based on the code I would assume you just throw non performant just because.

aloisdg commented 8 years ago

Any update on this?

gdoron commented 7 years ago

@hazzik @mexx Is there a reason why you didn't take @Bartmax 's solution? Do you need a better one from the community? I'm asking because there's still a jump-in tag on the issue, but it hangs dry for almost 3 years now.

aloisdg commented 7 years ago

@gdoron I think we mostly need someone to add tests and a clean PR, if you are motivated ;)

We can improve performance on @Bartmax 's solution and Humanizr own truncator in another PR.

gdoron commented 7 years ago

@aloisdg I'm afraid my hands are full at the moment, we are launching a new product in the next week... and I have 0 knowledge about humanizer except the very basic usage, so I'm afraid I can not accept the task. Sorry. 😢

BTW FYI, @Bartmax, I saw Amazon.com except from truncate on whole words, they also truncate in the middle of numbers, which I think makes a lot of sense in most cases.