datawhores / OF-Scraper

A completely revamped and redesigned fork, reimagined from scratch based on the original onlyfans-scraper
MIT License

Possible to let us choose encoding type for text/filename when truncating? #435

Closed: HelpfulDucker closed this issue 4 months ago

HelpfulDucker commented 4 months ago

Is your feature request related to a problem? Please describe.

The recent change to the get_string_byte_size() function has reduced the maximum filename length by about half for me. I'm running ofscraper in an Ubuntu VM on Unraid. Both systems use UTF-8. I have also mounted the save_location with the exact same path as it is on Unraid: /mnt/user/Media/

"file_format": "{date} {post_id} {text}.{ext}",
"truncation_default": true

Before the recent changes to the truncation function, filenames with {text} were much longer; now they are about half the length.

I think the problem is here:

def get_string_byte_size(text):
    """
    This function estimates the byte size of a string considering ASCII characters.

    Args:
        text: The string to analyze.

    Returns:
        The estimated byte size of the string.
    """
    text = str(text)
    total_size = 0
    for char in text:
        try:
            if ord(char) < 128:
                total_size += 2  # 2 bytes for ASCII characters
            else:
                total_size += 4  # 4 bytes for non-ASCII characters
        except ValueError:
            total_size += 4  # 4 bytes if the character cannot be inspected (assumption)
    return total_size

With UTF-8, a character where ord(char) < 128 should be 1 byte.
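For reference, a quick standalone check of this (plain Python, not ofscraper code):

# ASCII characters are 1 byte in UTF-8, while the function above counts them as 2,
# which roughly halves how much {text} fits under a given filename byte limit.
print(len("a".encode("utf-8")))   # 1
print(len("🇮".encode("utf-8")))  # 4 (regional indicator emoji)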

Describe the solution you'd like

Let us choose the encoding by setting it in the config file:

"file_options": {
        ...
        "file_format": "{date} {post_id} {text}.{ext}",
        "truncation_default": true,
        "encoding": "utf-8"
    },
def get_byte_size(input_string):
    byte_size = len(input_string.encode('utf-8'))
    return byte_size

Describe alternatives you've considered

I have thought about using the --original flag to keep the whole filename and renaming the files myself, but that would cause a mismatch with the db.

Additional context

datawhores commented 4 months ago

To be honest, auto truncation has been a slight pain. When dealing with emojis, encoding has worked for me, but I noticed that a particular download kept failing because Python was trying to encode a string like this:

....\u200b🇮\u200b \u200b🇱\u200b\u200b🇴\u200b............

However, this download had regional-indicator emojis that render as Roman characters, and Python could not get a proper length for the file.

With that said, I think I will be leaving the default behavior the same here. But I'm okay with providing users who are a little more comfortable with their operating system's limits a means to change the behavior.

The script provides const values that can be stored in a custom_values dict: https://of-scraper.gitbook.io/of-scraper/getting-started/config-options/advanced-config-options/changing-const#where-are-they

In your case you can set the UTF const to utf-8 to get the behavior of the second function above.

There are also these two const values:

SPECIAL_CHAR_SIZE_UNIX

NORMAL_CHAR_SIZE_UNIX

It is one or the other: either UTF, or the SPECIAL_CHAR_SIZE_UNIX / NORMAL_CHAR_SIZE_UNIX pair, with UTF taking precedence over the second group.
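As a rough sketch, the custom_values entry could look something like this (the exact placement in the config follows the docs linked above; the numeric values are only illustrative):

"custom_values": {
    "UTF": "utf-8"
}

or, to keep the per-character estimate but tune the sizes:

"custom_values": {
    "NORMAL_CHAR_SIZE_UNIX": 1,
    "SPECIAL_CHAR_SIZE_UNIX": 4
}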

datawhores commented 4 months ago

This is the new function

def get_string_byte_size_unix(text):
    """
    This function estimates the byte size of a string considering ASCII characters.

    Args:
        text: The string to analyze.

    Returns:
        The estimated byte size of the string.
    """
    total_size = 0
    # These const values can be overridden via the custom_values dict in the config
    normal_char_size = constants.getattr("NORMAL_CHAR_SIZE_UNIX")
    special_char_size = constants.getattr("SPECIAL_CHAR_SIZE_UNIX")
    utf = constants.getattr("UTF")
    if utf:
        # If a UTF encoding is set, use the real encoded byte length
        return len(text.encode(utf))
    for char in text:
        try:
            if ord(char) < 128:
                total_size += normal_char_size  # per-character size for ASCII characters
            else:
                total_size += special_char_size
        except ValueError:
            total_size += special_char_size  # fallback size for non-ASCII characters (assumption)
    return total_size
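As a rough standalone illustration of the difference between the two modes (plain Python, not ofscraper code; the 2/4 values stand in for the per-character sizes used previously):

text = "2024-01-01 123456 post 🇮🇱"

# with UTF set to "utf-8": the real encoded size
utf_size = len(text.encode("utf-8"))                      # 31 bytes

# default path: per-character estimate
estimated = sum(2 if ord(c) < 128 else 4 for c in text)   # 54 bytes

print(utf_size, estimated)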
HelpfulDucker commented 4 months ago

That fixed it. Thank you for your work.