Closed · HelpfulDucker closed this 4 months ago
To be honest, auto-truncation has been a slight pain. When dealing with emojis, encoding has worked for me, but I noticed that a particular download kept failing because Python was trying to encode a string like this:
....\u200b🇮\u200b \u200b🇱\u200b\u200b🇴\u200b............
However, this download had emojis of Roman characters, and Python could not get a proper length for the file.
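To illustrate (a minimal sketch with a made-up string, not the actual filename):

# Minimal sketch: a zero-width space (U+200B) is 3 bytes in UTF-8 and a
# regional-indicator emoji such as 🇮 (U+1F1EE) is 4 bytes, so the encoded
# length diverges from the character count.
s = "\u200b\U0001F1EE\u200b"
print(len(s))                  # 3 characters
print(len(s.encode("utf-8")))  # 3 + 4 + 3 = 10 bytes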
With that said, I think I will be leaving the default behavior the same here. But I'm okay with providing users who are a little more comfortable with their operating system's limits a means to change the behavior.
The script provides const values that can be stored in the custom_values dict: https://of-scraper.gitbook.io/of-scraper/getting-started/config-options/advanced-config-options/changing-const#where-are-they
In your case, you can set the UTF const to utf-8 to get the encode-based behavior shown in the function below.
There are also these two const values:
SPECIAL_CHAR_SIZE_UNIX
NORMAL_CHAR_SIZE_UNIX
But it is one group or the other: either
UTF
or
SPECIAL_CHAR_SIZE_UNIX
NORMAL_CHAR_SIZE_UNIX
with UTF having preference over the second group. A sketch of how these would look in the config follows.
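As a minimal sketch (key names come from the docs linked above; it is shown here as a Python dict for illustration, and the numeric values are illustrative assumptions, not the shipped defaults), the overrides would sit in custom_values roughly like this:

custom_values = {
    "UTF": "utf-8",  # takes preference when set
    # Mutually exclusive with UTF; illustrative values only:
    # "NORMAL_CHAR_SIZE_UNIX": 1,
    # "SPECIAL_CHAR_SIZE_UNIX": 4,
}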
This is the new function:

def get_string_byte_size_unix(text):
    """
    Estimates the byte size of a string, treating ASCII and
    non-ASCII characters differently.

    Args:
        text: The string to analyze.

    Returns:
        The estimated byte size of the string.
    """
    # `constants` is the script's constants helper, which resolves
    # custom_values overrides from the config.
    normal_char_size = constants.getattr("NORMAL_CHAR_SIZE_UNIX")
    special_char_size = constants.getattr("SPECIAL_CHAR_SIZE_UNIX")
    utf = constants.getattr("UTF")
    # When a UTF encoding is configured, it takes preference and the
    # exact encoded length is returned.
    if utf:
        return len(text.encode(utf))
    # Otherwise fall back to the per-character size constants.
    total_size = 0
    for char in text:
        if ord(char) < 128:
            total_size += normal_char_size  # configured size for ASCII characters
        else:
            total_size += special_char_size  # configured size for non-ASCII characters
    return total_size
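A quick, self-contained comparison of the two modes (constants inlined by hand; the 2-byte/4-byte sizes are assumed example values, which would also explain the roughly-halved filename lengths reported in the issue):

text = "clip_\U0001F1EE\U0001F1F1.mp4"  # i.e. "clip_🇮🇱.mp4"
utf_size = len(text.encode("utf-8"))                         # 5 + 4 + 4 + 4 = 17 bytes
fallback_size = sum(2 if ord(c) < 128 else 4 for c in text)  # 9*2 + 2*4 = 26 bytes
print(utf_size, fallback_size)  # 17 26: the fallback overestimates ASCII-heavy names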
That fixed it. Thank you for your work.
Is your feature request related to a problem? Please describe.
The recent change to the get_string_bytes() function has reduced the max filename length by about half for me. I'm running ofscraper in an Ubuntu VM on Unraid. Both systems use UTF-8. I have also mounted the save_location with the exact same path as it has on Unraid:
/mnt/user/Media/
Before the recent changes to the truncation function, filenames with {text} were much longer; now they are about half the length.
I think the problem is in the per-character size estimate: with UTF-8, ord(char) < 128 should be 1 byte.
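That is easy to check (a one-liner, just for illustration):

# Every code point below 128 encodes to exactly one byte in UTF-8.
assert all(len(chr(i).encode("utf-8")) == 1 for i in range(128))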
Describe the solution you'd like
Let us choose the encoding by setting it in the config file.
Describe alternatives you've considered
I have thought about using the --original flag to keep the whole filename and renaming the files myself, but that would cause a mismatch with the db.