Truncating long strings not working

IQ334 commented 2 years ago

Hello. I use the truncation option in the output format to avoid the 'File name too long' error: -o '%(title).80s_%(upload_date)s_%(channel)s'. This should truncate the title to 80 characters. It works in the Python version, but not in the latest release. How do I set the format to truncate the output file name in the Go version?

Kethsar commented 2 years ago

I'll look into adding that. I implemented a very simple and dumb formatter to match the Python formatting strings so people wouldn't have to change them after the move to golang.
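For illustration, here is a minimal sketch of how a Python-style field with a precision, such as %(title).80s, could be expanded in Go. This is not ytarchive's actual formatter: `expandTemplate` and `fieldPattern` are hypothetical names, and the sketch assumes the precision counts Unicode characters (runes), as it does in Python.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// fieldPattern matches Python-style fields such as %(title)s or %(title).80s.
// Group 1 is the field name, group 2 the optional precision.
var fieldPattern = regexp.MustCompile(`%\((\w+)\)(?:\.(\d+))?s`)

// expandTemplate is a hypothetical helper (not ytarchive's code). It replaces
// each known field with its value, truncated to the precision in runes.
func expandTemplate(tmpl string, vals map[string]string) string {
	return fieldPattern.ReplaceAllStringFunc(tmpl, func(m string) string {
		parts := fieldPattern.FindStringSubmatch(m)
		val, ok := vals[parts[1]]
		if !ok {
			return m // leave unknown fields untouched
		}
		if parts[2] != "" {
			if n, err := strconv.Atoi(parts[2]); err == nil {
				if r := []rune(val); len(r) > n {
					val = string(r[:n]) // cut on character boundaries
				}
			}
		}
		return val
	})
}

func main() {
	vals := map[string]string{"title": "日本語タイトル", "upload_date": "20220101"}
	fmt.Println(expandTemplate("%(title).3s_%(upload_date)s", vals)) // 日本語_20220101
}
```

Note that a precision counted in characters does not bound the size in bytes: the three characters kept above still occupy nine bytes in UTF-8.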

IQ334 commented 2 years ago

Thanks. My native language uses double-byte characters, so I often hit the 'File name too long' error.

sheepwng commented 2 years ago

Does the registry fix not work for Windows? https://docs.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=cmd

BobVul commented 2 years ago

@sheepwng The problem is that the registry fix only adjusts the max path length. The max file name length is still fixed at 255 (UTF-16 code units on Windows with NTFS; bytes on Linux with ZFS, XFS, ext).

One of the streams I tried to archive today decided to add ！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！ to the title... those are 3 bytes each in UTF-8. Fun.
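To make the arithmetic concrete, here is a small Go sketch that measures a string in the three units discussed here: UTF-8 bytes, Unicode code points, and UTF-16 code units. The helper name `lengths` and the count of 90 characters are just assumptions for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
	"unicode/utf8"
)

// lengths reports the size of s in UTF-8 bytes, Unicode code points,
// and UTF-16 code units.
func lengths(s string) (bytes, runes, units int) {
	return len(s), utf8.RuneCountInString(s), len(utf16.Encode([]rune(s)))
}

func main() {
	// 90 full-width exclamation marks (U+FF01), 3 bytes each in UTF-8.
	title := strings.Repeat("！", 90)
	b, r, u := lengths(title)
	fmt.Println(b, r, u) // 270 90 90
}
```

Such a title would fit in 255 UTF-16 code units but exceeds a 255-byte limit.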

BobVul commented 2 years ago

@Kethsar With the new implementation, would it be possible to add an option to specify the length in bytes/code units rather than Unicode characters? That ends up being my underlying OS limitation, but I can't guess what kind of characters will end up in the stream title, and setting a low limit will end up truncating English titles unnecessarily.

Kethsar commented 2 years ago

@BobVul Right now the auto-truncation does check the length in bytes, though Go reads each character as UTF-8, and I am unsure whether the string can still end up too long if the OS or fs uses an encoding that makes each character larger than it is in UTF-8. The problem with a hard length limit on bytes alone is that any character that is normally multiple bytes can get cut in the middle and come out as garbage. I don't know if that garbage character might then have adverse effects anywhere, and I don't know how to check if it is a garbage character.
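On checking for a garbage character: one option in Go is the standard library's utf8.ValidString, which reports whether a string is well-formed UTF-8, together with strings.ToValidUTF8, which replaces malformed byte runs. The sketch below is only an illustration; `cutAndRepair` is a hypothetical helper, not part of ytarchive.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// cutAndRepair hard-cuts s at max bytes, ignoring character boundaries.
// It reports whether the raw cut was still valid UTF-8 and returns the cut
// string with any malformed bytes dropped.
func cutAndRepair(s string, max int) (repaired string, wasValid bool) {
	if max < 0 {
		max = 0
	}
	if len(s) > max {
		s = s[:max]
	}
	return strings.ToValidUTF8(s, ""), utf8.ValidString(s)
}

func main() {
	// Each character is 3 bytes, so a cut at byte 4 splits the second one.
	repaired, wasValid := cutAndRepair("日本語", 4)
	fmt.Println(wasValid, repaired) // false 日
}
```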

BobVul commented 2 years ago

Yea, that's a good point. You'd have to remove whole characters and re-check the length at each step, which is a bit ugly. Loop => if > max size, remove last character, repeat. (e: sorry I misread, yea, with an unknown encoding that's impossible, and that's how you've implemented it with UTF-8 already)
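The loop described above (while over the limit, remove the last whole character) could look roughly like this in Go, assuming the string is UTF-8. `truncateToBytes` is a hypothetical name for this sketch, not the function ytarchive uses.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateToBytes removes whole characters from the end of s until it fits
// in max bytes, so a multi-byte character is never split.
func truncateToBytes(s string, max int) string {
	if max < 0 {
		max = 0
	}
	for len(s) > max {
		// size is at least 1 for a non-empty string, even on invalid bytes,
		// so the loop always makes progress.
		_, size := utf8.DecodeLastRuneInString(s)
		s = s[:len(s)-size]
	}
	return s
}

func main() {
	fmt.Println(truncateToBytes("！！！", 7)) // ！！ (6 bytes; a third would need 9)
	fmt.Println(truncateToBytes("abc", 10)) // abc
}
```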

I think most Linux FSes work off bytes, so it's probably fine there. Windows/NTFS is the main exception, but since it supports 255 UTF-16 code units I think it should work with any possible combination of 255 UTF-8 bytes -- worst case you're over-truncating where NTFS could store more but that's such a minor edge case.

I wasn't aware that auto-truncation had been implemented, thanks. I should probably update. I suppose it truncates the end of the filename rather than a specific field?

Kethsar commented 2 years ago

Oh, I was under the impression you were using the latest version and stuff still wasn't working. I added auto-truncating the title portion of the file name since it's the only thing that would likely push it beyond 255 bytes long. https://github.com/Kethsar/ytarchive/commit/64ec1d7b6ce59961376bb5556003c9ed9d737158#diff-3a710ab6a1dd3264a76a1e4c4c3ebcee14762ef3a66f707726e17fd5fa255715R770

BobVul commented 2 years ago

Yea, I do need to update. I just thought I'd check for open issues first and found this one. Based on this and the comment on #76, I thought it wasn't implemented yet.

Nice that it just does the title too, that's pretty perfect for this case. Thanks!

Kethsar / ytarchive

Truncating long strings not working #60