IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries and file contents using the non-public API
MIT License

Needs better filename sanitizing #53

Closed n4mwd closed 4 years ago

n4mwd commented 4 years ago

2019-10-24 22:16:47 Eastern Daylight Time 842 INFO:archive_photos Fetching photo 'Mae"s Bulkhead!!!' (141/320)
Traceback (most recent call last):
  File "C:\Python27\Scripts\yahoo.py", line 667, in <module>
    archive_photos(yga)
  File "C:\Python27\Scripts\yahoo.py", line 293, in archive_photos
    with open(fname, 'wb') as f:
IOError: [Errno 22] invalid mode ('wb') or filename: u'2071862851-Mae"s Bulkhead!!!.jpg'

The double quote inside the filename seems to be throwing it off. I had another with a question mark. The above error occurred after about an hour of processing due to the large file sizes.

Also, as an enhancement, it would be nice if the script could detect whether a photo has already been downloaded and skip it if so.
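
For illustration, the skip-if-present check could look roughly like the sketch below. This is not code from yahoo.py: save_if_missing and its fetch argument are hypothetical names, and treating a zero-byte file as "not downloaded" is an assumption meant to cover runs that were interrupted mid-write.

```python
import os


def save_if_missing(fname, fetch):
    """Write fetch() to fname unless a non-empty file already exists.

    fetch is a hypothetical zero-argument callable returning the photo bytes.
    Returns True if the file was written, False if it was skipped.
    """
    # A zero-byte file is treated as a failed earlier attempt (assumption).
    if os.path.exists(fname) and os.path.getsize(fname) > 0:
        return False
    data = fetch()
    with open(fname, "wb") as f:
        f.write(data)
    return True
```

Because fetch is only called when the file is missing, a restarted run would not re-download photos that are already on disk.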

n4mwd commented 4 years ago

URLs are sometimes sanitized by converting non-alphanumeric chars into escaped chars. So "Joe"s files" becomes "Joe%22s files", "what about this?" becomes "what about this%3F", and "five %" becomes "five %25".
This could work here and would also turn Yahoo names into legal Windows file names.

This would work until you hit "This .. Is ... a ... really ... long ... filename", which could exceed the maximum filename length. In that case, I would just use the first 20 chars plus a 4-digit hex count. So "Little tommy...[500 chars] ... photo 1" becomes "Little tommy...[8 chars] 0000" and "Little tommy...[500 chars] ... photo 2" becomes "Little tommy...[8 chars] 0001".

There might be a URL encoder in Python that could be repurposed. Just guessing.
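
A minimal sketch of that idea, assuming Python 3, where the standard library's urllib.parse.quote does the percent-encoding (the traceback above is from Python 2.7, where the equivalent function is urllib.quote). The names encode_filename, shorten, SAFE_CHARS and MAX_STEM are made up for this example, and the 20-character cap simply follows the suggestion above.

```python
from urllib.parse import quote

# quote() never escapes letters, digits, or "_.-~". The extra characters
# below are also left alone to keep names readable (an assumption); none of
# them is in the Windows-forbidden set \ / : * ? " < > |
SAFE_CHARS = " !'()+,;=@[]{}"

MAX_STEM = 20  # hypothetical cap, from the "first 20 chars" suggestion


def encode_filename(name):
    """Percent-encode every character outside the safe set."""
    return quote(name, safe=SAFE_CHARS)


def shorten(name, counter, max_stem=MAX_STEM):
    """Truncate an over-long name and append a 4-digit hex counter."""
    if len(name) <= max_stem:
        return name
    return "%s %04x" % (name[:max_stem], counter)


print(encode_filename('Joe"s files'))       # Joe%22s files
print(encode_filename("what about this?"))  # what about this%3F
print(encode_filename("five %"))            # five %25
```

Shortening the raw title before encoding it avoids cutting a %XX escape in half. The caller would have to supply a counter that is unique per truncated name. Non-ASCII characters are also escaped, so encoded names can grow considerably.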

flyintheointment commented 4 years ago

I'm using Windows, so I had to implement a fix for the photos in one of the groups I was downloading.

The following sanitizing worked for the 3 groups I tried it on.

fname = re.sub(r'[\/*?:"<>|-]',"",fname)

You need to implement it in the mkchdir section as well. Hope this helps.
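
As a self-contained variant of the same approach, here is a sketch of a helper that could be shared between the photo file names and the directory names. It is not the code from the branch: strip_invalid and INVALID_CHARS are hypothetical names, the backslash is added to the character class, and the hyphen is left out because it is legal in Windows file names.

```python
import re

# Characters Windows does not allow in file or directory names.
# The backslash is an addition to the pattern quoted above; the hyphen
# is deliberately not included, since Windows accepts it.
INVALID_CHARS = re.compile(r'[\\/*?:"<>|]')


def strip_invalid(name):
    """Delete characters that are invalid in Windows file names."""
    return INVALID_CHARS.sub("", name)


print(strip_invalid(u'2071862851-Mae"s Bulkhead!!!.jpg'))
# 2071862851-Maes Bulkhead!!!.jpg
```

One caveat with deleting characters: two different titles can end up with the same name, so a later file could overwrite an earlier one unless the name carries a unique prefix such as the numeric ID seen in the failing file name.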

WarriorsLance commented 4 years ago

I am having this same problem with several groups. I have a small group that I keep testing to see if there has been a fix, but none so far.

flyintheointment commented 4 years ago

I'm new to this whole GitHub thing, so I don't know how to help fix the problem in the original, but if you look at my branch you'll see I added the fix for fname and mkchdir. I tested it on the group I was having problems with, and it now works fine by deleting the problem characters.

IgnoredAmbience commented 4 years ago

@flyintheointment thanks for the offer, I'm going for a more robust solution though.