TheLastGimbus / GooglePhotosTakeoutHelper

Script that organizes the Google Takeout archive into one big chronological folder
https://aur.archlinux.org/packages/gpth-bin
Apache License 2.0

Out of Memory Issue with 30k+ Media Files #164

Closed by matt-boris 1 year ago

matt-boris commented 1 year ago

While guessing the dates from the files, my system (8GB RAM) runs out of memory :(

I get to about here before the system kills the script due to memory.

Guessing dates from files : ██████████████████████████.............. 20751/31442

Any suggestions? Could files be written to the output folder on an ongoing basis instead of keeping all the info in memory?

Thanks again!

matt-boris commented 1 year ago

I'm going to just run the tool multiple times on each input folder and have multiple outputs. I'd love to be able to run all at once and use the divide into dates feature!

TheLastGimbus commented 1 year ago

Oh shit :fearful: never thought this would happen

> run the tool multiple times on each input folder

Best solution would be to:

  1. Extract all zips
  2. Merge them into one big folder with "year folders"
  3. Divide the final year folders into ~4 groups, move them to separate folders and run on those groups individually

Don't do it on each unzipped folder as-is, because the zips are fragmented randomly, and the contents of one "year folder" may be spread over several zips
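
For anyone who wants to script the merge-and-split steps above, here is a rough Python sketch. It is not part of gpth. It assumes year folders are named "Photos from YYYY", and the function names, the "groupN" output folders and the skip-on-name-collision policy are all made up for illustration. Groups are balanced by folder count, not by size.

```python
import shutil
from pathlib import Path


def merge_year_folders(extracted_dirs, merged_root):
    """Move the contents of every year folder found under the extracted zips
    into one merged tree, so each year ends up in a single folder."""
    merged_root = Path(merged_root)
    for extracted in map(Path, extracted_dirs):
        # Assumption: year folders are named "Photos from YYYY".
        year_dirs = sorted(p for p in extracted.rglob("Photos from *") if p.is_dir())
        for year_dir in year_dirs:
            target = merged_root / year_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for item in sorted(year_dir.iterdir()):
                dest = target / item.name
                if dest.exists():
                    # Hypothetical policy: on a name collision, leave the
                    # file where it is instead of overwriting.
                    continue
                shutil.move(str(item), str(dest))


def split_into_groups(merged_root, groups_root, n_groups=4):
    """Move merged year folders round-robin into n_groups sibling folders
    and return the folder names assigned to each group."""
    year_dirs = sorted(p for p in Path(merged_root).iterdir() if p.is_dir())
    groups = [[] for _ in range(n_groups)]
    for i, year_dir in enumerate(year_dirs):
        groups[i % n_groups].append(year_dir)
    for idx, members in enumerate(groups, start=1):
        group_dir = Path(groups_root) / f"group{idx}"
        group_dir.mkdir(parents=True, exist_ok=True)
        for year_dir in members:
            shutil.move(str(year_dir), str(group_dir / year_dir.name))
    return [[p.name for p in members] for members in groups]
```

Each resulting group folder can then be passed to gpth as a separate `--input`.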

TheLastGimbus commented 1 year ago

Like honestly i don't have any good idea how this happens... the heaviest that Media class could weigh is, idk, 128 bytes? 128 bytes * 31442 ~= 4 MB
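
The back-of-the-envelope estimate, written out as a quick check. The 128 bytes per object is the rough guess from above, not a measured size, and the constant names are made up for illustration:

```python
MEDIA_OBJECT_BYTES = 128  # rough per-file guess from the comment, not measured
FILE_COUNT = 31442        # total file count from the progress bar in the report

total_bytes = MEDIA_OBJECT_BYTES * FILE_COUNT
print(f"{total_bytes} bytes ~= {total_bytes / 1e6:.1f} MB")
# prints: 4024576 bytes ~= 4.0 MB
```

So per-file bookkeeping alone would be nowhere near the 8 GB the machine has.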

Maybe updated Dart will help when i do new release...

If there is any Dart expert that can identify why, pls help

TheLastGimbus commented 1 year ago
matt-boris commented 1 year ago
  • Are you using interactive mode (i suppose you're not)?

Yeah, I'm not.

  • Can you send a screenshot of memory usage? With how much gpth takes over time

I'll get around to this sometime either this weekend or the upcoming week. I'm not sure how in-depth I can get for you on a Synology NAS, but I'll do some digging around.

> Maybe updated Dart will help when i do new release...

Probably wouldn't hurt! 🤞🏻

matt-boris commented 1 year ago

@TheLastGimbus So this is over about 30 min of running the tool. Really interesting to see the massive jumps in memory utilization. gpth is using about 6GB of memory at the time when it gets killed.

I wonder if https://docs.flutter.dev/development/tools/devtools/memory can be used to determine if there's a memory leak of some sort.

[Screenshot 2023-02-04 at 8 47 31 PM]

matt-boris commented 1 year ago

Bumping the Dart version unfortunately didn't help :( I'll keep trying stuff!

matt-boris commented 1 year ago

Trying to use the DevTools to see what's eating up all the memory.

Using dart --enable-vm-service ./bin/gpth.dart --input <input> --output <output> --copy --divide-to-dates

I could send you the memory dump once this is done if you'd like?

TheLastGimbus commented 1 year ago

yesss that would absolutely help

i started making nightly builds with some options disabled, but looks like you've got Dart figured out:

my theory is that may be something wrong in reading jsons/exifs?

could you please try disabling (commenting out) json/exif/both extractors here (guess can be left enabled):

https://github.com/TheLastGimbus/GooglePhotosTakeoutHelper/blob/38ea053b60253f1ff3c8eb3d89c9fe7b8aeee6fb/bin/gpth.dart#L110-L114

matt-boris commented 1 year ago

Save these into CSVs https://paste.mozilla.org/2qBN5kiA, https://paste.mozilla.org/rtDtAees

I've no idea where this export button within DevTools downloads anything to on my local machine. Will ping you if I find it 😬

[Screenshot 2023-02-08 at 2 35 26 PM]

matt-boris commented 1 year ago

@TheLastGimbus yeah it's definitely those date extractors! This run, the tool flew through the input (all 30k+ files) and is already copying them to their destination folder. This is the furthest I've gotten now 🎉

Would still be nice to get dates on these files though :)

matt-boris commented 1 year ago

I'll bet you I run into my issue when this line is hit on a large file (multiple GB video files) and my system just runs out of memory. https://github.com/TheLastGimbus/GooglePhotosTakeoutHelper/blob/38ea053b60253f1ff3c8eb3d89c9fe7b8aeee6fb/lib/date_extractors/exif_extractor.dart#L10

The CPU on the NAS may not be fast enough to garbage collect it all in time before more bytes are read in.

If this exif extractor was a little smarter around its memory usage, that'd be great, since I'd still be able to benefit from it with all other files that don't affect the memory nearly as much.
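
To illustrate what "a little smarter" could look like, here is a minimal Python sketch (gpth itself is Dart, and this is not necessarily how the tool handles it). It skips files above a size cutoff and otherwise reads only a bounded prefix, on the assumption that EXIF data in typical photos sits near the start of the file. Both limits and the function name are made up for illustration.

```python
import os

# Hypothetical limits, chosen only for illustration.
MAX_EXIF_FILE_SIZE = 64 * 1024 * 1024  # skip anything larger (e.g. big videos)
HEADER_BYTES = 512 * 1024              # read at most this much per file


def read_bytes_for_exif(path, max_size=MAX_EXIF_FILE_SIZE, header_bytes=HEADER_BYTES):
    """Return at most header_bytes from the start of the file, or None when
    the file is larger than max_size and should be left to other extractors."""
    if os.path.getsize(path) > max_size:
        return None
    with open(path, "rb") as f:
        # Bounded read: memory use per file never exceeds header_bytes.
        return f.read(header_bytes)
```

With something like this, a multi-GB video never gets loaded into RAM, while ordinary photos still go through the exif path.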

TheLastGimbus commented 1 year ago

oh shit...

:joy: :joy: :joy:

got it, will fix in a second...

TheLastGimbus commented 1 year ago

done! fixed with e0d9ee3e71def69d74eba7cf5ec204672924726d / https://github.com/TheLastGimbus/GooglePhotosTakeoutHelper/releases/tag/v3.3.3