ambv / bitrot

Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay.
MIT License

Improve performance using threads #23

Closed: liloman closed this issue 4 years ago

liloman commented 7 years ago

Hi,

I've noticed that runtime scales with the number of files. I'm trying to use it on a large number of files and it takes very long.

cd /tmp; mkdir more ; cd more/
#create a 320KB file                                                              
dd if=/dev/zero of=masterfile bs=1 count=327680                                 
#split it into 32768 files (instantly) + masterfile = 32769
split -b 10 -a 10 masterfile    
#waiiiiiiiiiiiiiiiiiiiiiiiit
bitrot -v 

I suppose it could be made to work with threads and improve a lot, because calculating the SHA-1 and inserting into SQLite single-threaded for that many files is too old school nowadays. ;)

I'm using an Intel i7, so I have plenty of spare threads to burn. I reckon something like a central buffer/MQ/DB where the files to be hashed get inserted, plus n threads that calculate and insert/update them (or just one extra thread dedicated to SQLite), could work: the workers collect files from the central buffer, n at a time. Sounds like a cool project. ;) A rough sketch of the idea follows.
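(For illustration only, a minimal sketch of that producer/consumer idea, assuming Python 3; this is not bitrot's actual code, and file discovery, the real schema, and error handling are left out. Worth noting: CPython's GIL limits how much CPU-bound hashing can gain from threads alone.)

import hashlib
import queue
import sys
import threading

NUM_WORKERS = 4
paths = queue.Queue()    # central buffer of files to hash
results = queue.Queue()  # hashed results waiting for the DB thread

def hasher():
    # Workers collect files from the central buffer, one at a time.
    while True:
        path = paths.get()
        if path is None:  # sentinel: no more work
            break
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(16384), b''):
                h.update(chunk)
        results.put((path, h.hexdigest()))

def db_writer():
    # A single dedicated thread owns the database, so SQLite only
    # ever sees one writer.
    while True:
        item = results.get()
        if item is None:
            break
        # e.g. cur.execute('INSERT OR REPLACE ...', item) in real code
        print('%s  %s' % item)

workers = [threading.Thread(target=hasher) for _ in range(NUM_WORKERS)]
writer = threading.Thread(target=db_writer)
writer.start()
for w in workers:
    w.start()
for path in sys.argv[1:]:
    paths.put(path)
for _ in workers:
    paths.put(None)   # one sentinel per worker
for w in workers:
    w.join()
results.put(None)
writer.join()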

I'm using it for this tool. ;)

https://github.com/liloman/heal-bitrots

Cheers!

ambv commented 7 years ago

Do you see the tool being bound by CPU on your box? In my case it doesn't even get to 100% CPU on one core; this is I/O-bound. Given that, adding threads would complicate the code but is unlikely to give us noticeable performance wins.

liloman commented 7 years ago

Hi ambv,

Yes, totally 100% CPU; just note that /tmp is tmpfs, so it's all in RAM.

Regarding the solution:

1. I don't know Python (you seem to know a bit... ;) )

2. I just wrote down some ballpark ideas. The normal and obvious way would be to measure the bottlenecks first and then fix them. ;)

#time bitrot -v (interrupted with Ctrl-C)
 60.2%^CTraceback (most recent call last):
  File "/home/charly/.local/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
    bt.run()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
    cur, p_uni, new_mtime, new_sha1,
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
    if os.path.exists(stored_path):
  File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
    os.stat(path)
KeyboardInterrupt

real    1m28.468s
user    1m12.890s
sys 0m14.966s

#time for file in *; do sha1sum $file &>/dev/null; done
real    1m11.391s
user    0m19.609s
sys 0m50.629s

So, CPU-bound? ;)

By the way, I did some VERY quick and dirty profiling with KCacheGrind before writing this yesterday. I don't know Python, so I just used the first search hit:

https://julien.danjou.info/blog/2015/guide-to-python-profiling-cprofile-concrete-case-carbonara

And I think I remember seeing it dominated by lstat. :)
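(For the record, the gist of that guide as a sketch: dump cProfile stats from a run and inspect the hottest calls. This assumes bitrot is importable; run_from_command_line is the entry point visible in the traceback above. The resulting .prof file can be converted for KCacheGrind with pyprof2calltree.)

import cProfile
import pstats

from bitrot import run_from_command_line

# Profile a full run and dump the stats to a file...
cProfile.run('run_from_command_line()', 'bitrot.prof')

# ...then print the 10 most expensive calls by cumulative time.
pstats.Stats('bitrot.prof').sort_stats('cumulative').print_stats(10)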

PS: I was thinking you had abandoned this project...

Cheers and thanks!! :)

liloman commented 7 years ago

To clarify: when I launch bitrot, it starts at around 50% CPU and keeps increasing. So:

#time bitrot -v 
 77.7%^CTraceback (most recent call last):
  File "/home/charly/.local/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
    bt.run()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
    cur, p_uni, new_mtime, new_sha1,
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
    if os.path.exists(stored_path):
  File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
    os.stat(path)
KeyboardInterrupt

real    11m15.066s
user    8m56.544s
sys 2m14.869s

ambv commented 7 years ago

If you're willing to test how this could work with a multiprocessing worker pool, I'm happy to review this. I'm not going to be able to spend time implementing this myself.

liloman commented 7 years ago

OK, I will give it a try, not sure if this week or the next, but it seems pretty easy and fun: :)

https://wltrimbl.github.io/2014-06-10-spelman/intermediate/python/04-multiprocessing.html

Could you recommend any profiling tool for Python? (Linux)

Cheers!

liloman commented 7 years ago

Done: it was just a stat and SQLite issue. Now it goes blaaaazing fast, even on my laptop: ;)

#time ~/Clones/bitrot/src/bitrot.py 
Finished. 0.62 MiB of data read. 0 errors found.
32769 entries in the database, 32769 new, 0 updated, 0 renamed, 0 missing.
Updating bitrot.sha512... done.

real    0m4.879s
user    0m4.086s
sys 0m0.726s

I have created some tests in bats (Bash) to support implementing it (what a burden it is to change something without tests ;) )

ambv commented 4 years ago

So... using threads didn't help much. But then I used a ProcessPoolExecutor and the thing is easily 4X faster on my box. Here's a test on a 5 GB directory of 1000 files.

Before:

bitrot -tv  7.46s user 1.33s system 88% cpu 9.993 total

After:

bitrot -tv  14.68s user 2.67s system 814% cpu 2.129 total
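
(For reference, a minimal sketch of the ProcessPoolExecutor approach, not the actual patch; sha1_of and walk are made-up helper names. Hashing in worker processes sidesteps the GIL that kept the thread version from scaling.)

import hashlib
import os
import sys
from concurrent.futures import ProcessPoolExecutor

def sha1_of(path, chunk_size=16384):
    # Runs in a separate worker process, so hashing uses all cores.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return path, h.hexdigest()

def walk(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

if __name__ == '__main__':
    # Defaults to one worker per CPU core.
    with ProcessPoolExecutor() as pool:
        for path, digest in pool.map(sha1_of, walk(sys.argv[1])):
            print(digest, path)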

p1r473 commented 4 years ago

@ambv, do you have any insight into optimal chunk sizes? Right now I am using 1048576, which is 1 MiB. It used to be 16384, which is the block size in HFS+ and 4X the block size in ext4.

I have found chunk size to REALLY affect the time needed to hash on my RAID magnetic drives.
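(If anyone wants to measure this on their own drives, a rough sketch; the candidate sizes are arbitrary examples. Note that the page cache makes repeated runs over the same file unrealistically fast, so drop caches or use different files between runs.)

import hashlib
import sys
import time

# Hash the same file with several chunk sizes and report wall-clock
# time, to find the sweet spot for a particular disk setup.
CANDIDATES = [16384, 131072, 1048576, 4194304]

path = sys.argv[1]
for chunk_size in CANDIDATES:
    h = hashlib.sha1()
    start = time.perf_counter()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    print('%8d bytes/read: %.2fs' % (chunk_size, time.perf_counter() - start))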

ambv commented 4 years ago

Interesting, that's like 64X bigger. The last time I measured it was way back in 2013 and I haven't touched it since. What's the difference that you're observing?

p1r473 commented 4 years ago

> Interesting, that's like 64X bigger. The last time I measured it was way back in 2013 and I haven't touched it since. What's the difference that you're observing?

@ambv Going from about 7 hours to 4 hours on 4-ish terabytes on magnetic drives in RAID 0. I have been trying to see if RAID stripe size matters too... I am not an expert on this stuff, but a bigger chunk size is definitely faster.