Closed: liloman closed this issue 4 years ago.
Do you see the tool being bound by CPU on your box? In my case it doesn't even reach 100% CPU on one core; this is I/O bound. In that sense, adding threads would complicate the code but is unlikely to give us noticeable performance wins.
Hi ambv,
Yes, it totally hits 100% CPU; just note that /tmp is tmpfs, so it's backed by RAM.
Regarding the solution:
1- I don't know Python (you seem to know a bit... ;) )
2- I just wrote some ballpark ideas. The normal and obvious way would be to first measure the bottlenecks and then fix them. ;)
#time bitrot -v (ctrl -c )
60.2%^CTraceback (most recent call last):
File "/home/charly/.local/bin/bitrot", line 30, in <module>
run_from_command_line()
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
bt.run()
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
cur, p_uni, new_mtime, new_sha1,
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
if os.path.exists(stored_path):
File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
os.stat(path)
KeyboardInterrupt
real 1m28.468s
user 1m12.890s
sys 0m14.966s
#time for file in *; do sha1sum $file &>/dev/null; done
real 1m11.391s
user 0m19.609s
sys 0m50.629s
So CPU bound? ;)
By the way, I did a VERY quick and dirty profiling run with KCacheGrind before writing this yesterday; I don't know Python, so I just used the first search hit:
https://julien.danjou.info/blog/2015/guide-to-python-profiling-cprofile-concrete-case-carbonara
And I think I remember seeing it dominated by lstat. :)
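For reference, a minimal sketch of the cProfile-based approach from that guide (names are illustrative, not bitrot's actual code); the resulting profile file can then be converted for KCacheGrind with a tool such as pyprof2calltree:

import cProfile
import pstats

def profile_call(func, outfile='bitrot.prof'):
    # Profile a callable, dump the raw stats to a file, and print the
    # top 20 functions by cumulative time.
    cProfile.runctx('func()', globals(), {'func': func}, outfile)
    stats = pstats.Stats(outfile)
    stats.sort_stats('cumulative').print_stats(20)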
PS: I thought you had abandoned this project...
Cheers and thanks!! :)
To clarify: when I launch bitrot, CPU usage starts at around 50% and keeps increasing. So:
#time bitrot -v
77.7%^CTraceback (most recent call last):
File "/home/charly/.local/bin/bitrot", line 30, in <module>
run_from_command_line()
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
bt.run()
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
cur, p_uni, new_mtime, new_sha1,
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
if os.path.exists(stored_path):
File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
os.stat(path)
KeyboardInterrupt
real 11m15.066s
user 8m56.544s
sys 2m14.869s
If you're willing to test how this could work with a multiprocessing worker pool, I'm happy to review this. I'm not going to be able to spend time implementing this myself.
OK, I will give it a try, not sure if this week or the next, but it seems pretty easy and fun: :)
https://wltrimbl.github.io/2014-06-10-spelman/intermediate/python/04-multiprocessing.html
Could you recommend a profiling tool for Python? (Linux)
Cheers!
Done: it was just a stat and sqlite issue. Now it goes blaaaaazing fast even on my laptop: ;)
#time ~/Clones/bitrot/src/bitrot.py
Finished. 0.62 MiB of data read. 0 errors found.
32769 entries in the database, 32769 new, 0 updated, 0 renamed, 0 missing.
Updating bitrot.sha512... done.
real 0m4.879s
user 0m4.086s
sys 0m0.726s
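The comment doesn't spell out the exact change, but as a purely hypothetical illustration of the "stat" part (the traceback above shows os.path.exists(), i.e. an os.stat() per stored path, as the hot spot), one way to avoid the per-path stat call is to collect every existing path once during the directory walk and test membership in a set; all names below are made up for illustration:

import os

def existing_paths(root):
    # Build the set of existing file paths in one os.walk() pass,
    # instead of calling os.path.exists() once per stored path later.
    paths = set()
    for directory, _, files in os.walk(root):
        for name in files:
            paths.add(os.path.join(directory, name))
    return paths

# later, instead of os.path.exists(stored_path):
# if stored_path in existing:
#     ...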
I have created some tests in bats (bash scripts) to implement it (what a burden it is to work on something without tests ;) )
So... using threads didn't help much. But then I used a ProcessPoolExecutor and the thing is easily 4X faster on my box. Here's a test on a 5 GB directory of 1000 files.
Before:
bitrot -tv 7.46s user 1.33s system 88% cpu 9.993 total
After:
bitrot -tv 14.68s user 2.67s system 814% cpu 2.129 total
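As a rough sketch of the ProcessPoolExecutor idea (function names and the chunk size are illustrative here, not bitrot's actual code): hash each file in a worker process and collect results in the parent, which can then update sqlite from a single process.

from concurrent.futures import ProcessPoolExecutor
import hashlib
import os

def sha1_of(path, chunk_size=1048576):
    # Hash one file in chunks so large files don't have to fit in memory.
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return path, digest.hexdigest()

def hash_all(paths, workers=None):
    # Fan the hashing out to worker processes; results come back to the
    # parent, which stays the only place that touches the database.
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for path, digest in pool.map(sha1_of, paths):
            results[path] = digest
    return results

if __name__ == '__main__':
    paths = [os.path.join(d, f) for d, _, files in os.walk('.') for f in files]
    print(len(hash_all(paths)), 'files hashed')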
@ambv do you have any insight into optimal chunk sizes? Right now I am using 1048576, which is 1 MiB. It used to be 16384, which is the block size in HFS+ and 4X the block size in ext4.
I have found that chunk size REALLY affects the time needed to hash on my magnetic-drive RAID.
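A quick way to check this on a given box is a tiny micro-benchmark along these lines (the file path and the candidate sizes are placeholders; the point is just that the read size is tunable):

import hashlib
import time

def time_hash(path, chunk_size):
    # Time a single chunked SHA-1 pass over the file.
    start = time.time()
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return time.time() - start

for size in (16384, 65536, 1048576, 4194304):
    print(size, time_hash('some_large_file.bin', size))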
Interesting, that's like 64X bigger. The last time I measured it was way back in 2013 and I haven't touched it since. What's the difference that you're observing?
@ambv Going from about 7 hours to 4 hours on roughly 4 terabytes on a magnetic-drive RAID 0. I have been trying to see whether RAID stripe size matters... I am not an expert on this stuff, but a bigger chunk size is definitely faster.
Hi,
I've noticed that the run time scales with the number of files. I'm trying to use it on a large number of files and it takes far too long.
I suppose it could be made to work with threads and improve a lot, because calculating the sha1 and inserting into sqlite single-threaded for that many files is too old school nowadays. ;)
I'm using an Intel i7, so I have plenty of spare threads to burn. I reckon something like a central buffer/MQ/DB where the files to be hashed are queued, with n workers that calculate and insert/update them (or a single dedicated thread just for sqlite), could work: the workers collect files from the central buffer, n at a time. Sounds like a cool project (see the rough sketch below). ;)
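A rough sketch of that idea, under a couple of assumptions: with CPython, a process pool is used instead of threads so the hashing actually runs in parallel, and all sqlite writes stay in the main process, which acts as the single writer. The helper names and table schema are made up for illustration, not bitrot's actual code.

import hashlib
import os
import sqlite3
from multiprocessing import Pool

def hash_one(path, chunk_size=1048576):
    # Worker: hash one file in chunks and return (path, digest).
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return path, digest.hexdigest()

def main(root='.', workers=4):
    # The main process walks the tree, feeds paths to the pool,
    # and is the only process that touches the sqlite database.
    paths = [os.path.join(d, f) for d, _, files in os.walk(root) for f in files]
    conn = sqlite3.connect('hashes.db')
    conn.execute('CREATE TABLE IF NOT EXISTS hashes (path TEXT PRIMARY KEY, sha1 TEXT)')
    pool = Pool(workers)
    for path, digest in pool.imap_unordered(hash_one, paths):
        conn.execute('INSERT OR REPLACE INTO hashes VALUES (?, ?)', (path, digest))
    pool.close()
    pool.join()
    conn.commit()
    conn.close()

if __name__ == '__main__':
    main()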
I'm using it for this tool. ;)
https://github.com/liloman/heal-bitrots
Cheers!