drivendataorg / zamba

A Python package for identifying 42 kinds of animals, training custom models, and estimating distance from camera trap videos
https://zamba.drivendata.org/docs/stable/
MIT License

Checking for missing files in parallel #224

Closed AllenDowney closed 2 years ago

AllenDowney commented 2 years ago

Closes #216

Checking for missing files is slow with goofys, apparently because each check makes many small queries to the file system. Running the checks in parallel with pqdm is much faster. The speed depends on the state of the file system cache, but we can check 246,000 files in 5-8 minutes, compared to about two hours when checking them one at a time.

Using pqdm with threads is faster than with processes. Using 16 threads seems to be fast and robust. With more threads, things go faster, but you start to see unpredictable I/O errors.

If an error occurs during the parallel check, the code falls back to checking the files sequentially.
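A minimal sketch of the pattern described above, not the code in this PR. It swaps in the standard library's `ThreadPoolExecutor` for pqdm so that it runs without extra dependencies. The function names, the `n_threads` parameter, and the choice to catch `OSError` are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_missing_sequential(paths):
    """Check each path one at a time (the slow fallback)."""
    return [p for p in paths if not Path(p).exists()]


def find_missing_parallel(paths, n_threads=16):
    """Check paths concurrently; fall back to the sequential check on error.

    n_threads=16 mirrors the thread count reported as fast and robust above.
    """
    paths = list(paths)
    try:
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            # pool.map preserves input order, so results line up with paths.
            exists = list(pool.map(lambda p: Path(p).exists(), paths))
    except OSError:
        # Assumption: the unpredictable I/O errors surface as OSError.
        return find_missing_sequential(paths)
    return [p for p, ok in zip(paths, exists) if not ok]
```

Calling `find_missing_parallel(filepaths)` returns the subset of `filepaths` that do not exist, in their original order. Threads suit this workload because each existence check spends its time waiting on the file system rather than using the CPU.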

This fix has only been tested with video files that are mounted from S3 using goofys. It might be good to test with videos stored in a local file system, too.

netlify[bot] commented 2 years ago

Deploy Preview for silly-keller-664934 ready!

Latest commit: b5fd541386e3c7a9e0b98c71ca0597c29c1e0873
Latest deploy log: https://app.netlify.com/sites/silly-keller-664934/deploys/6318b9d4cd2d0a0008a9ee62
Deploy Preview: https://deploy-preview-224--silly-keller-664934.netlify.app

github-actions[bot] commented 2 years ago

🚀 Deployed on https://deploy-preview-224--silly-keller-664934.netlify.app

codecov-commenter commented 2 years ago

Codecov Report

Merging #224 (b5fd541) into master (17d291b) will decrease coverage by 0.0%. The diff coverage is 77.7%.

@@           Coverage Diff            @@
##           master    #224     +/-   ##
========================================
- Coverage    87.0%   87.0%   -0.1%     
========================================
  Files          29      29             
  Lines        1930    1937      +7     
========================================
+ Hits         1681    1686      +5     
- Misses        249     251      +2     
Impacted Files            Coverage Δ
zamba/models/config.py    96.7% <77.7%> (-0.6%) ⬇️

AllenDowney commented 2 years ago

Confirmed that it works with 22 local files (a convenience sample of videos).