childmindresearch / bids2table

Efficiently index large-scale BIDS neuroimaging datasets and derivatives
https://childmindresearch.github.io/bids2table/
MIT License

Investigate os.scandir for initial file crawling #32

Closed nx10 closed 3 months ago

nx10 commented 4 months ago

os.scandir is a fast directory iterator that avoids most extra stat() syscalls, because the OS returns file type information along with each directory entry

https://peps.python.org/pep-0471/
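To illustrate the API described in the PEP, here is a minimal sketch that lists the direct children of one directory. The helper name `list_entries` is made up for this example and is not part of bids2table. On most platforms `DirEntry.is_dir()` can answer from the information the directory listing already returned, so no separate stat() call is needed per entry.

```python
import os


def list_entries(path):
    """Return sorted (name, is_dir) pairs for the direct children of `path`.

    Hypothetical helper for illustration only.
    """
    entries = []
    # os.scandir returns an iterator of DirEntry objects; using it as a
    # context manager closes the underlying directory handle promptly.
    with os.scandir(path) as it:
        for entry in it:
            # is_dir() usually relies on the file type cached in the
            # DirEntry, so it typically does not trigger a stat() syscall.
            entries.append((entry.name, entry.is_dir(follow_symlinks=False)))
    return sorted(entries)
```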

b2t currently uses iglob

https://github.com/childmindresearch/bids2table/blob/c3f4c12d97eb7bfb8f18da8ce4150ec3a2348493/bids2table/extractors/bids.py#L45

I'm not sure what performance improvement is realistic here, but this random benchmark shows a >10x speedup, so it might be worth investigating

https://dev.to/guionardo/fast-folder-iteration-in-python-3g1f

clane9 commented 3 months ago

Thanks for the pointer. I looked into it, and it seems iglob already uses scandir under the hood. I also ran a quick test comparing a recursive scandir (as in the blog post) with glob, and the two performed roughly the same. Closing for now, but we may revisit this in the future.
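For anyone who wants to repeat this kind of comparison, here is a minimal sketch. It is not the test that was actually run, and the names `scandir_walk`, `iglob_walk`, and `time_walk` are hypothetical. It assumes a plain directory tree: the two walkers can differ on hidden files (glob skips dotfiles by default) and on symlinked directories.

```python
import glob
import os
import time


def scandir_walk(root):
    """Yield the path of every non-directory entry under `root`."""
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                # Recurse into real subdirectories only.
                yield from scandir_walk(entry.path)
            else:
                yield entry.path


def iglob_walk(root):
    """Yield the same paths using a recursive glob pattern."""
    # glob.escape protects any special characters in the root path itself.
    pattern = os.path.join(glob.escape(root), "**", "*")
    for path in glob.iglob(pattern, recursive=True):
        # The pattern also matches directories, so filter them out.
        # Note: this isdir check costs an extra stat() per path.
        if not os.path.isdir(path):
            yield path


def time_walk(walk, root):
    """Return (number of paths, elapsed seconds) for one full crawl."""
    start = time.perf_counter()
    count = sum(1 for _ in walk(root))
    return count, time.perf_counter() - start
```

Calling `time_walk(scandir_walk, root)` and `time_walk(iglob_walk, root)` on the same dataset gives a rough comparison. Results depend heavily on the filesystem and on whether the directory cache is warm, so each walker should be run several times.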