codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
13.89k stars 2.1k forks source link

Replace os.listdir by os.scandir in utils.get_available_languages #942

Closed tgrandje closed 1 year ago

tgrandje commented 2 years ago

I ran into an os.error (Too many open files) while parsing files on a debian machine :

  File "/home/blablah/.cache/pypoetry/virtualenvs/adoc-H6ni72Z6-py3.10/lib/python3.10/site-packages/newspaper/article.py", line 214, in parse
  File "/home/blablah/.cache/pypoetry/virtualenvs/adoc-H6ni72Z6-py3.10/lib/python3.10/site-packages/newspaper/article.py", line 495, in set_meta_language
  File "/home/blablah/.cache/pypoetry/virtualenvs/adoc-H6ni72Z6-py3.10/lib/python3.10/site-packages/newspaper/utils.py", line 346, in get_available_languages
OSError: [Errno 24] Too many open files: '/home/blablah/.cache/pypoetry/virtualenvs/adoc-H6ni72Z6-py3.10/lib/python3.10/site-packages/newspaper/resources/text'

Using help on os.listdir returns : "[...] On some platforms, path may also be specified as an open file descriptor; the file descriptor must refer to a directory. [...]"

The documentation also states that "[t]he scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.".

I ran a benchmark with timeit between the old code and the new one ; they are pretty much the same (ie. 67.3 µs ± 15.6 µs and 64.2 µs ± 12.8 µs resp.).

I'm not totally sure why os is trying to open files while using the listdir command, but switching to scandir should fix that behaviour (according to the documentation...). to fix the exception.

tgrandje commented 1 year ago

Well, I should have tested this beforehand. My issue (ie OSError: [Errno 24] Too many open files) is still happening and scandir doesn't change a thing. I'll close my PR for now.