lych77 opened this issue 6 years ago
I'm noticing the same thing. I have a 240 MB file that I'm trying to read in, and I've tried three different approaches: first, the built-in open function with an asyncio.sleep call after each line is read; second, aiofiles; third, the built-in open function with no async at all. Here's the code:
```python
import asyncio
import logging

import aiofiles

logging.basicConfig(format='%(asctime)s %(message)s')
logging.getLogger().setLevel(level=logging.INFO)


async def async_read_with_aiofiles(file_name: str):
    line_count = 0
    logging.info('starting async aiofiles')
    async with aiofiles.open(file=file_name, mode='r', newline='\r\n') as file:
        async for _ in file:
            line_count += 1
    logging.info(f'read {line_count} lines with async aiofiles')


async def async_read_with_stdio(file_name: str):
    line_count = 0
    logging.info('starting async stdio')
    with open(file=file_name, mode='r', newline='\r\n') as file:
        for _ in file:
            line_count += 1
            await asyncio.sleep(0)
    logging.info(f'read {line_count} lines with async stdio')


def sync_read_with_stdio(file_name: str):
    line_count = 0
    logging.info('starting sync stdio')
    with open(file=file_name, mode='r', newline='\r\n') as file:
        for _ in file:
            line_count += 1
    logging.info(f'read {line_count} lines with sync stdio')


loop = asyncio.get_event_loop()
```
Here are the results of running each of the three options on a ~240 MB file:
```
>>> sync_read_with_stdio(file_name='large_file')
2018-11-28 11:04:21,546 starting sync stdio
2018-11-28 11:04:22,935 read 2769652 lines with sync stdio
>>> loop.run_until_complete(async_read_with_stdio(file_name='large_file'))
2018-11-28 11:06:05,393 starting async stdio
2018-11-28 11:06:57,863 read 2769652 lines with async stdio
>>> loop.run_until_complete(async_read_with_aiofiles(file_name='large_file'))
2018-11-28 11:07:14,362 starting async aiofiles
2018-11-28 11:14:26,550 read 2769652 lines with async aiofiles
```
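For scale: the `await asyncio.sleep(0)` after every line forces a full pass through the event loop per line, which by itself accounts for most of the slowdown in the middle variant. A stdlib-only sketch of that per-iteration cost (no file IO involved; `N` is arbitrary):

```python
import asyncio
import time

N = 50_000  # iterations; stands in for lines in the file

def sync_count():
    # Plain loop: never touches the event loop.
    count = 0
    for _ in range(N):
        count += 1
    return count

async def async_count():
    # Same loop, but await asyncio.sleep(0) yields to the event loop
    # on every iteration, just like the per-line sleep above.
    count = 0
    for _ in range(N):
        count += 1
        await asyncio.sleep(0)
    return count

t0 = time.perf_counter()
sync_total = sync_count()
sync_elapsed = time.perf_counter() - t0

t0 = time.perf_counter()
async_total = asyncio.run(async_count())
async_elapsed = time.perf_counter() - t0

print(f"sync:  {sync_total} iterations in {sync_elapsed:.4f}s")
print(f"async: {async_total} iterations in {async_elapsed:.4f}s")
```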
BTW, I'm on exactly the same platform and versions as @lych77.
Both SelectorEventLoop and ProactorEventLoop show the same behavior: it's just plain slow. I used the code from @lych77.
```
MacOS / Python-3.7.3 / aiofiles-0.4.0 / SelectorEventLoop
traditional: 0.001899950000000039
aiofiles:    1.288835948

RHEL-7.6 / Python-3.7.3 / aiofiles-0.4.0 / SelectorEventLoop
traditional: 0.0033165980130434036
aiofiles:    2.984602615237236
```
I have the same problem with aiofiles.
Here is my simple benchmark, with results on Linux with uvloop:
```python
import time
import asyncio

import aiofiles
import uvloop

uvloop.install()


async def _async_file_reader(file_name):
    async with aiofiles.open(file_name, 'r') as fp:
        async for line in fp:
            yield line


async def _async_write_file(file_name, lines):
    async with aiofiles.open(file_name, 'w') as fp:
        for line in lines:
            await fp.write(f"{line}\n")
        await fp.flush()


def _file_reader(file_name):
    with open(file_name, 'r') as fp:
        for line in fp:
            yield line


def _write_file(file_name, lines):
    with open(file_name, 'w') as fp:
        for line in lines:
            fp.write(f"{line}\n")
        fp.flush()


async def async_main(file_name):
    ts = time.time()
    lines = []
    async for line in _async_file_reader(file_name):
        lines.append(line)
    te = time.time() - ts
    print(f'ASYNC FILE READING: {te:.2f} sec')

    ts = time.time()
    await _async_write_file('/tmp/bbb.txt', lines)
    te = time.time() - ts
    print(f'ASYNC FILE WRITING: {te:.2f} sec')


def main(file_name):
    ts = time.time()
    lines = []
    for line in _file_reader(file_name):
        lines.append(line)
    te = time.time() - ts
    print(f'FILE READING: {te:.2f} sec')

    ts = time.time()
    _write_file('/tmp/aaa.txt', lines)
    te = time.time() - ts
    print(f'FILE WRITING: {te:.2f} sec')


if __name__ == '__main__':
    file_name = 'lines.txt'
    main(file_name)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(async_main(file_name))
    loop.close()
```
```
# file with 10K lines
FILE READING: 0.01 sec
FILE WRITING: 0.06 sec
ASYNC FILE READING: 2.90 sec
ASYNC FILE WRITING: 2.68 sec
```
In my real code, aiofiles is ~14000x slower than non-async IO when I use `async for` in a reader -> chunker -> concurrent_workers pipeline.
The problem may be in this code, which is called for every single line when we use `async for`:
https://github.com/Tinche/aiofiles/blob/cba6910a491f585f2dc4a87e215f5f52ebde6f48/aiofiles/threadpool/utils.py#L31-L37
Therefore, I think `async for` is not applicable here; `read()` should be used instead.
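Following that suggestion, here is a sketch of the single-`read()` approach built on only the standard library's default executor (the same dispatch mechanism aiofiles uses internally); `read_lines_one_hop` and the temporary file are illustrative, not part of any API:

```python
import asyncio
import os
import tempfile

async def read_lines_one_hop(path):
    # One executor dispatch for the whole file instead of one per line:
    # the blocking read() runs in the default thread pool, and the line
    # splitting then happens in-process with no further dispatches.
    loop = asyncio.get_running_loop()

    def _read():
        with open(path, "r") as fp:
            return fp.read()

    text = await loop.run_in_executor(None, _read)
    return text.splitlines(keepends=True)

# Demo on a small temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one\ntwo\nthree\n")
    path = tmp.name

lines = asyncio.run(read_lines_one_hop(path))
os.remove(path)
print(lines)  # ['one\n', 'two\n', 'three\n']
```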
The reason for this issue is that aiofiles starts a new task in the executor for each line, which is a huge overhead just to read one line:

```python
loop.run_in_executor(executor, file.readline)
```
The code above is just for demonstration; the original code is here:
https://github.com/Tinche/aiofiles/blob/d85a65f1c7c84a98c979e19c12fe1d2f642b9cec/aiofiles/threadpool/text.py#L6-L7
https://github.com/Tinche/aiofiles/blob/145107595fe72b877005fb4ebf9d77ffb754689a/aiofiles/threadpool/utils.py#L5-L10
https://github.com/Tinche/aiofiles/blob/145107595fe72b877005fb4ebf9d77ffb754689a/aiofiles/threadpool/utils.py#L31-L37
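That per-call dispatch cost is easy to see without aiofiles at all: submitting many tiny calls to a thread pool one at a time versus calling them directly. The trivial `tiny()` stands in for `file.readline`; absolute numbers vary by machine:

```python
import time
from concurrent.futures import ThreadPoolExecutor

N = 5_000  # number of "lines"

def tiny():
    # Stand-in for file.readline(): the work is negligible, so the
    # measurement is dominated by the per-call dispatch overhead.
    return 1

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    pooled = sum(pool.submit(tiny).result() for _ in range(N))
    pooled_elapsed = time.perf_counter() - t0

t0 = time.perf_counter()
direct = sum(tiny() for _ in range(N))
direct_elapsed = time.perf_counter() - t0

print(f"via executor: {pooled} calls in {pooled_elapsed:.4f}s")
print(f"direct:       {direct} calls in {direct_elapsed:.4f}s")
```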
Thanks @MrMrRobat for investigating.
Reading binary files also seems to be super slow. I'm streaming files from a VM to a server on the host OS and only get ~1-1.6 Mbit/s.
Windows 7 64-bit, Python 3.7.0 64-bit, aiofiles 0.4.0
Test program:
Output:
The sample file is not very big (less than 1 MB), but it contains rather too many lines. So is the result caused by excessive thread context switches? Can this be avoided somehow?
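One possible way to avoid it, sketched with only the standard library: this mirrors what aiofiles does internally, but fetches a large chunk per thread dispatch instead of a single line, so the dispatch cost is amortized over thousands of lines. `CHUNK`, `iter_lines_chunked`, and the temporary file are illustrative, not part of any API:

```python
import asyncio
import os
import tempfile

CHUNK = 1 << 16  # 64 KiB per executor dispatch, instead of one line each

async def iter_lines_chunked(path):
    # Async generator: each thread-pool dispatch reads a large chunk,
    # then lines are split off in-process with no further dispatches.
    loop = asyncio.get_running_loop()
    with open(path, "r") as fp:
        buf = ""
        while True:
            chunk = await loop.run_in_executor(None, fp.read, CHUNK)
            if not chunk:
                break
            buf += chunk
            *complete, buf = buf.split("\n")
            for line in complete:
                yield line
        if buf:
            yield buf  # trailing line with no final newline

async def collect(path):
    return [line async for line in iter_lines_chunked(path)]

# Demo on a small temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc")
    path = tmp.name

lines = asyncio.run(collect(path))
os.remove(path)
print(lines)  # ['a', 'b', 'c']
```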