astronomer / apache-airflow-providers-transfers

https://apache-airflow-provider-transfers.rtfd.io/
Apache License 2.0

Add chunking logic in read method #56

Open sunank200 opened 1 year ago

sunank200 commented 1 year ago

Describe the bug: I tried to transfer an 11 GB file (a single zip archive) from S3 to GCS on a worker with 500 MB of memory, and the task was killed because it ran out of memory:

[2023-04-05, 21:03:34 UTC] {local_task_job.py:212} INFO - Task exited with return code Negsignal.SIGKILL

Expected behavior: The read method should load only one chunk into memory at a time. Currently, if a folder contains multiple files, each file is loaded into memory in full. For the case where a single file is very large, we need logic that loads only one chunk at a time.
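To illustrate the expected behavior, here is a minimal sketch of a chunked read, not the provider's actual implementation. The names `read_in_chunks`, `copy_stream`, and `DEFAULT_CHUNK_SIZE` are hypothetical, and the 8 MiB chunk size is an arbitrary assumption. It works on any binary file-like object with a `read(size)` method, so peak memory stays near one chunk regardless of file size.

```python
import io
from typing import BinaryIO, Iterator

# Assumption: 8 MiB is an arbitrary illustrative chunk size, not a tuned value.
DEFAULT_CHUNK_SIZE = 8 * 1024 * 1024


def read_in_chunks(
    stream: BinaryIO, chunk_size: int = DEFAULT_CHUNK_SIZE
) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes from stream."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        yield chunk


def copy_stream(
    source: BinaryIO, destination: BinaryIO, chunk_size: int = DEFAULT_CHUNK_SIZE
) -> int:
    """Copy source to destination one chunk at a time; return bytes copied."""
    total = 0
    for chunk in read_in_chunks(source, chunk_size):
        destination.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the real source and destination objects.
    src = io.BytesIO(b"x" * 25)
    dst = io.BytesIO()
    print(copy_stream(src, dst, chunk_size=10))
```

In a real transfer the source would be a streaming response body from the object store and the destination a resumable or multipart upload; how those are obtained depends on the hooks in use and is left out here.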

kaxil commented 1 year ago

> Currently, if there are multiple files in a folder each file is loaded into memory

Yeah, the entire file shouldn't be loaded into memory. That can be one of the options, but not the only one.

Flow (from fastest path to slowest):

  1. Native path
  2. Stream lines/bytes from source to destination via the worker
  3. "naive" path - where we download all files from source to worker and then upload from worker to destination
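One way to read the flow above is as a fallback chain: try the fastest path first and fall through to the next when a path is not available for a given source and destination pair. The sketch below assumes that reading; the comment itself only ranks the paths by speed. `PathNotSupported`, `transfer`, and the example strategy functions are all hypothetical names, not part of the provider.

```python
from typing import Callable, Sequence, Tuple


class PathNotSupported(Exception):
    """Raised by a strategy that cannot handle this source/destination pair."""


def transfer(strategies: Sequence[Tuple[str, Callable[[], None]]]) -> str:
    """Run strategies in order (fastest first); return the name of the one used."""
    for name, run in strategies:
        try:
            run()
        except PathNotSupported:
            continue  # fall through to the next, slower path
        return name
    raise RuntimeError("no transfer path available")


if __name__ == "__main__":
    def native() -> None:
        # Hypothetical: pretend no native transfer exists for this pair.
        raise PathNotSupported("no native path for this pair")

    def stream() -> None:
        # Hypothetical: stream bytes from source to destination via the worker.
        pass

    def naive() -> None:
        # Hypothetical: download everything to the worker, then upload.
        pass

    used = transfer([("native", native), ("stream", stream), ("naive", naive)])
    print(used)
```

Only `PathNotSupported` triggers a fallback here; any other exception propagates, so a failed transfer is not silently retried on a slower path.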