dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

High memory usage with incremental #1710

Closed · Deninc closed this 4 days ago

Deninc commented 3 weeks ago

dlt version

0.5.3

Describe the problem

I've found that the extraction phase hogs memory when I enable dlt.sources.incremental together with primary_key=().

Expected behavior

I'm not sure if this is a bug. Is there a way I can limit the memory usage?

Steps to reproduce

My test is with a test.jsonl file of 2.76 million rows, around 3.66GB in size.

# .dlt/config.toml
[data_writer]
buffer_max_items=100000
file_max_items=100000

import json
from datetime import date

import dlt


@dlt.resource(
    standalone=True,
    table_name="event",
    write_disposition="merge",
    merge_key="event_date",
    max_table_nesting=0,
    # primary_key=(),
)
def event_resource(
    event_time=dlt.sources.incremental(
        "event_date",
        initial_value=date.fromisoformat("2023-04-01"),
    ),
):
    with open('test.jsonl', 'r') as file:
        for line in file:
            yield json.loads(line)

pipeline = dlt.pipeline(
    pipeline_name="benchmark",
    destination="filesystem",
    dataset_name="benchmark",
    progress="log",
)
resource = event_resource()
pipeline.extract(resource)

In the first case, the memory usage is low (179.00 MB), but it takes forever to run (rate: 33.07/s).

------------------------------ Extract benchmark -------------------------------
Resources: 0/1 (0.0%) | Time: 398.74s | Rate: 0.00/s
event: 13188  | Time: 398.73s | Rate: 33.07/s
Memory usage: 179.00 MB (37.30%) | CPU usage: 0.00%

After that I added primary_key=() to disable deduplication. It runs much faster (rate: 20345.09/s), but now the memory usage is far too high (12208.89 MB).

------------------------------ Extract benchmark -------------------------------
Resources: 1/1 (100.0%) | Time: 135.88s | Rate: 0.01/s
event: 2764522  | Time: 135.88s | Rate: 20345.09/s
Memory usage: 12208.89 MB (64.80%) | CPU usage: 0.00%

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

No response

dlt destination

No response

Other deployment details

No response

Additional information

No response

Deninc commented 3 weeks ago

I've also changed these settings (commented out below), but the benchmark stays around the same.

#[data_writer]
#buffer_max_items=100000
#file_max_items=100000

rudolfix commented 3 weeks ago

@Deninc are you able to post one item from your json file? Disabling deduplication should make runs faster and decrease the memory usage. Is it possible that "event_date" is not very granular, i.e. you have millions of records with the same date?

btw, do batching for better performance: https://dlthub.com/docs/reference/performance#yield-pages-instead-of-rows. If your json is well formed and not nested, you may try to parse it with pyarrow or duckdb and yield arrow batches instead; then you get maybe a 30x or 100x speedup...
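
A minimal sketch of the batching idea applied to the reproduction above (the event_resource_paged name and the 10,000-row page size are illustrative, and the decorator arguments are trimmed for brevity): instead of yielding one parsed dict per line, collect rows into a page and yield the whole list.

import json
from datetime import date

import dlt

@dlt.resource(table_name="event", write_disposition="merge", merge_key="event_date")
def event_resource_paged(
    event_time=dlt.sources.incremental(
        "event_date",
        initial_value=date.fromisoformat("2023-04-01"),
        primary_key=(),  # boundary deduplication stays disabled, as in the report
    ),
):
    page = []
    with open("test.jsonl", "r") as file:
        for line in file:
            page.append(json.loads(line))
            if len(page) >= 10_000:
                yield page  # yield pages (lists of dicts) instead of single rows
                page = []
    if page:
        yield page

And a sketch of the pyarrow variant rudolfix mentions (note that pyarrow.json.read_json parses the entire file into memory first, so it trades RAM for speed; for a 3.66GB file the paged generator above is gentler on memory):

import pyarrow.json as paj

import dlt

@dlt.resource(table_name="event", write_disposition="merge", merge_key="event_date")
def event_resource_arrow():
    # parse the newline-delimited JSON with Arrow's native reader
    table = paj.read_json("test.jsonl")
    # yield Arrow record batches; dlt consumes them directly
    yield from table.to_batches(max_chunksize=100_000)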

Deninc commented 3 weeks ago

Hi @rudolfix, yes, basically all event_date values in this dataset are the same. The API I'm loading from accepts from_date and to_date only, so I'm doing a daily batch merge (delete-insert).

disabling deduplication should make runs faster and decrease the memory usage

Here it actually increases the memory usage significantly; I'm not sure why.

Deninc commented 3 weeks ago

@rudolfix I can confirm that using datetime instead of date solves the issue.

event_time=dlt.sources.incremental(
    "time",
    initial_value=datetime.fromisoformat("2024-08-17T00:00:00Z"),
    primary_key=(),
),
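
For context, a minimal sketch of how that change fits into the resource from the original report (the initial_value below is illustrative; it assumes each row carries a full timestamp under the "time" key, which is what the cursor path implies):

import json
from datetime import datetime

import dlt

@dlt.resource(table_name="event", write_disposition="merge", merge_key="event_date")
def event_resource(
    event_time=dlt.sources.incremental(
        "time",  # granular timestamp cursor instead of the day-level "event_date"
        initial_value=datetime.fromisoformat("2023-04-01T00:00:00+00:00"),
        primary_key=(),  # deduplication still disabled
    ),
):
    with open("test.jsonl", "r") as file:
        for line in file:
            yield json.loads(line)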
------------------------------ Extract benchmark -------------------------------
Resources: 1/1 (100.0%) | Time: 65.46s | Rate: 0.02/s
Memory usage: 89.05 MB (30.40%) | CPU usage: 0.00%

Deninc commented 3 weeks ago

Update: the above benchmark was wrong. I used initial_value=datetime.fromisoformat("2024-08-17T00:00:00Z"), which is a future date, so virtually nothing passed the incremental filter.

The correct benchmark is here.

------------------------------ Extract benchmark -------------------------------
Resources: 0/1 (0.0%) | Time: 121.90s | Rate: 0.00/s
event: 2745596  | Time: 121.90s | Rate: 22524.05/s
Memory usage: 4486.80 MB (52.70%) | CPU usage: 0.00%

rudolfix commented 3 weeks ago

@Deninc I think we'll disable boundary deduplication by default in the next major release.

rudolfix commented 3 weeks ago

https://github.com/dlt-hub/dlt/issues/1131