Closed. Deninc closed this issue 4 days ago.
I've also changed these settings, but the benchmark stays around the same.
#[data_writer]
#buffer_max_items=100000
#file_max_items=100000
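For reference, a minimal sketch of where these settings normally live, assuming the standard .dlt/config.toml layout from the dlt performance docs; note that lines starting with # are TOML comments, so the values above only take effect once uncommented:
# .dlt/config.toml (values are illustrative)
[data_writer]
buffer_max_items=100000
file_max_items=100000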
@Deninc are you able to post one item from your json file? Disabling deduplication should make runs faster and decrease the memory usage. Is it possible that "event_date" is not very granular, i.e. you have millions of records with the same date?
Btw, do batching for better performance: https://dlthub.com/docs/reference/performance#yield-pages-instead-of-rows. If your json is well formed and not nested you may try to parse it with pyarrow or duckdb and yield arrow batches instead; then you get maybe a 30x or 100x speedup...
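A minimal sketch of that pyarrow approach, assuming the file is flat, newline-delimited JSON; the resource name, file name, and pipeline setup (jsonl_events, test.jsonl, duckdb) are illustrative, and dlt's Arrow support is what lets the record batches be loaded directly:
import dlt
from pyarrow import json as paj

@dlt.resource(name="event")
def jsonl_events():
    # read_json parses the whole JSONL file into an Arrow table in one go,
    # so this trades memory for speed; yield it back out in smaller batches
    table = paj.read_json("test.jsonl")
    for batch in table.to_batches(max_chunksize=100_000):
        yield batch

pipeline = dlt.pipeline(pipeline_name="events_pipeline", destination="duckdb")
pipeline.run(jsonl_events())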
Hi @rudolfix yes, basically for this dataset all event_date values are the same. The API I'm loading from accepts from_date and to_date only, so I'm doing a daily batch merge (delete-insert).
"disabling deduplication should make runs faster and decrease the memory usage"
Here it actually increases the memory usage significantly; I'm not sure why?
@rudolfix I can confirm that using datetime instead of date solves the issue.
event_time=dlt.sources.incremental(
    "time",
    initial_value=datetime.fromisoformat("2024-08-17T00:00:00Z"),
    primary_key=(),
),
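For context, here is a sketch of how a kwarg like this usually sits inside a full resource definition; the merge_key, resource name, and fetch_events call are assumptions standing in for the daily delete-insert setup described above, not the actual code:
from datetime import datetime, timezone

import dlt

# hypothetical delete-insert merge keyed on the day being reloaded
@dlt.resource(name="event", write_disposition="merge", merge_key="event_date")
def event(
    event_time=dlt.sources.incremental(
        "time",
        initial_value=datetime.fromisoformat("2024-08-17T00:00:00Z"),  # trailing "Z" needs Python 3.11+
        primary_key=(),  # disables boundary deduplication
    ),
):
    # fetch_events is a placeholder for the from_date/to_date API call
    yield from fetch_events(
        from_date=event_time.last_value,
        to_date=datetime.now(timezone.utc),
    )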
------------------------------ Extract benchmark -------------------------------
Resources: 1/1 (100.0%) | Time: 65.46s | Rate: 0.02/s
Memory usage: 89.05 MB (30.40%) | CPU usage: 0.00%
Update: the above benchmark was wrong. I used initial_value=datetime.fromisoformat("2024-08-17T00:00:00Z"), which is a future date.
The correct benchmark is here:
------------------------------ Extract benchmark -------------------------------
Resources: 0/1 (0.0%) | Time: 121.90s | Rate: 0.00/s
event: 2745596 | Time: 121.90s | Rate: 22524.05/s
Memory usage: 4486.80 MB (52.70%) | CPU usage: 0.00%
@Deninc I think we'll disable boundary deduplication by default in the next major release.
dlt version
0.5.3
Describe the problem
I've found that the extraction phase hogs memory if I enable dlt.sources.incremental with primary_key=().
Expected behavior
I'm not sure if this is a bug. Is there a way I can limit the memory usage?
Steps to reproduce
My test is with a test.jsonl file of 2.76 million rows, around 3.66 GB in size. In the first case the memory usage is low (179.00 MB), but it takes forever to run (rate: 33.07/s). After that I add primary_key=() to disable deduplication. It runs much faster (rate: 20345.09/s), but now the memory usage is too high (12208.89 MB).
Operating system
macOS
Runtime environment
Local
Python version
3.11
dlt data source
No response
dlt destination
No response
Other deployment details
No response
Additional information
No response