Freika / dawarich

Self-hosted alternative to Google Location History (Google Maps Timeline)
https://dawarich.app
GNU Affero General Public License v3.0

Importing large Records.json from Takeout fails with no message #142

Open · savvasdalkitsis opened this issue 3 months ago

savvasdalkitsis commented 3 months ago

Describe the bug
Importing my (large) location history from Google Takeout fails with no error message.

Version
freikin/dawarich:latest

Logs

/var/app # bundle exec rake import:big_file['/tmp/Records.json','kurosavvas@gmail.com']
[dotenv] Set DATABASE_PORT
[dotenv] Loaded .env.development
W, [2024-07-29T10:52:30.754729 #126]  WARN -- : DEPRECATION WARNING: `Rails.application.secrets` is deprecated in favor of `Rails.application.credentials` and will be removed in Rails 7.2. (called from <main> at /var/app/config/environment.rb:5)
D, [2024-07-29T10:52:31.562514 #126] DEBUG -- :   User Load (0.5ms)  SELECT "users".* FROM "users" WHERE "users"."email" = $1 LIMIT $2  [["email", "EMAIL@gmail.com"], ["LIMIT", 1]]
D, [2024-07-29T10:52:31.562766 #126] DEBUG -- :   ↳ lib/tasks/import.rake:9:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.644237 #126] DEBUG -- :   TRANSACTION (0.2ms)  BEGIN
D, [2024-07-29T10:52:31.644938 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.645787 #126] DEBUG -- :   Import Create (1.7ms)  INSERT INTO "imports" ("name", "user_id", "source", "created_at", "updated_at", "raw_points", "doubles", "processed", "raw_data") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) RETURNING "id"  [["name", "/tmp/Records.json"], ["user_id", 1], ["source", 2], ["created_at", "2024-07-29 09:52:31.642947"], ["updated_at", "2024-07-29 09:52:31.642947"], ["raw_points", 0], ["doubles", 0], ["processed", 0], ["raw_data", nil]]
D, [2024-07-29T10:52:31.646293 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
D, [2024-07-29T10:52:31.647681 #126] DEBUG -- :   TRANSACTION (1.1ms)  COMMIT
D, [2024-07-29T10:52:31.647951 #126] DEBUG -- :   ↳ lib/tasks/import.rake:13:in `block (2 levels) in <main>'
"Importing /tmp/Records.json for EMAIL@gmail.com, file size is 1443919290... This might take a while, have patience!"
Killed
savvasdalkitsis commented 3 months ago

(screenshot attached)

berger321 commented 3 months ago

same is happening to me, keeps crashing my unraid server

Freika commented 3 months ago

Most likely, you're running out of memory. Consider giving more resources to Dawarich.
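
For anyone who wants to confirm that theory, a quick sketch (the container name dawarich_app is an assumption; check docker ps for yours):

# Run on the Docker host: a bare "Killed" with no Ruby backtrace usually means
# the kernel OOM killer (or a cgroup memory cap) ended the process.
dmesg | grep -iE 'out of memory|killed process' | tail

# If the container does have a memory cap, one option is to raise it in place:
docker update --memory 8g --memory-swap 8g dawarich_app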

savvasdalkitsis commented 3 months ago

I am running this on a server with 64GB RAM with no memory restrictions on the docker deployment so I doubt it's that to be honest.

Especially since the job gets killed almost immediately after it starts (for me at least)

savvasdalkitsis commented 3 months ago

is there any way to enable more verbose logging to troubleshoot this further?
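
Not a logging switch, but one way to narrow it down is to watch the container's memory live while the rake task runs; a sketch (container name is an assumption):

# In a second terminal on the host. If MEM USAGE climbs steadily and the process
# dies right as it hits a ceiling, memory is the culprit rather than a crash.
docker stats dawarich_app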

savvasdalkitsis commented 3 months ago

In the meantime, if anyone else has this problem, this is how I worked around it.

I split the giant Records.json file into multiple smaller ones using this script:

#!/bin/bash

input_file="Records.json"        # The input JSON file
output_prefix="smaller_array"    # The prefix for output files
chunk_size=100000                # Number of elements per smaller array

# Get the total number of elements in the 'locations' array
total_elements=$(jq '.locations | length' "$input_file")

# Loop through the array and split it into chunks
for ((i=0; i<total_elements; i+=chunk_size)); do
    start_index=$i
    end_index=$((i + chunk_size - 1))
    output_file="${output_prefix}_$((i / chunk_size + 1)).json"

    # Extract the chunk and save it into a new JSON file with the same structure
    jq "{locations: .locations[$start_index:$end_index + 1]}" "$input_file" > "$output_file"
    echo "Created $output_file"
done

And I imported each file individually. It takes a lot longer and is manual but hey
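
For reference, the per-file imports can be scripted with a loop like the following sketch (paths and the email are placeholders based on the rake command at the top of the issue, and it assumes the chunk files were copied into /tmp inside the app container):

# Run inside the dawarich_app container, from /var/app.
for f in /tmp/smaller_array_*.json; do
  bundle exec rake "import:big_file[$f,EMAIL@gmail.com]"
done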

Freika commented 2 months ago

@savvasdalkitsis this is awesome, thank you! I'll consider using this approach to automatically split huge files during the import process.

hartmark commented 1 month ago

This worked perfectly for me, so getting dawarich to do this automagically behind the scenes would be awesome.

hartmark commented 1 month ago

My 1.4GB Records.json peaked at around 8GB of memory used by jq while running the script. Thankfully I have 32GB on my server, but here's a small heads-up for future googlers :)
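
A possible way around that peak: jq's --stream mode parses the file incrementally instead of slurping it, which should keep memory roughly flat. An untested sketch, assuming the top level is {"locations": [...]} as in the script above:

# 1. Emit one location object per line (NDJSON) without loading the whole file.
jq -cn --stream 'fromstream(2|truncate_stream(inputs))' Records.json > locations.ndjson

# 2. Split into chunks of 100000 records each.
split -l 100000 -d locations.ndjson chunk_

# 3. Wrap each chunk back into the {"locations": [...]} shape the importer expects;
#    jq -s here only ever holds one chunk in memory.
for f in chunk_*; do
  jq -s '{locations: .}' "$f" > "$f.json"
done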

jonhnet commented 1 month ago

Confirming: I just ran into this problem importing my 2GB Records.json. The bundle process terminates with the stoic message "Killed" after 59 seconds; at that point top (running inside docker) shows it using 4002572 VIRT (2.0g RES). top also shows "MiB Mem : 19523.4 total", suggesting the memory limit is on the bundle process, not the docker machine.

I didn't see any suspiciously limited resources in the process limits:

/var/app # cat /proc/228/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1048576              1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       77636                77636                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        

Anyway, I'm going to give @savvasdalkitsis 's strategy a try.

(I wonder whether re-attempting the imports repeatedly is going to create a database full of duplicate points... I guess that's a problem for another day. :smiley: )
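
One caveat on that limits listing: a container's memory cap is a cgroup setting, which never shows up in /proc/<pid>/limits. A sketch of where to look instead, from inside the container:

# cgroup v2 first, falling back to cgroup v1; "max" (or an absurdly large
# number) means there is no container-level memory cap.
cat /sys/fs/cgroup/memory.max 2>/dev/null \
  || cat /sys/fs/cgroup/memory/memory.limit_in_bytes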

Freika commented 1 month ago

> (I wonder whether re-attempting the imports repeatedly is going to create a database full of duplicate points... I guess that's a problem for another day. 😃 )

@jonhnet existing points won't be doubled, that is taken care of :)

jonhnet commented 1 month ago

Better still, I see that the import jobs are identified after the import is complete in the Imports panel, suggesting that there's metadata in the db that could be used to clean things up. "Luckily", the failures have all resulted in 0-point imports, so it's a non-event.

jonhnet commented 1 month ago

Here's my take on @savvasdalkitsis 's chunk-ifying script. splitter.py.txt

hartmark commented 1 month ago

> Here's my take on @savvasdalkitsis 's chunk-ifying script. splitter.py.txt

Cool, thanks for your updated version, which also generates a nice script to add the files to dawarich.

My Python is quite noobish, but you should probably flush the file in the while loop in case you want to see its progress.

I also noticed that the sizes of the chunks are a bit uneven. Why is 001.json so much bigger?

% dawarich-splitter.py     
loaded record_count=2088720
Writing chunk-1000000-000.json
Writing chunk-1000000-001.json
Writing chunk-1000000-002.json
dawarich-splitter.py  138.83s user 7.53s system 98% cpu 2:29.16 total
% ls -lh *.json
-rw-r--r-- 1 markus markus 1.4G Sep  8 02:26 Records.json
-rw-r--r-- 1 markus markus 533M Sep  8 23:25 chunk-1000000-000.json
-rw-r--r-- 1 markus markus 1.1G Sep  8 23:27 chunk-1000000-001.json
-rw-r--r-- 1 markus markus 194M Sep  8 23:27 chunk-1000000-002.json
jonhnet commented 1 month ago

Chunks are divided by count of records. The records vary wildly in length depending on what Google decided to tuck into them that day. :)

jonhnet commented 1 month ago

Maybe a better script would count up to some total length, since the import process seems to die due to overall memory allocation.
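
If the records are already one per line (NDJSON, as in the jq --stream sketch earlier in the thread), GNU split can do that size-based chunking without a new script; a sketch:

# -C caps the bytes per output file while keeping whole lines intact, so chunks
# come out roughly equal in size rather than in record count.
split -C 200m locations.ndjson chunk_
for f in chunk_*; do
  jq -s '{locations: .}' "$f" > "$f.json"
done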

hartmark commented 1 month ago

It seems it just swallowed the 1.1GB file so I guess all is good now :)

One small thing the script should also do is remove the uploaded files from the Docker volume, so they don't take up storage after being imported.

I'll let it run now for the night and see how it worked tomorrow.

jonhnet commented 1 month ago

@hartmark good point on cleaning up the mess after. I haven't actually finished running the script over here yet. :v)

My first chunk is 1M records. It completed the bundle step in ~1 hour (I didn't measure), but the sidekiq processing seems really slow. It has been 18 hours and I have 273k items processed, with another million in the queue. I think that's because each point creates two work items serially, so I'm 273k/2M of the way through the job. That means this one chunk is going to take 6 days, and I have three more where that came from.

Is this expected performance for import? I can imagine improving it is a low priority, since it only happens once...

Sidekiq shows about 50-60 tasks completing per 10s polling interval. I do notice that the docker container is only running at ~10% CPU, suggesting that admitting more threads might make things a lot faster. I think I'll try docker compose down and bumping BACKGROUND_PROCESSING_CONCURRENCY from 10 to 100.
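
For anyone copying this, a sketch of the restart step (the service name dawarich_sidekiq is an assumption; BACKGROUND_PROCESSING_CONCURRENCY is the variable mentioned above):

# After editing BACKGROUND_PROCESSING_CONCURRENCY in docker-compose.yml (or .env),
# recreate the Sidekiq container so it picks up the new value.
docker compose up -d --force-recreate dawarich_sidekiq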

jonhnet commented 1 month ago

Oh, shame on me, this topic is covered in the FAQ. I tried making 4 extra sidekiq containers, and now it looks like importing 1M records will take about a day. I guess I'll crank it up to 15 to grind through the next 4M records.

I note that if I docker compose stop dawarich, it loses track of the enqueued work items. I'm not sure if that's a bug or desired behavior. I guess the concern might be that a poorly timed shutdown (say to upgrade) might silently leave some recently-uploaded data unprocessed and hence invisible.
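
One way to check whether enqueued jobs actually survive a stop/start: Sidekiq keeps each queue as a Redis list, so its depth can be read directly. A sketch (the Redis service name and the queue name are assumptions):

# Run before and after the restart; a drop to zero means the backlog was lost
# rather than merely paused.
docker compose exec dawarich_redis redis-cli llen queue:default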

jonhnet commented 1 month ago

Here's a better version of my splitter script for enormous Google Takeout Records.json files. The main improvement is that it parses the input incrementally so this script itself doesn't encounter a memory bottleneck.

splitter.py.txt (well, hold off on using this; I sent it before it finished, and my copy broke on one of my Records. I'll update.)

hartmark commented 1 month ago

Nice to have an updated script, but I have now been able to import my Google data.

It looks like quite a lot of the items in the queue fail, but there are no retries. I can only see 8 hours back in time, but it seems to be just ReverseGeocodingJobs that fail. Will the jobs be retried somehow, or how does it work?

(screenshot attached)
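
On the retry question: one way to see what Sidekiq intends to do with the failed ReverseGeocodingJobs is to peek at its retry and dead sets from inside the app container. A sketch using Sidekiq's public API; jobs in the retry set get re-run on a backoff schedule, jobs in the dead set do not:

# Prints how many jobs are scheduled for retry vs. permanently dead.
bundle exec rails runner 'require "sidekiq/api"; puts Sidekiq::RetrySet.new.size; puts Sidekiq::DeadSet.new.size'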