Open savvasdalkitsis opened 3 months ago
Same is happening to me; it keeps crashing my unraid server.
Most likely, you're running out of memory. Consider giving more resources to Dawarich
I am running this on a server with 64GB RAM with no memory restrictions on the docker deployment so I doubt it's that to be honest.
Especially since the job gets killed almost immediately after it starts (for me at least)
Is there any way to enable more verbose logging to troubleshoot this further?
In the meantime, if anyone else has this problem, this is how I worked around it.
I split the giant Records.json file into multiple smaller ones using this script:
#!/bin/bash
input_file="Records.json"       # The input JSON file
output_prefix="smaller_array"   # The prefix for output files
chunk_size=100000               # Number of elements per smaller array

# Get the total number of elements in the 'locations' array
total_elements=$(jq '.locations | length' "$input_file")

# Loop through the array and split it into chunks
for ((i=0; i<total_elements; i+=chunk_size)); do
    start_index=$i
    end_index=$((i + chunk_size - 1))
    output_file="${output_prefix}_$((i / chunk_size + 1)).json"

    # Extract the chunk and save it into a new JSON file with the same structure
    jq "{locations: .locations[$start_index:$end_index + 1]}" "$input_file" > "$output_file"
    echo "Created $output_file"
done
And I imported each file individually. It takes a lot longer and is manual but hey
@savvasdalkitsis this is awesome, thank you! I'll consider using this approach to automatically split huge files during the import process.
This worked perfectly for me, so getting dawarich to do this automagically behind the scenes would be awesome.
My 1.4GB Records.json peaked at around 8GB of memory used by jq while running the script. Thankfully I have 32GB on my server, but a small heads-up for future googlers :)
Confirming: I just ran into this problem importing my 2GB Records.json. The bundle process terminates with the stoic message "Killed" after 59 seconds; at that point, top (running inside Docker) shows it using 4002572 VIRT (2.0g RES). top also shows "MiB Mem : 19523.4 total", suggesting the memory limit is on the bundle process, not the Docker machine.
I didn't see any suspiciously limited resources in the process limits:
/var/app # cat /proc/228/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       77636                77636                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Anyway, I'm going to give @savvasdalkitsis 's strategy a try.
(I wonder whether re-attempting the imports repeatedly is going to create a database full of duplicate points... I guess that's a problem for another day. 😃 )
@jonhnet existing points won't be doubled, that is taken care of :)
Better still, I see that the import jobs are identified after the import is complete in the Imports panel, suggesting that there's metadata in the db that could be used to clean things up. "luckily", the failures have all resulted in 0-point imports, so it's a non-event.
Here's my take on @savvasdalkitsis 's chunk-ifying script. splitter.py.txt
Cool, thanks for your updated version, which generates a nice script to just add them to dawarich as well.
My Python is quite noobish, but you should probably flush the file in the while loop in case you want to see its progress (see the small sketch after the output below).
I also noticed that the sizes of the chunks are a bit uneven. Why is 001.json so much bigger?
% dawarich-splitter.py
loaded record_count=2088720
Writing chunk-1000000-000.json
Writing chunk-1000000-001.json
Writing chunk-1000000-002.json
dawarich-splitter.py 138.83s user 7.53s system 98% cpu 2:29.16 total
% ls -lh *.json
-rw-r--r-- 1 markus markus 1.4G Sep 8 02:26 Records.json
-rw-r--r-- 1 markus markus 533M Sep 8 23:25 chunk-1000000-000.json
-rw-r--r-- 1 markus markus 1.1G Sep 8 23:27 chunk-1000000-001.json
-rw-r--r-- 1 markus markus 194M Sep 8 23:27 chunk-1000000-002.json
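For what it's worth, the flushing idea is just a call to flush() inside the write loop. A toy, hypothetical sketch of the idea (the file names and record structure here are illustrative, not the actual splitter.py):

import json
import time

# Toy records standing in for Google Takeout location entries
records = [{"latitudeE7": i, "longitudeE7": i} for i in range(5)]

with open("flush-demo.json", "w") as out:
    out.write('{"locations": [')
    for i, record in enumerate(records):
        if i:
            out.write(",")
        out.write(json.dumps(record))
        out.flush()    # push buffered output to disk so `ls -l` shows the file growing
        time.sleep(1)  # only here to make the growth observable in the demo
    out.write("]}")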
Chunks are divided by count of records. The records vary wildly in length depending on what Google decided to tuck into them that day. :)
Maybe a better script would count up to some total length, since the import process seems to die due to overall memory allocation.
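If anyone wants to try that, a rough sketch of size-based chunking could look like the following (hypothetical file names and size cap; this version still loads the whole file in one go, so it needs enough RAM for a single parse):

import json

INPUT = "Records.json"
MAX_CHUNK_BYTES = 200 * 1024 * 1024  # cap each chunk at roughly 200 MB of record text

with open(INPUT) as f:
    locations = json.load(f)["locations"]

chunk, chunk_bytes, index = [], 0, 0
for record in locations:
    record_bytes = len(json.dumps(record))
    # Start a new chunk once adding this record would exceed the size cap
    if chunk and chunk_bytes + record_bytes > MAX_CHUNK_BYTES:
        with open(f"chunk-{index:03d}.json", "w") as out:
            json.dump({"locations": chunk}, out)
        print(f"Wrote chunk-{index:03d}.json ({len(chunk)} records)")
        chunk, chunk_bytes, index = [], 0, index + 1
    chunk.append(record)
    chunk_bytes += record_bytes

if chunk:  # write the final partial chunk
    with open(f"chunk-{index:03d}.json", "w") as out:
        json.dump({"locations": chunk}, out)
    print(f"Wrote chunk-{index:03d}.json ({len(chunk)} records)")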
It seems it just swallowed the 1.1GB file so I guess all is good now :)
One small thing the script should also do is remove the uploaded files from the docker volume so they don't take up storage after being imported.
I'll let it run now for the night and see how it worked tomorrow.
@hartmark good point on cleaning up the mess after. I haven't actually finished running the script over here yet. :v)
My first chunk is 1M records. It completed the bundle step in ~1 hour (I didn't measure), but the Sidekiq processing seems really slow. It has been 18 hours; I have 273k items processed and another million in the queue. I think that's because each point creates two work items serially, so I'm 273k/2M of the way through the job. That means this one chunk is going to take 6 days, and I have three more where that came from.
Is this expected performance for import? I can imagine improving it is a low priority, since it only happens once...
Sidekiq shows about 50-60 tasks completing per 10s polling interval. I do notice that the docker container is only running at ~10% CPU, suggesting that admitting more threads might make things a lot faster. I think I'll try a docker compose down and bump BACKGROUND_PROCESSING_CONCURRENCY from 10 to 100.
Oh shame on me, this topic is covered in the faq. I tried making 4 extra sidekiq containers, and now it looks like importing 1M records will take about a day. I guess I'll crank it up to 15 to grind through the next 4M records.
I note that if I docker compose stop dawarich, it loses track of the enqueued work items. I'm not sure if that's a bug or desired behavior. I guess the concern might be that a poorly timed shutdown (say to upgrade) might silently leave some recently-uploaded data unprocessed and hence invisible.
Here's a better version of my splitter script for enormous Google Takeout Records.json files. The main improvement is that it parses the input incrementally so this script itself doesn't encounter a memory bottleneck.
splitter.py.txt (well, hold off on using this; I sent it before it finished, and my copy broke on one of my Records. I'll update.)
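For anyone curious what the incremental approach looks like in general, here's a rough sketch (not the attached script) that assumes the third-party ijson package is installed; it streams records out of the top-level "locations" array one at a time instead of loading the whole file into memory:

import json
import ijson  # third-party streaming JSON parser (pip install ijson)

INPUT = "Records.json"
CHUNK_SIZE = 500_000  # records per output file

def write_chunk(chunk, index):
    name = f"chunk-{index:03d}.json"
    with open(name, "w") as out:
        # ijson can yield Decimal for JSON floats; default=float keeps json.dump happy
        json.dump({"locations": chunk}, out, default=float)
    print(f"Wrote {name} with {len(chunk)} records")

chunk, index = [], 0
with open(INPUT, "rb") as f:
    # "locations.item" yields each element of the top-level "locations" array
    for record in ijson.items(f, "locations.item"):
        chunk.append(record)
        if len(chunk) >= CHUNK_SIZE:
            write_chunk(chunk, index)
            chunk, index = [], index + 1

if chunk:
    write_chunk(chunk, index)

Only one chunk is ever held in memory at a time, so the peak footprint is bounded by CHUNK_SIZE rather than by the size of Records.json.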
Nice to have a new, updated script, but I have now been able to import my Google data anyway.
It looks like quite a lot of the items in the queue fail, but there are no retries. I can only see 8 hours back in time, but it seems to be just ReverseGeocodingJobs that fail. Will the jobs be retried somehow, or how does it work?
Describe the bug
Importing my (large) location history from Google Takeout fails with no error messages.

Version
freikin/dawarich:latest

Logs