cognoma / core-service

Cognoma Core API

python manage.py loaddata is killed mysteriously while loading mutations #57

Closed kurtwheeler closed 7 years ago

kurtwheeler commented 7 years ago

While attempting to solve https://github.com/cognoma/core-service/issues/56#issuecomment-305877278 @dhimmel and I ran python manage.py loaddata on one of the EC2 instances within the running Docker container. The script got Killed mysteriously. Here is the output of said script:

root@core-service:/code# python manage.py loaddata                  
Loading mutations table...
Processing 1000 rows so far
Processing 2000 rows so far
Processing 3000 rows so far
Processing 4000 rows so far
Processing 5000 rows so far
Processing 6000 rows so far
Processing 7000 rows so far
Bulk loading mutation data...
Killed
root@core-service:/code# $?
bash: 137: command not found

We researched what exit code 137 means: it indicates the process was terminated by SIGKILL (signal 9, since 137 = 128 + 9). We cannot determine what would be sending that signal. @dhimmel thinks it may be caused by running out of memory; however, we monitored memory usage during execution and it never exceeded 23%. We tried executing this command multiple times, and in some runs it died before getting even as far as it did in the output above.
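As a side note on the 137: by shell convention, a process killed by a signal exits with status 128 + the signal number, which is easy to confirm in Python (this snippet is an illustration, not from the thread):

```python
import signal
import subprocess

# Shell convention: a process killed by signal N exits with status 128 + N.
# SIGKILL is signal 9, so a SIGKILLed process yields exit status 137.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()

# subprocess reports a signal death as a negative returncode.
print(proc.returncode)          # -9
print(128 + signal.SIGKILL)     # 137 -- the value $? reported above
```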

This is the relevant code block where the command is getting murdered: https://github.com/cognoma/core-service/blob/7f185bc51a73e1cf0571cb2ed601b381f5c49d79/api/management/commands/loaddata.py#L75-L95

We used the API to inspect the number of diseases, samples, and genes and those tables all seem to have been populated successfully.

@awm33 @stephenshank any ideas?

dcgoss commented 7 years ago

Haven't looked in depth, but I did come across this relevant Stack Overflow answer that you may have already seen: https://stackoverflow.com/questions/19189522/what-does-killed-mean

dcgoss commented 7 years ago

After some investigation using dmesg, it appears the culprit is indeed memory: the python process was killed by the Linux oom-killer. (dmesg output attached as three screenshots, taken 2017-06-19.)

dhimmel commented 7 years ago

Nice forensics! @dcgoss do you want to modify the source to be more memory-efficient?

dcgoss commented 7 years ago

@dhimmel Yeah I'm checking that out now

kurtwheeler commented 7 years ago

I believe those machines should have more memory than that. The containers may be artificially limited by their task definitions, which is a setting within AWS ECS. There is a directory in the infrastructure repo that holds them; it might be worth taking a look at what the settings are.
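For reference, one way to check from inside the container whether such a limit is in effect is to read the cgroup memory limit. The cgroup v1 path below is an assumption about how these hosts are configured, and `parse_cgroup_limit` is just an illustrative helper:

```python
def parse_cgroup_limit(raw: str) -> float:
    """Convert the raw contents of memory.limit_in_bytes to MiB."""
    return int(raw.strip()) / 2**20

# Inside the container (on cgroup v1 hosts), something like:
#   with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
#       print(parse_cgroup_limit(f.read()))
# A 512 MiB ECS hard limit would read back as 536870912 bytes:
print(parse_cgroup_limit("536870912\n"))  # 512.0
```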


dcgoss commented 7 years ago

Solved. I was doing some intense work on the script and rewrote it twice before discovering that, for some reason, on the server the Mutation.objects.bulk_create(mutation_list, batch_size=1000) line was just Mutation.objects.bulk_create(mutation_list). Django was therefore trying to create all of the objects in a single query, using too much memory. It appears that commit 7f185bc51a73e1cf0571cb2ed601b381f5c49d79 was not deployed to the server. I edited the file and loaded the data. Perhaps this is a reminder to address #52.
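For anyone hitting this later, the difference batch_size makes can be sketched in plain Python (`batched` below is an illustrative helper, not part of the codebase):

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of at most batch_size items, so only one batch
    has to be materialized in memory at a time."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Mutation.objects.bulk_create(mutation_list, batch_size=1000) applies
# the same idea: many bounded INSERTs instead of one giant statement.
sizes = [len(b) for b in batched(range(7000), 1000)]
print(sizes)  # [1000, 1000, 1000, 1000, 1000, 1000, 1000]
```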