brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
254 stars 35 forks source link

Runtime for ancestry estimate? #61

Open bhanratt opened 3 years ago

bhanratt commented 3 years ago

Hi, I am running the module on a set of samples. The first part up until the .tsv generated ran fine. It took a bit longer than an hour I think. But the next step after that, which seems to be generating an HTML report, is taking forever. As of right now, it has been on this step for 22 hours. How long should I expect this step to take? It hasn't really provided any useful stdout or stderr so I am not sure if it just needs more time or it has failed. I checked the process in top and it is using a cpu at about 100% still.


total 585K
-rw-rw---- 1 bhanratt 209K Sep  9 21:57 10-448.P20.blood.somalier
-rw-rw---- 1 bhanratt 209K Sep  9 21:57 10-448.P29.blood.somalier
-rw-rw---- 1 bhanratt 209K Sep  9 21:57 6115396-B-1.somalier
-rw-rw---- 1 bhanratt 209K Sep  9 21:57 6115399-B-1.somalier
-rw-rw---- 1 bhanratt 209K Sep  9 21:58 SC_9142_Normal.somalier
somalier ancestry --labels ancestry-labels-1kg.tsv 1kg-somalier/*.somalier ++ samples/*.somalier
somalier version: 0.2.11
[somalier] subset from 17384 to 5451 high call-rate sites (removed 68.64%)
[somalier] time for dimensionality reduction to shape [2504, 5]: 30.84 seconds
[somalier] Epoch:0. loss: 0.20379. accuracy on unseen data: 0.931.  total-time: 0.90
[somalier] Epoch:500. loss: 0.06828. accuracy on unseen data: 0.970.  total-time: 418.08
[somalier] Epoch:1000. loss: 0.03590. accuracy on unseen data: 1.000.  total-time: 831.99
[somalier] Epoch:1500. loss: 0.05903. accuracy on unseen data: 0.960.  total-time: 1291.07
[somalier] Epoch:2000. loss: 0.03614. accuracy on unseen data: 1.000.  total-time: 1698.36
[somalier] Epoch:2500. loss: 0.03091. accuracy on unseen data: 0.990.  total-time: 2089.77
[somalier] Epoch:3000. loss: 0.03701. accuracy on unseen data: 0.990.  total-time: 2484.96
[somalier] Epoch:3500. loss: 0.02943. accuracy on unseen data: 0.990.  total-time: 2890.34
[somalier] Epoch:4000. loss: 0.03663. accuracy on unseen data: 0.980.  total-time: 3375.05
[somalier] Epoch:4500. loss: 0.03792. accuracy on unseen data: 0.980.  total-time: 3815.02
[somalier] Epoch:5000. loss: 0.02379. accuracy on unseen data: 1.000.  total-time: 4217.97
[somalier] breaking with trained model at this accuracy and loss
[somalier] reduced query set to: [5, 5]
[somalier] wrote text file to somalier-ancestry.somalier-ancestry.tsv

$:~/somalier$ date
Fri Sep 11 12:19:56 PDT 2020
$:~/somalier$ ls -lht | head
total 291M
-rw-rw---- 1 bhanratt 221K Sep 10 13:56 somalier-ancestry.somalier-ancestry.tsv
-rw-rw---- 1 bhanratt    0 Sep 10 13:54 somalier-ancestry.somalier-ancestry.html

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                                              
11937 bhanratt  20   0 2230072 0.995g   2828 R 100.0  0.1   3877:14 somalier   ```
brentp commented 3 years ago

This is on only 5 samples? Should be nearly instant, so probably something is going wrong. Can you share the somalier files?

somalier_debug.gz

You could also try running with the debug build (attached here) and see if you get any additional information (though it will be slower).

bhanratt commented 3 years ago

This is on only 5 samples? Should be nearly instant, so probably something is going wrong. Can you share the somalier files?

somalier_debug.gz

You could also try running with the debug build (attached here) and see if you get any additional information (though it will be slower).

I emailed you the somalier files. Let me know if I should send them in some other way.

brentp commented 3 years ago

Hi, so these 5 samples all have IBS0 of 0 to each other. Are they all from the same person? That is a little weird, but I can get this to run very quickly:

 time ./somalier ancestry --labels ../scripts/ancestry-labels-1kg.tsv  ./1kg/1kg-somalier/*.somalier ++ *.somalier -o n #--n-pcs 5
somalier version: 0.2.11
[somalier] subset from 17384 to 5451 high call-rate sites (removed 68.64%)
[somalier] time for dimensionality reduction to shape [2504, 5]: 0.61 seconds
[somalier] Epoch:0. loss: 0.64646. accuracy on unseen data: 0.842.  total-time: 0.06
[somalier] Epoch:500. loss: 0.11340. accuracy on unseen data: 0.950.  total-time: 1.62
[somalier] Epoch:1000. loss: 0.08213. accuracy on unseen data: 0.980.  total-time: 3.06
[somalier] Epoch:1500. loss: 0.13003. accuracy on unseen data: 0.970.  total-time: 4.65
[somalier] Epoch:2000. loss: 0.11380. accuracy on unseen data: 0.950.  total-time: 6.09
[somalier] Epoch:2500. loss: 0.11965. accuracy on unseen data: 0.970.  total-time: 7.70
[somalier] Epoch:3000. loss: 0.08237. accuracy on unseen data: 0.960.  total-time: 9.07
[somalier] Epoch:3500. loss: 0.07352. accuracy on unseen data: 0.980.  total-time: 10.53
[somalier] Epoch:4000. loss: 0.13536. accuracy on unseen data: 0.970.  total-time: 11.98
[somalier] Epoch:4500. loss: 0.14505. accuracy on unseen data: 0.950.  total-time: 13.46
[somalier] Epoch:5000. loss: 0.06106. accuracy on unseen data: 0.980.  total-time: 14.98
[somalier] Epoch:5500. loss: 0.15694. accuracy on unseen data: 0.921.  total-time: 16.52
[somalier] Epoch:6000. loss: 0.13266. accuracy on unseen data: 0.970.  total-time: 17.98
[somalier] Epoch:6500. loss: 0.04489. accuracy on unseen data: 0.980.  total-time: 19.56
[somalier] Epoch:7000. loss: 0.10876. accuracy on unseen data: 0.941.  total-time: 20.92
[somalier] Epoch:7500. loss: 0.09172. accuracy on unseen data: 0.960.  total-time: 22.38
[somalier] Epoch:8000. loss: 0.20483. accuracy on unseen data: 0.931.  total-time: 24.05
[somalier] Epoch:8500. loss: 0.10724. accuracy on unseen data: 0.970.  total-time: 25.44
[somalier] Epoch:9000. loss: 0.09659. accuracy on unseen data: 0.970.  total-time: 27.06
[somalier] Epoch:9500. loss: 0.08204. accuracy on unseen data: 0.980.  total-time: 28.64
[somalier] Epoch:10000. loss: 0.11874. accuracy on unseen data: 0.950.  total-time: 30.14
[somalier] reduced query set to: [5, 5]
[somalier] wrote text file to n.somalier-ancestry.tsv
[somalier] wrote html file to n.somalier-ancestry.html

real    0m33.094s
user    6m14.218s
sys 0m3.814s

maybe you are on an older machine without sse2? you can check with:

grep sse2 /proc/cpuinfo

other than that, I don't see why it would be taking so long.

davmlaw commented 3 years ago

FYI I've also noticed some performance issues with ancestry, if the amount of available CPUs is low

For instance on a 4 core machine, no other load, a job takes

real 0m45.735s user 1m38.921s

But when running 3 VEP tasks (1 core each), Somalier used 150-190% CPU according to top and I killed it after 40 minutes, without finishing - is it possible the scheduler is eating up the CPU here and not doing any work? Could you make an option to drop to a better performing single core mode?

I can reproduce this pretty well, ie if I suspend the Perl jobs after epoch 1000 you can see it rapidly picks up the pace:

[somalier] subset from 17384 to 15196 high call-rate sites (removed 12.59%)
[somalier] time for dimensionality reduction to shape [2504, 5]: 22.05 seconds
[somalier] Epoch:0. loss: 0.13361. accuracy on unseen data: 0.960.  total-time: 0.29
[somalier] Epoch:500. loss: 0.05091. accuracy on unseen data: 0.980.  total-time: 72.86
[somalier] Epoch:1000. loss: 0.04887. accuracy on unseen data: 0.980.  total-time: 100.80
[somalier] Epoch:1500. loss: 0.06615. accuracy on unseen data: 0.980.  total-time: 112.34
[somalier] Epoch:2000. loss: 0.03106. accuracy on unseen data: 0.990.  total-time: 113.43
[somalier] Epoch:2500. loss: 0.03452. accuracy on unseen data: 0.990.  total-time: 114.53
[somalier] Epoch:3000. loss: 0.08126. accuracy on unseen data: 0.970.  total-time: 115.71
[somalier] Epoch:3500. loss: 0.06027. accuracy on unseen data: 0.970.  total-time: 116.82
[somalier] Epoch:4000. loss: 0.05955. accuracy on unseen data: 0.970.  total-time: 117.94
[somalier] Epoch:4500. loss: 0.06031. accuracy on unseen data: 0.980.  total-time: 119.02
[somalier] Epoch:5000. loss: 0.05147. accuracy on unseen data: 0.990.  total-time: 120.30
[somalier] Epoch:5500. loss: 0.03082. accuracy on unseen data: 0.990.  total-time: 121.44
[somalier] Epoch:6000. loss: 0.01930. accuracy on unseen data: 0.990.  total-time: 122.56
[somalier] Epoch:6500. loss: 0.01656. accuracy on unseen data: 1.000.  total-time: 123.69
[somalier] breaking with trained model at this accuracy and loss
[somalier] reduced query set to: [14, 5]
[somalier] wrote text file to somalier-ancestry.somalier-ancestry.tsv
[somalier] wrote html file to somalier-ancestry.somalier-ancestry.html
brentp commented 3 years ago

hi, you can try:

export OMP_NUM_THREADS=1
somalier ancestry ...

you can also try setting to 0, though I'm not sure what that does. ancestry is still experimental and needs more development time.

davmlaw commented 3 years ago

This works for me.

Previously I killed a job after 1 hours 12 mins, while finished in under a minute with 1 thread