Open stefandiederich opened 3 years ago
We admittedly don't do a great job propagating the Python logging back through Java.
@asmirnov239 do you have any suggestions for debugging?
@stefandiederich Could you try using count files in TSV format (instead of HDF5) and see if that works? The CollectReadCounts tool has an option to output TSV files.
If it doesn't, I would need some more information about the inputs to see what's going on.
Thanks!
@ldgauthier @asmirnov239
Thanks for your quick reply! In the meantime I investigated a little more by copying the temp dir before the software exited and running the Python command on its own. With that I got the error Segmentation fault (core dumped). Googling that suggested it might be some sort of memory problem. The server has 512 GB of RAM and I did not see more than 10 percent used before GermlineCNVCaller stopped, but I scattered the genome interval_list into 5 parts and ran them separately, with success.
The jobs have been running for about 24h now, but I think that's normal for 44 WGS samples and 5 scatters.
One other question: We also have 4x A100 GPUs from NVIDIA in that machine. Is the algorithm somehow able to use them?
Bests Stefan
@stefandiederich Great, did the jobs finish? For reference, we usually scatter in blocks of 12500 intervals, for which we usually end up needing 16GB of RAM.
We do not support running gCNV on GPUs. In theory the underlying library that we use, Theano, supports it, but there might be some CUDA/Theano incompatibility issues with the newest versions of CUDA.
What environment are you running in? I would highly suggest using the GATK docker if you aren't already, and also increasing the number of shards (~10000-20000 intervals per shard as @asmirnov239 suggested).
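To make the shard sizes above concrete, here is a minimal Python sketch of splitting a list of intervals into fixed-size shards. It is only an illustration of the arithmetic, not how GATK itself scatters interval lists; the function names `shard` and `shard_count` and the total of 30000 intervals are hypothetical.

```python
from math import ceil
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], shard_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most shard_size items (hypothetical helper)."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    for start in range(0, len(items), shard_size):
        yield list(items[start:start + shard_size])


def shard_count(n_intervals: int, shard_size: int) -> int:
    """Number of shards needed to cover n_intervals intervals."""
    return ceil(n_intervals / shard_size)


if __name__ == "__main__":
    # Assumed example: 30000 intervals in blocks of 12500 gives 3 shards.
    print(shard_count(30000, 12500))
```

With blocks of 12500 intervals, the last shard simply holds whatever remains, so shards are not guaranteed to be equal in size.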
The server we are using has 64 cores and 512GB RAM, but the job has not finished so far. So I will stop it now and use more scatters, as @mwalker174 and @asmirnov239 suggested.
Bests Stefan
This problem is caused by an incompatibility between Skylake and later CPUs and one of the ML libraries used by the Python environment. The only solution is to fall back to the 4.1.7.0 Docker image and use that for gCNV. The problem still persists, and I even opened an issue about it some time ago for 4.1.9.0.
https://gatk.broadinstitute.org/hc/en-us/community/posts/360075117572-GermlineCNVCaller-edge-case-
Hi,
I tried to build a gCNV model and got the error
python exited with 139
but can't figure out the cause of this error. Can you please help me with that error message? I attached the whole command and output here. Thanks for any help. Stefan
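A note on the exit status: shells conventionally report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11, which is SIGSEGV on Linux. That matches the Segmentation fault seen when the Python command is run on its own. A small sketch for decoding such a status follows; the helper name `describe_exit_status` is made up for illustration and is not part of GATK.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    Values above 128 conventionally encode 128 + signal number.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(139))
```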