broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.7k stars 591 forks source link

GermlineCNVCaller - python exited with 139 #7234

Open stefandiederich opened 3 years ago

stefandiederich commented 3 years ago

Hi,

I tried to build a gCNV model and got the error python exited with 139 but can#t figure out what is the cause of this error. Can you please help me with that error message? I attended the whole command and output here

(gatk4.2.0.0) k-hg-srv3:/media/Data/AnnotationDBs/CNV/Genom/hdf5 # gatk GermlineCNVCaller --run-mode COHORT -L ../Genom.filtered.interval_list  -I 0028-21.hdf5 -I 0045-21.hdf5 -I 0098-18.hdf5 -I 0156-21.hdf5 -I 0429-20.hdf5 -I 0779-18.hdf5 -I 1030-20.hdf5 -I 1098-13.hdf5 -I 1450-20.hdf5 -I 1495-17.hdf5 -I 1575-20.hdf5 -I 1586-18.hdf5 -I 1658-14.hdf5 -I 1974-20.hdf5 -I 2008-20.hdf5 -I 0030-21.hdf5 -I 0058-21.hdf5 -I 0129-20.hdf5 -I 0249-04.hdf5 -I 0614-20.hdf5 -I 0834-19.hdf5 -I 1080-20.hdf5 -I 1331-18.hdf5 -I 1460-18.hdf5 -I 1498-18.hdf5 -I 1576-20.hdf5 -I 1592-20.hdf5 -I 1716-15.hdf5 -I 1985-20.hdf5 -I 2167-20.hdf5 -I 0038-21.hdf5 -I 0094-21.hdf5 -I 0139-18.hdf5 -I 0345-20.hdf5 -I 0641-18.hdf5 -I 0949-20.hdf5 -I 1081-20.hdf5 -I 1416-20.hdf5 -I 1491-20.hdf5 -I 1553-18.hdf5 -I 1577-20.hdf5 -I 1600-20.hdf5 -I 1720-20.hdf5 -I 1995-20.hdf5 --contig-ploidy-calls ../ploidy-calls/ --annotated-intervals ../Genom.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output /media/Data/AnnotationDBs/CNV/Genom --output-prefix CNV --tmp-dir /media/Data/tmp/
Using GATK jar /usr/BioinfSoftware/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /usr/BioinfSoftware/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar GermlineCNVCaller --run-mode COHORT -L ../Genom.filtered.interval_list -I 0028-21.hdf5 -I 0045-21.hdf5 -I 0098-18.hdf5 -I 0156-21.hdf5 -I 0429-20.hdf5 -I 0779-18.hdf5 -I 1030-20.hdf5 -I 1098-13.hdf5 -I 1450-20.hdf5 -I 1495-17.hdf5 -I 1575-20.hdf5 -I 1586-18.hdf5 -I 1658-14.hdf5 -I 1974-20.hdf5 -I 2008-20.hdf5 -I 0030-21.hdf5 -I 0058-21.hdf5 -I 0129-20.hdf5 -I 0249-04.hdf5 -I 0614-20.hdf5 -I 0834-19.hdf5 -I 1080-20.hdf5 -I 1331-18.hdf5 -I 1460-18.hdf5 -I 1498-18.hdf5 -I 1576-20.hdf5 -I 1592-20.hdf5 -I 1716-15.hdf5 -I 1985-20.hdf5 -I 2167-20.hdf5 -I 0038-21.hdf5 -I 0094-21.hdf5 -I 0139-18.hdf5 -I 0345-20.hdf5 -I 0641-18.hdf5 -I 0949-20.hdf5 -I 1081-20.hdf5 -I 1416-20.hdf5 -I 1491-20.hdf5 -I 1553-18.hdf5 -I 1577-20.hdf5 -I 1600-20.hdf5 -I 1720-20.hdf5 -I 1995-20.hdf5 --contig-ploidy-calls ../ploidy-calls/ --annotated-intervals ../Genom.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output /media/Data/AnnotationDBs/CNV/Genom --output-prefix CNV --tmp-dir /media/Data/tmp/
17:28:28.660 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/BioinfSoftware/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Apr 27, 2021 5:28:28 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
17:28:28.779 INFO  GermlineCNVCaller - ------------------------------------------------------------
17:28:28.779 INFO  GermlineCNVCaller - The Genome Analysis Toolkit (GATK) v4.2.0.0
17:28:28.779 INFO  GermlineCNVCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:28:28.779 INFO  GermlineCNVCaller - Executing as root@k-hg-srv3 on Linux v5.3.18-24.37-default amd64
17:28:28.780 INFO  GermlineCNVCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.8+10-suse-3.45.1-x8664
17:28:28.780 INFO  GermlineCNVCaller - Start Date/Time: April 27, 2021 at 5:28:28 PM CEST
17:28:28.780 INFO  GermlineCNVCaller - ------------------------------------------------------------
17:28:28.780 INFO  GermlineCNVCaller - ------------------------------------------------------------
17:28:28.781 INFO  GermlineCNVCaller - HTSJDK Version: 2.24.0
17:28:28.781 INFO  GermlineCNVCaller - Picard Version: 2.25.0
17:28:28.781 INFO  GermlineCNVCaller - Built for Spark Version: 2.4.5
17:28:28.781 INFO  GermlineCNVCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:28:28.781 INFO  GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:28:28.781 INFO  GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:28:28.782 INFO  GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:28:28.782 INFO  GermlineCNVCaller - Deflater: IntelDeflater
17:28:28.782 INFO  GermlineCNVCaller - Inflater: IntelInflater
17:28:28.782 INFO  GermlineCNVCaller - GCS max retries/reopens: 20
17:28:28.782 INFO  GermlineCNVCaller - Requester pays: disabled
17:28:28.782 INFO  GermlineCNVCaller - Initializing engine
17:28:34.716 INFO  GermlineCNVCaller - Done initializing engine
17:28:34.723 INFO  GermlineCNVCaller - Intervals specified...
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
17:28:35.689 INFO  FeatureManager - Using codec IntervalListCodec to read file file:///media/Data/AnnotationDBs/CNV/Genom/hdf5/../Genom.filtered.interval_list
17:28:42.892 INFO  IntervalArgumentCollection - Processing 2741406000 bp from intervals
17:28:43.237 INFO  GermlineCNVCaller - Reading and validating annotated intervals...
17:28:51.740 INFO  GermlineCNVCaller - GC-content annotations for intervals found; explicit GC-bias correction will be performed...
17:28:57.410 INFO  GermlineCNVCaller - Running the tool in COHORT mode...
17:28:57.410 INFO  GermlineCNVCaller - Validating and aggregating data from input read-count files...
17:28:57.940 INFO  GermlineCNVCaller - Aggregating read-count file 0028-21.hdf5 (1 / 44)
17:29:00.837 INFO  GermlineCNVCaller - Aggregating read-count file 0045-21.hdf5 (2 / 44)
17:29:03.690 INFO  GermlineCNVCaller - Aggregating read-count file 0098-18.hdf5 (3 / 44)
17:29:06.658 INFO  GermlineCNVCaller - Aggregating read-count file 0156-21.hdf5 (4 / 44)
17:29:09.435 INFO  GermlineCNVCaller - Aggregating read-count file 0429-20.hdf5 (5 / 44)
17:29:12.235 INFO  GermlineCNVCaller - Aggregating read-count file 0779-18.hdf5 (6 / 44)
17:29:14.939 INFO  GermlineCNVCaller - Aggregating read-count file 1030-20.hdf5 (7 / 44)
17:29:17.822 INFO  GermlineCNVCaller - Aggregating read-count file 1098-13.hdf5 (8 / 44)
17:29:20.668 INFO  GermlineCNVCaller - Aggregating read-count file 1450-20.hdf5 (9 / 44)
17:29:23.485 INFO  GermlineCNVCaller - Aggregating read-count file 1495-17.hdf5 (10 / 44)
17:29:26.245 INFO  GermlineCNVCaller - Aggregating read-count file 1575-20.hdf5 (11 / 44)
17:29:29.120 INFO  GermlineCNVCaller - Aggregating read-count file 1586-18.hdf5 (12 / 44)
17:29:31.955 INFO  GermlineCNVCaller - Aggregating read-count file 1658-14.hdf5 (13 / 44)
17:29:34.810 INFO  GermlineCNVCaller - Aggregating read-count file 1974-20.hdf5 (14 / 44)
17:29:37.619 INFO  GermlineCNVCaller - Aggregating read-count file 2008-20.hdf5 (15 / 44)
17:29:40.488 INFO  GermlineCNVCaller - Aggregating read-count file 0030-21.hdf5 (16 / 44)
17:29:43.405 INFO  GermlineCNVCaller - Aggregating read-count file 0058-21.hdf5 (17 / 44)
17:29:46.303 INFO  GermlineCNVCaller - Aggregating read-count file 0129-20.hdf5 (18 / 44)
17:29:49.079 INFO  GermlineCNVCaller - Aggregating read-count file 0249-04.hdf5 (19 / 44)
17:29:51.843 INFO  GermlineCNVCaller - Aggregating read-count file 0614-20.hdf5 (20 / 44)
17:29:54.900 INFO  GermlineCNVCaller - Aggregating read-count file 0834-19.hdf5 (21 / 44)
17:29:57.719 INFO  GermlineCNVCaller - Aggregating read-count file 1080-20.hdf5 (22 / 44)
17:30:00.595 INFO  GermlineCNVCaller - Aggregating read-count file 1331-18.hdf5 (23 / 44)
17:30:03.394 INFO  GermlineCNVCaller - Aggregating read-count file 1460-18.hdf5 (24 / 44)
17:30:06.217 INFO  GermlineCNVCaller - Aggregating read-count file 1498-18.hdf5 (25 / 44)
17:30:09.074 INFO  GermlineCNVCaller - Aggregating read-count file 1576-20.hdf5 (26 / 44)
17:30:11.852 INFO  GermlineCNVCaller - Aggregating read-count file 1592-20.hdf5 (27 / 44)
17:30:14.711 INFO  GermlineCNVCaller - Aggregating read-count file 1716-15.hdf5 (28 / 44)
17:30:17.375 INFO  GermlineCNVCaller - Aggregating read-count file 1985-20.hdf5 (29 / 44)
17:30:19.997 INFO  GermlineCNVCaller - Aggregating read-count file 2167-20.hdf5 (30 / 44)
17:30:22.744 INFO  GermlineCNVCaller - Aggregating read-count file 0038-21.hdf5 (31 / 44)
17:30:25.520 INFO  GermlineCNVCaller - Aggregating read-count file 0094-21.hdf5 (32 / 44)
17:30:28.090 INFO  GermlineCNVCaller - Aggregating read-count file 0139-18.hdf5 (33 / 44)
17:30:31.005 INFO  GermlineCNVCaller - Aggregating read-count file 0345-20.hdf5 (34 / 44)
17:30:33.666 INFO  GermlineCNVCaller - Aggregating read-count file 0641-18.hdf5 (35 / 44)
17:30:36.381 INFO  GermlineCNVCaller - Aggregating read-count file 0949-20.hdf5 (36 / 44)
17:30:39.115 INFO  GermlineCNVCaller - Aggregating read-count file 1081-20.hdf5 (37 / 44)
17:30:41.937 INFO  GermlineCNVCaller - Aggregating read-count file 1416-20.hdf5 (38 / 44)
17:30:44.527 INFO  GermlineCNVCaller - Aggregating read-count file 1491-20.hdf5 (39 / 44)
17:30:47.177 INFO  GermlineCNVCaller - Aggregating read-count file 1553-18.hdf5 (40 / 44)
17:30:49.901 INFO  GermlineCNVCaller - Aggregating read-count file 1577-20.hdf5 (41 / 44)
17:30:52.733 INFO  GermlineCNVCaller - Aggregating read-count file 1600-20.hdf5 (42 / 44)
17:30:55.582 INFO  GermlineCNVCaller - Aggregating read-count file 1720-20.hdf5 (43 / 44)
17:30:58.449 INFO  GermlineCNVCaller - Aggregating read-count file 1995-20.hdf5 (44 / 44)
17:40:55.029 INFO  GermlineCNVCaller - Shutting down engine
[April 27, 2021 at 5:40:55 PM CEST] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 12.44 minutes.
Runtime.totalMemory()=11144265728
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 139
Command Line: python /media/Data/tmp/cohort_denoising_calling.4407709016674150942.py --ploidy_calls_path=/media/Data/AnnotationDBs/CNV/Genom/ploidy-calls --output_calls_path=/media/Data/AnnotationDBs/CNV/Genom/CNV-calls --output_tracking_path=/media/Data/AnnotationDBs/CNV/Genom/CNV-tracking --random_seed=1984 --modeling_interval_list=/media/Data/tmp/intervals8808982738430140650.tsv --output_model_path=/media/Data/AnnotationDBs/CNV/Genom/CNV-model --enable_explicit_gc_bias_modeling=True --read_count_tsv_files /media/Data/tmp/0028-2117281826103999737636.tsv /media/Data/tmp/0045-2114178578414165773973.tsv /media/Data/tmp/0098-187303293573572392847.tsv /media/Data/tmp/0156-2116276973316025582383.tsv /media/Data/tmp/0429-2011369855358784624026.tsv /media/Data/tmp/0779-182946833054276408036.tsv /media/Data/tmp/1030-2011821460036972619304.tsv /media/Data/tmp/1098-1316812566706274946839.tsv /media/Data/tmp/1450-2017928463889973924144.tsv /media/Data/tmp/1495-175366865479116778359.tsv /media/Data/tmp/1575-209069379574618821045.tsv /media/Data/tmp/1586-1812077372458596443770.tsv /media/Data/tmp/1658-1416796345457468836767.tsv /media/Data/tmp/1974-20963075643313577244.tsv /media/Data/tmp/2008-201684629477730666859.tsv /media/Data/tmp/0030-2111734942852719299107.tsv /media/Data/tmp/0058-2112304138281766695594.tsv /media/Data/tmp/0129-2017026491127246322921.tsv /media/Data/tmp/0249-0410541106753972677491.tsv /media/Data/tmp/0614-2017496174200039498728.tsv /media/Data/tmp/0834-199803801680440061953.tsv /media/Data/tmp/1080-2013078864772320148454.tsv /media/Data/tmp/1331-1814795580985295275657.tsv /media/Data/tmp/1460-1812224511017021060117.tsv /media/Data/tmp/1498-1815280588962310294947.tsv /media/Data/tmp/1576-207045321202572611926.tsv /media/Data/tmp/1592-209860253943431763905.tsv /media/Data/tmp/1716-1510666032120924379526.tsv /media/Data/tmp/1985-2014749030470842993852.tsv /media/Data/tmp/2167-202903493062308891070.tsv /media/Data/tmp/0038-21598990479976813763.tsv /media/Data/tmp/0094-216854451375226115483.tsv /media/Data/tmp/0139-185406367434721880080.tsv /media/Data/tmp/0345-2014163238955879667062.tsv /media/Data/tmp/0641-181435783797181261458.tsv /media/Data/tmp/0949-2010483620729129352550.tsv /media/Data/tmp/1081-2016076156488145247843.tsv /media/Data/tmp/1416-2017418107116732797290.tsv /media/Data/tmp/1491-2010215348808759639313.tsv /media/Data/tmp/1553-1815589928704402438644.tsv /media/Data/tmp/1577-2013574592700415434279.tsv /media/Data/tmp/1600-2011997428618111716171.tsv /media/Data/tmp/1720-209532087127617815835.tsv /media/Data/tmp/1995-201322738780974211823.tsv --psi_s_scale=1.000000e-04 --mapping_error_rate=1.000000e-02 --depth_correction_tau=1.000000e+04 --q_c_expectation_mode=hybrid --max_bias_factors=5 --psi_t_scale=1.000000e-03 --log_mean_bias_std=1.000000e-01 --init_ard_rel_unexplained_variance=1.000000e-01 --num_gc_bins=20 --gc_curve_sd=1.000000e+00 --active_class_padding_hybrid_mode=50000 --enable_bias_factors=True --disable_bias_factors_in_active_class=False --p_alt=1.000000e-06 --cnv_coherence_length=1.000000e+04 --max_copy_number=5 --p_active=0.010000 --class_coherence_length=10000.000000 --learning_rate=1.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.900000e-01 --log_emission_samples_per_round=50 --log_emission_sampling_rounds=10 --log_emission_sampling_median_rel_error=5.000000e-03 --max_advi_iter_first_epoch=5000 --max_advi_iter_subsequent_epochs=200 --min_training_epochs=10 --max_training_epochs=50 --initial_temperature=1.500000e+00 --num_thermal_advi_iters=2500 --convergence_snr_averaging_window=500 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=10 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=1.000000e+00 --disable_caller=false --disable_sampler=false --disable_annealing=false
Stdout: 17:31:06.046 INFO cohort_denoising_calling - THEANO_FLAGS environment variable has been set to: device=cpu,floatX=float64,optimizer=fast_run,compute_test_value=ignore,openmp=true,blas.ldflags=-lmkl_rt,openmp_elemwise_minsize=10
17:31:16.537 INFO gcnvkernel.io.io_intervals_and_counts - The given interval list provides the following interval annotations: {'GC_CONTENT'}
17:31:23.180 INFO cohort_denoising_calling - Loading 44 read counts file(s)...
17:34:27.362 INFO gcnvkernel.io.io_metadata - Loading germline contig ploidy and global read depth metadata...
17:40:48.713 INFO gcnvkernel.tasks.task_cohort_denoising_calling - Instantiating the denoising model (warm-up)...

Stderr:
        at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
        at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:340)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

Thanks for any help. Stefan

ldgauthier commented 3 years ago

We admittedly don't do a great job propagating the python logging back through java.

@asmirnov239 do you have any suggestions for debugging?

asmirnov239 commented 3 years ago

@stefandiederich Could you try using count files in TSV format (instead of HDF5) and see if that works? CollectReadCounts tool has an option to output TSV files.

If it doesn't, I would need some more information about the inputs to see what's going on.

Thanks!

stefandiederich commented 3 years ago

@ldgauthier @asmirnov239 Thanks for your quick reply! In the meantime I investigated a little more by coping the temp dir before the Software exited and started the python command by its own. With that I got the error Segmentation fault (core dumped). With googling that I got to the point, that it might be some sort of memory problem. The server has 512 GB and I did not see mor than 10 percent used until the GermlineCNVCaller stops but I did Scatter the Genome interval_list into 5 parts and run them seperatly. With success. The jobs are running right now for about 24h. But I think thats normal for 44 WGS samples and 5 scatters.

One other question: We also have 4x A100 GPUs from NVIDIA in that machine. Is the algorithm somehow able to use them?

Bests Stefan

asmirnov239 commented 3 years ago

@stefandiederich Great, did the jobs finish? For the reference we usually scatter in blocks of size 12500 for which we usually end up needing 16GB of RAM.

We do not support running on gCNV on GPUs, although in theory the underlying library Theano that we use supports it, although there might be some CUDA/Theano incompatibility issues with newest versions of CUDA.

mwalker174 commented 3 years ago

What environment are you running in? I would highly suggest using the GATK docker if you aren't already, and also increasing the number of shards (~10000-20000 intervals per shard as @asmirnov239 suggested).

stefandiederich commented 3 years ago

The server we are using has 64 cores an 512GB RAM. But the joub did not finish up to now. So I will stop that now and usre more scatters like @mwalker174 @asmirnov239 suggested.

Bests Stefan

gokalpcelik commented 3 years ago

This problem is about the skylake and later cpu incompatibility with one of the ML libraries used by the python environment. The only solution is to fall back to 4.1.7.0 docker image and use that for gCNV. This problem still persists and there is even one issue that I opened sometime ago for 4.1.9.0.

https://gatk.broadinstitute.org/hc/en-us/community/posts/360075117572-GermlineCNVCaller-edge-case-