MBoemo / DNAscent

Software for detecting regions of BrdU and EdU incorporation in Oxford Nanopore reads.
https://www.boemogroup.org/
GNU General Public License v3.0

DNAscent 4.0.3: both the singularity image and compiled binary are crashing on Ubuntu 22.04 system #67

Closed by ucsc-eshell 3 weeks ago

ucsc-eshell commented 1 month ago

Hello,

I support 2 Ubuntu 22.04 workstations, both of which have working binaries of DNAscent 4.0.1 and 4.0.2. However, DNAscent 4.0.3 crashes during the detect step. The behavior appears to be the same for both the compiled binary and the singularity image; the output is identical:

$ DNAscent -b /data/c/User/DEK_20240417_41-hmwDEK28/dorado/dorado-v0.6.2/20240606_0937/transfer-20240610_1340/sort.bam -r /home/user/Ref-Genomes/w303_0_User_v2-edited.fa -i index.dnascent -o detect_output.bam -t 10 --GPU 0

DNAscent-4.0.3: Relink `/opt/DNAscent-4.0.3/tensorflow/lib/libtensorflow_framework.so.2' with `/lib/x86_64-linux-gnu/libz.so.1' for IFUNC symbol `crc32_z'
2024-09-18 10:59:00.644027: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Loading DNAscent index... ok.
2024-09-18 10:59:02.704221: I tensorflow/cc/saved_model/reader.cc:32] Reading SavedModel from: /opt/DNAscent-4.0.3/dnn_models/detect_model_BrdUEdU_DNAr10_4_1/
2024-09-18 10:59:02.812844: I tensorflow/cc/saved_model/reader.cc:55] Reading meta graph with tags { serve }
2024-09-18 10:59:02.812886: I tensorflow/cc/saved_model/reader.cc:93] Reading SavedModel debug info (if present) from: /opt/DNAscent-4.0.3/dnn_models/detect_model_BrdUEdU_DNAr10_4_1/
2024-09-18 10:59:02.812932: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-18 10:59:02.813051: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2024-09-18 10:59:02.813864: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2024-09-18 10:59:02.838834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:65:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 871.81GiB/s
2024-09-18 10:59:02.838882: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2024-09-18 10:59:02.847041: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2024-09-18 10:59:02.847118: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2024-09-18 10:59:02.882837: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2024-09-18 10:59:02.883088: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2024-09-18 10:59:02.883682: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2024-09-18 10:59:02.885809: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2024-09-18 10:59:02.885930: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2024-09-18 10:59:02.886191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2024-09-18 10:59:02.886208: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2024-09-18 10:59:03.386938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-09-18 10:59:03.386973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2024-09-18 10:59:03.386978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2024-09-18 10:59:03.387373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21391 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:65:00.0, compute capability: 8.6)
2024-09-18 10:59:03.722707: I tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2024-09-18 10:59:03.772215: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2400000000 Hz
2024-09-18 10:59:04.229290: I tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /opt/DNAscent-4.0.3/dnn_models/detect_model_BrdUEdU_DNAr10_4_1/
2024-09-18 10:59:04.517040: I tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 1812821 microseconds.
Importing reference... ok.
Opening bam file... ok.
Scanning bam file...ok.
[> ] 0% 0/11037 0hr 0min 0sec failed: Segmentation fault (core dumped)

Please let me know if I should collect any additional information.

MBoemo commented 1 month ago

A few quick checks:

rstraver commented 1 month ago

Hi,

I'm a "new user" so I haven't tested previous versions, and as my data is pod5 format this version seemed the most sensible to use. For me it also crashes with a Segmentation fault (core dumped), so perhaps it is related. I tried both compiling from source and the Singularity container, and both CPU and GPU, same issue, this is a CPU output example:

DNAscent \
>     detect \
>     -b ./problemread.bam \
>     -r ../../../../raw/GCA_000001405.15_GRCh38_full_analysis_set.fna \
>     -i ./index.dnascent \
>     -o ./problemread_output.bam \
>     -t 1
2024-09-23 14:23:20.611765: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-09-23 14:23:20.611836: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Loading DNAscent index... ok.
2024-09-23 14:23:21.647860: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2024-09-23 14:23:22.121375: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3100000000 Hz
Importing reference... ok.
Opening bam file... ok.
Scanning bam file...ok.
Segmentation fault (core dumped)

I'm on a different OS though:

NAME="Rocky Linux"
VERSION="8.10 (Green Obsidian)"

For me the crash isn't on the first read it processes; a number of them work until it runs into a particular one, which I hunted down. If you want, I can supply a pod5 and bam file with the specific read that triggers the issue for me. It happens to be a read whose parent ID differs from its own ID, so I wondered whether that is the problem, although when I browsed through the source code a little it looks like you did implement something to take that into account, so I have no clue how to fix this. As far as I know my data came directly from Dorado, so I don't think another tool changed the bam tags or anything like that.
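
In case it helps anyone else reproduce this, here is a rough sketch of isolating a single problematic read into a minimal bam/pod5 pair. The file names and read ID below are placeholders, and the pod5 filter flags may differ between pod5 package versions.

# Hypothetical read ID of the crashing read; replace with the real one.
# For a split read, also add the parent id from its pi tag, since that is
# the id the signal is stored under in the pod5.
echo "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" > bad_read_ids.txt

# Keep just that alignment from the full bam (samtools >= 1.12 supports -N).
samtools view -b -N bad_read_ids.txt full.bam > segfault_read.bam
samtools index segfault_read.bam

# Keep just the matching signal from the pod5.
pod5 filter full.pod5 --ids bad_read_ids.txt --output segfault_read.pod5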

MBoemo commented 1 month ago

Thanks very much @rstraver - if you could send me the bam and pod5 file that corresponds to the problematic read that would be extremely helpful. These split reads are taken into account but it's of course possible that there's a lingering issue that didn't trip on any of our test sets.

rstraver commented 1 month ago

Here it is, I hope it helps: segfault_read.zip

ucsc-eshell commented 1 month ago

@MBoemo Thank you for your prompt reply, sorry for my delayed response.

The bam file involved in my reported issue was indeed filtered/processed, and a raw bam file taken directly from dorado does not cause the segfault. In case it is helpful, the researcher who reported this to me says that the same processed bam file appeared to work without issue on older versions of DNAscent.

I'll leave this issue open for now but I think my issue is resolved.

MBoemo commented 1 month ago

Thanks @ucsc-eshell that is indeed helpful. Sounds like it's the same underlying problem for the two of you. I'll try to take a look this week and push a fix.

dkastl1 commented 1 month ago

Hi,

To share some more details for your information (I was the user having problems that ucsc-eshell posted here):

Initially, I was using a processed bam file as input, which further testing showed cannot be done with this version of DNAscent. It's the same processing we had done with previous versions of DNAscent, but something about it now crashes detect (which is fine; just an FYI).

However, someone at ONT gave us a dorado argument to emit a fastq file rather than a mapped bam file, so that dorado would fit seamlessly into our established workflow. I then mapped that fastq file (output directly from dorado), did no additional processing on it before putting it into detect, and it also causes a segfault crash. When I instead did the mapping with dorado, so that I got a bam file directly from dorado, that is the file that worked with detect. So there is something about dorado's fastq output that the detect function has a problem with as well.

I also tested processing the bam file taken directly from dorado through our pipeline (since my initial processed bam stemmed from the problematic fastq), and it still gave the segfault error. So whether dorado's output is bam or fastq, detect crashes either way.
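
For reference, here is a rough sketch of the two routes being compared (model name, paths, and reference file are placeholders, and exact dorado options can vary between versions). One plausible difference is that the per-read tags dorado writes into its bam output (e.g. mv, pi, sp) are not carried through a plain fastq-then-minimap2 route.

# Route that worked: basecall and align in one step so dorado emits a mapped bam directly.
dorado basecaller sup pod5_dir/ --reference ref.fa | samtools sort -o calls.sorted.bam
samtools index calls.sorted.bam

# Route that triggered the downstream segfault: emit fastq from dorado, then map separately.
dorado basecaller sup pod5_dir/ --emit-fastq > calls.fastq
minimap2 -ax map-ont ref.fa calls.fastq | samtools sort -o mapped.sorted.bam
samtools index mapped.sorted.bam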

Hope this info is helpful!

MBoemo commented 1 month ago

Thanks very much @dkastl1, that's very helpful. I have a block of time tomorrow where I should be able to sort it out.

MBoemo commented 1 month ago

@rstraver thanks again for sending this read over - I've gotten to the bottom of this. Your read has a pi tag (split read, as you say) but no corresponding sp tag. That's a bit odd, since from here if you have the former then you should also have the latter. It's easy to add some handling to pass over this, which I'll do this afternoon, but I'd like to dig a little deeper to see whether this is some sort of Dorado issue that may have been resolved in an update. Which version of Dorado are you using?
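
For anyone who wants to check their own bam for this case, a quick sketch (this assumes the usual Dorado tag conventions of pi:Z: for the parent read id and sp:i: for the split point; calls.bam is a placeholder):

# Count primary alignments that carry a pi tag but no sp tag.
samtools view -F 0x900 calls.bam \
  | awk '/\tpi:Z:/ && !/\tsp:i:/ {n++} END {print n+0, "reads with pi but no sp"}'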

MBoemo commented 1 month ago

Both the above issues are fixed now - just testing overnight and then I'll push an update tomorrow and update the image.

rstraver commented 1 month ago

From the pod5 file with the read I provided I could find these version numbers:

...
    sequencer_position_type: PromethION
    software: MinKNOW 24.02.19 (Bream 7.9.8, Core 5.9.12, Dorado 7.3.11+0112dde09)
...
    tracking_id: {'asic_id': '0004A30B010E0174', 'asic_id_eeprom': '0004A30B010E0174', 'asic_temp': '33.449776', 'asic_version': 'Unknown', 'configuration_version': '5.9.18', 'data_source': 'real_device', 'device_id': '1B', 'device_type': 'promethion', 'distribution_status': 'stable', 'distribution_version': '24.02.19', 'exp_script_name': 'sequencing/sequencing_PRO114_DNA_e8_2_400K:FLO-PRO114M:SQK-NBD114-24:400', 'exp_script_purpose': 'sequencing_run', 'exp_start_time': '2024-08-26T16:24:31.622313+02:00', 'flow_cell_id': 'PAW50812', 'flow_cell_product_code': 'FLO-PRO114M', 'guppy_version': '7.3.11+0112dde09', 'heatsink_temp': '34.055954', 'host_product_code': 'PRO-PRCA100', 'host_product_serial_number': 'PCA100325', 'hostname': 'PCA100325', 'hublett_board_id': '01307b33778ba51b', 'hublett_firmware_version': '2.1.10', 'installation_type': 'nc', 'is_simulated': '0', 'operating_system': 'ubuntu 20.04', 'protocol_group_id': '240826_ligation_BC_P3471', 'protocol_run_id': '7e98cdf1-60ad-4e8e-9ec7-cd9edc26462e', 'protocol_start_time': '2024-08-26T16:22:47.707506+02:00', 'protocols_version': '7.9.8', 'run_id': '143d496a299c80071a8d461f855ce5b5abfd0c41', 'sample_id': 'I24-1282-01_I24-1282-02', 'satellite_board_id': '013c617870d9d33d', 'satellite_firmware_version': '2.3.0', 'sequencer_hardware_revision': '', 'sequencer_product_code': 'PRO-SEQ024', 'sequencer_serial_number': '', 'usb_config': 'fx3_0.0.0#fpga_0.0.0#unknown#unknown', 'version': '5.9.12'}
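
(For reference, this run metadata can typically be dumped with the pod5 command-line tools; the line below is an assumption about the exact subcommand, which varies between pod5 versions, and the file name and read ID are placeholders.)

# Print the read's details, including run_info and tracking_id, from the pod5 file.
pod5 inspect read segfault_read.pod5 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx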

Thanks for looking into it, hope this is permanently resolved now.

MBoemo commented 1 month ago

Should be fixed now. If compiling from source, pull from master and recompile. If using the singularity image, a fixed one has been uploaded.
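
For anyone updating an existing install, a rough sketch (this assumes a checkout that was cloned with --recursive and builds with make, as in the project documentation):

# Update the source tree and rebuild the DNAscent binary.
cd DNAscent
git pull
make clean
make

# If using the container instead, re-pull the updated singularity image from the
# same location the original image came from.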

dkastl1 commented 1 month ago

Thanks so much! I just tried running our processed bam file; it worked initially but crashes once it reaches the 28th read.

DNAscent-4.0.3: Relink `/opt/DNAscent-4.0.3/tensorflow/lib/libtensorflow_framework.so.2' with `/lib/x86_64-linux-gnu/libz.so.1' for IFUNC symbol `crc32_z'
2024-09-26 12:22:26.202313: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Loading DNAscent index... ok.
2024-09-26 12:22:28.274198: I tensorflow/cc/saved_model/reader.cc:32] Reading SavedModel from: /opt/DNAscent-4.0.3/dnn_models/detect_model_BrdUEdU_DNAr10_4_1/
2024-09-26 12:22:28.387851: I tensorflow/cc/saved_model/reader.cc:55] Reading meta graph with tags { serve }
2024-09-26 12:22:28.387890: I tensorflow/cc/saved_model/reader.cc:93] Reading SavedModel debug info (if present) from: /opt/DNAscent-4.0.3/dnn_models/detect_model_BrdUEdU_DNAr10_4_1/
2024-09-26 12:22:28.387935: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-26 12:22:28.388102: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2024-09-26 12:22:28.388956: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2024-09-26 12:22:28.410975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:65:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 871.81GiB/s
2024-09-26 12:22:28.411019: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2024-09-26 12:22:28.418642: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2024-09-26 12:22:28.418692: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2024-09-26 12:22:28.451996: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2024-09-26 12:22:28.452289: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2024-09-26 12:22:28.452935: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2024-09-26 12:22:28.455148: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2024-09-26 12:22:28.455288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2024-09-26 12:22:28.455564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2024-09-26 12:22:28.455583: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2024-09-26 12:22:28.954799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-09-26 12:22:28.954828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2024-09-26 12:22:28.954833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2024-09-26 12:22:28.955265: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21391 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:65:00.0, compute capability: 8.6)
2024-09-26 12:22:29.312395: I tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2024-09-26 12:22:29.373486: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2400000000 Hz
2024-09-26 12:22:29.847384: I tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /opt/DNAscent-4.0.3/dnn_models/detect_model_BrdUEdU_DNAr10_4_1/
2024-09-26 12:22:30.130902: I tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 1856705 microseconds.
Importing reference... ok.
Opening bam file... ok.
Scanning bam file...ok.
2024-09-26 12:22:35.870072: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2024-09-26 12:22:36.766506: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2024-09-26 12:22:36.811358: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2024-09-26 12:22:36.843985: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2024-09-26 12:22:39.424837: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
DNAscent-4.0.3: src/event_handling.cpp:547: void normaliseEvents(DNAscent::read&, bool): Assertion `et.n > 0' failed.
Aborted (core dumped)

MBoemo commented 1 month ago

@dkastl1 very sorry this is still an issue. If possible, are you able to send me the read that's causing the problem?

Also, are you still basecalling with Dorado to fastq, aligning, and then running DNAscent detect? If so, DNAscent definitely shouldn't crash (and we want to fix it if it does), but in terms of performance we would strongly recommend basecalling directly to bam unless there's a very good reason not to.