WGLab / LinkedSV

MIT License

Targeted mode error: TypeError: a bytes-like object is required, not 'str' (cluster_reads.py) #19

Closed jthmiller closed 3 years ago

jthmiller commented 3 years ago

Hello, I am running into an error in targeted mode that I can't seem to get around:

python3 ./linkedsv.py -i $bam -d $out/$SAMPLE -r $refs/genome.fa -v hg19 -t 10 --somatic_mode --targeted --target_region $arBed

[04/19/2021 16:22:09 (123.757 MB)] sorting bam file by barcode
[04/19/2021 16:22:09 (123.757 MB)] running command: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/output_bam_coreinfo /scratch.global/jtmiller/target10x_results/results/results_longranger/LUCaP105P47/outs/phased_possorted_bam.bam | samtools sort -l 1 -m 1G -@ 10 -t BX -o /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.sortbx.bam -
[bam_sort_core] merging from 0 files and 10 in-memory blocks...
[04/19/2021 16:23:48 (123.818 MB)] extracting barcode info from bam file
[04/19/2021 16:23:48 (123.818 MB)] running command: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/extract_barcode_info /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.sortbx.bam STDOUT /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.barcode_statistics | /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/pigz --fast --processes 9 - > /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.bcd21.gz
[04/19/2021 16:24:42 (123.818 MB)] extracting low mapq bcd21
[04/19/2021 16:26:01 (124.330 MB)] clustering reads
[04/19/2021 16:26:01 (124.330 MB)] /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/cluster_reads /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.bcd21.gz /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.bcd22 /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.weird_reads.txt 0 -1 20 10
started round 1 clustering
finished round 1 clustering
time used is 34.79 seconds
mean and sd of inner size: 131, 156
quantiles of inner size: 0.1% => -111.0000 1% => -73.0000 5% => -11.0000 10% => 18.0000 25% => 66.0000 50% => 131.0000 75% => 227.0000 90% => 345.0000 95% => 426.0000 99% => 592.0000 99.9% => 1846.0000
quantiles of gap distance: 0.1% => 1.0000 1% => 2.0000 5% => 9.0000 10% => 18.0000 25% => 50.0000 50% => 140.0000 75% => 452.0000 90% => 1249.0000 95% => 2196.0000 99% => 7643.0000 99.9% => 24073.0000
inner size cut-off = 592
gap distance cut-off = 7643
started round 2 clustering
finished round 2 clustering
time used is 37.36 seconds
total number of reads is: 18237014
total number of weird reads is: 496409
[04/19/2021 16:27:16 (124.330 MB)] searching for extremely high coverage region
[04/19/2021 16:27:48 (126.538 MB)] calculating distribution parameters
[04/19/2021 16:27:48 (126.538 MB)] total number of reads in the genome is: 19425963
[04/19/2021 16:27:48 (126.575 MB)] running command: bedtools intersect -u -a /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.bcd21.gz -b /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/AR_all.sorted.merged.bed.tidbed | gzip --fast - > /scratch.global/jtmiller/target10x_results/results/results_linkedSV/LUCaP105P47/phased_possorted_bam.bam.on_target.bcd21.gz
Traceback (most recent call last):
  File "./linkedsv.py", line 313, in <module>
    main()
  File "./linkedsv.py", line 47, in main
    detect_increased_fragment_ends(args, dbo_args, endpoint_args)
  File "./linkedsv.py", line 208, in detect_increased_fragment_ends
    global_distribution.estimate_global_distribution(args, dbo_args, endpoint_args, endpoint_args.bcd22_file)
  File "/panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/global_distribution.py", line 128, in estimate_global_distribution
    args.num_reads_ontarget = cluster_reads.calculate_num_reads_from_bcd21_file(endpoint_args.bcd21_file_of_target_region, args.min_mapq)
  File "/panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/cluster_reads.py", line 307, in calculate_num_reads_from_bcd21_file
    line = line.strip().split(tab)
TypeError: a bytes-like object is required, not 'str'

Thanks for your work!

fangli80 commented 3 years ago

Could you please try running it with Python 2 as a temporary workaround? I have not tested it with Python 3 yet.
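
For readers hitting the same message: this TypeError is the usual Python 3 symptom of reading a file in binary mode (which yields bytes) and then splitting on a str separator. The sketch below is only an illustration under that assumption; the actual code in cluster_reads.py is not shown in this thread, and the names count_records_text_mode and demo, as well as the sample records, are hypothetical.

```python
import gzip
import os
import tempfile

TAB = "\t"  # a str separator, like the `tab` variable named in the traceback


def count_records_text_mode(path):
    """Count non-empty tab-delimited records in a gzip file opened as text."""
    n = 0
    # Mode "rt" makes gzip.open yield str lines in Python 3, so splitting
    # on a str separator works.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.strip().split(TAB)
            if fields != [""]:
                n += 1
    return n


def demo():
    """Reproduce the bytes/str mismatch, then count records in text mode."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        # Hypothetical two-record file; real bcd21 files have more columns.
        with gzip.open(path, "wt") as out:
            out.write("chr1\t100\t200\n")
            out.write("chr1\t300\t400\n")

        # Binary mode yields bytes; bytes.split() rejects a str separator.
        with gzip.open(path, "rb") as handle:
            first = handle.readline()
        try:
            first.strip().split(TAB)
            reproduced = False
        except TypeError:
            reproduced = True

        return reproduced, count_records_text_mode(path)
    finally:
        os.remove(path)
```

Opening in text mode is one option; decoding each line (for example with line.decode()) or using a bytes separator would be alternatives. Which of these, if any, the project adopted is not stated here.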

jthmiller commented 3 years ago

python2 -V
Python 2.7.15 :: Anaconda, Inc.

python2 ./linkedsv.py -i $bam -d $out/$SAMPLE -r $refs/genome.fa -v hg19 -t 10 --somatic_mode --targeted --target_region $arBed

[04/21/2021 07:38:05 (123.339 MB)] sorting bam file by barcode
[04/21/2021 07:38:05 (123.339 MB)] running command: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/output_bam_coreinfo /scratch.global/jtmiller/target10x_results/results/results_longranger/LNCaP863P16S33P47/outs/phased_possorted_bam.bam | samtools sort -l 1 -m 1G -@ 10 -t BX -o /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.sortbx.bam -
[bam_sort_core] merging from 10 files and 10 in-memory blocks...
[04/21/2021 07:41:39 (123.339 MB)] extracting barcode info from bam file
[04/21/2021 07:41:39 (123.339 MB)] running command: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/extract_barcode_info /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.sortbx.bam STDOUT /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.barcode_statistics | /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/pigz --fast --processes 9 - > /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd21.gz
[04/21/2021 07:43:09 (123.339 MB)] extracting low mapq bcd21
[04/21/2021 07:46:56 (123.785 MB)] clustering reads
[04/21/2021 07:46:56 (123.785 MB)] /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/cluster_reads /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd21.gz /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd22 /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.weird_reads.txt 0 -1 20 10
started round 1 clustering
finished round 1 clustering
time used is 56.00 seconds
mean and sd of inner size: 102, 129
quantiles of inner size: 0.1% => -111.0000 1% => -100.0000 5% => -65.0000 10% => -35.0000 25% => 27.0000 50% => 102.0000 75% => 187.0000 90% => 272.0000 95% => 333.0000 99% => 522.0000 99.9% => 1668.0000
quantiles of gap distance: 0.1% => 1.0000 1% => 3.0000 5% => 11.0000 10% => 22.0000 25% => 64.0000 50% => 202.0000 75% => 747.0000 90% => 2415.0000 95% => 5844.0000 99% => 16159.0000 99.9% => 34787.0000
inner size cut-off = 522
gap distance cut-off = 16159
started round 2 clustering
finished round 2 clustering
time used is 67.53 seconds
total number of reads is: 31519602
total number of weird reads is: 680877
[04/21/2021 07:49:03 (123.785 MB)] searching for extremely high coverage region
[04/21/2021 07:50:57 (128.831 MB)] calculating distribution parameters
[04/21/2021 07:50:57 (128.831 MB)] total number of reads in the genome is: 32573971
[04/21/2021 07:50:57 (129.024 MB)] running command: bedtools intersect -u -a /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd21.gz -b /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/AR_all.sorted.merged.bed.tidbed | gzip --fast - > /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.on_target.bcd21.gz
[04/21/2021 07:52:30 (129.024 MB)] calculating fragment parameters from file: /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd22
[04/21/2021 07:52:32 (129.032 MB)] N95_fragment_length is: 5232
[04/21/2021 07:52:39 (190.058 MB)] finished getting fragment parameters
[04/21/2021 07:52:39 (189.280 MB)] searching for paired breakpoints
[04/21/2021 07:52:39 (189.280 MB)] searching paired breakpoints
[04/21/2021 07:52:39 (189.280 MB)] building nodes from fragments
[04/21/2021 07:52:39 (189.280 MB)] reading bcd22 file:/scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd22
[04/21/2021 07:52:40 (259.056 MB)] total number of fragments: 102765
[04/21/2021 07:52:40 (259.682 MB)] writing to node file
[04/21/2021 07:52:41 (167.055 MB)] removing sparse nodes, min_support_fragments is 10
[04/21/2021 07:52:41 (167.055 MB)] Running CMD: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/remove_sparse_nodes /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node33 /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node33.candidates 25000 /home/dehms/jtmiller/targeted_10x_AR_2.0/references/genome.fa.fai 10
[04/21/2021 07:53:24 (167.055 MB)] Running CMD: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/remove_sparse_nodes /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node55 /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node55.candidates 25000 /home/dehms/jtmiller/targeted_10x_AR_2.0/references/genome.fa.fai 10
[04/21/2021 07:54:06 (167.055 MB)] Running CMD: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/remove_sparse_nodes /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node35 /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node35.candidates 25000 /home/dehms/jtmiller/targeted_10x_AR_2.0/references/genome.fa.fai 10
[04/21/2021 07:54:48 (167.055 MB)] Running CMD: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/remove_sparse_nodes /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node53 /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node53.candidates 25000 /home/dehms/jtmiller/targeted_10x_AR_2.0/references/genome.fa.fai 10
[04/21/2021 07:55:31 (167.055 MB)] clustering nodes, max distance for connecting two nodes is: 25000
[04/21/2021 07:55:31 (167.055 MB)] min support fragment pairs is: 10
[04/21/2021 07:55:31 (167.055 MB)] reading black region bed file
[04/21/2021 07:55:31 (167.055 MB)] reading node candidate file:/scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node33.candidates
[04/21/2021 07:55:31 (167.055 MB)] number of nodes in blacklist region: 0
[04/21/2021 07:55:31 (167.055 MB)] number of nodes in node candidate file: 0
[04/21/2021 07:55:31 (167.055 MB)] min support fragment pairs is: 10
[04/21/2021 07:55:31 (167.055 MB)] reading black region bed file
[04/21/2021 07:55:31 (167.055 MB)] reading node candidate file:/scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node55.candidates
[04/21/2021 07:55:31 (167.055 MB)] number of nodes in blacklist region: 0
[04/21/2021 07:55:31 (167.055 MB)] number of nodes in node candidate file: 0
[04/21/2021 07:55:31 (167.055 MB)] min support fragment pairs is: 10
[04/21/2021 07:55:31 (167.055 MB)] reading black region bed file
[04/21/2021 07:55:31 (167.055 MB)] reading node candidate file:/scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node53.candidates
[04/21/2021 07:55:31 (167.055 MB)] number of nodes in blacklist region: 0
[04/21/2021 07:55:31 (167.055 MB)] number of nodes in node candidate file: 0
[04/21/2021 07:55:31 (167.055 MB)] min support fragment pairs is: 10
[04/21/2021 07:55:31 (167.055 MB)] reading black region bed file
[04/21/2021 07:55:31 (167.055 MB)] reading node candidate file:/scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.node35.candidates
[04/21/2021 07:55:31 (167.055 MB)] number of nodes in blacklist region: 0
[04/21/2021 07:55:31 (167.055 MB)] number of nodes in node candidate file: 0
[04/21/2021 07:55:33 (167.055 MB)] number of candidate fragments: 0
[04/21/2021 07:55:34 (167.055 MB)] number of candidate fragments: 0
[04/21/2021 07:55:36 (167.055 MB)] number of candidate fragments: 0
[04/21/2021 07:55:37 (167.055 MB)] number of candidate fragments: 0
[04/21/2021 07:55:37 (167.055 MB)] calculating read depth
[04/21/2021 07:55:37 (167.055 MB)] running command: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/cal_read_depth_from_bcd21 /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd21.gz /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.read_depth.txt /home/dehms/jtmiller/targeted_10x_AR_2.0/references/genome.fa.fai 100 20
[04/21/2021 07:56:49 (167.055 MB)] finished calculating read depth
[04/21/2021 07:56:49 (167.055 MB)] counting overlapping barcodes between twin windows
[04/21/2021 07:56:49 (167.055 MB)] running command: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/cal_twin_win_bcd_cnt /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd21.gz /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd11 /home/dehms/jtmiller/targeted_10x_AR_2.0/references/genome.fa.fai 100 40000 20
processed 1000000 barcodes
[04/21/2021 08:02:30 (167.055 MB)] finished counting overlapping barcodes between twin windows
[04/21/2021 08:02:30 (167.055 MB)] calculating centroid
[04/21/2021 08:02:30 (167.055 MB)] running command: /panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/../bin/cal_centroid_from_read_depth /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.read_depth.txt /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd11 /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd12 /home/dehms/jtmiller/targeted_10x_AR_2.0/references/genome.fa.fai
bin_size = 100
reading read depth file: /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.read_depth.txt
finished reading read depth file: /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.read_depth.txt
processed 10000000 windows
processed 20000000 windows
processed 30000000 windows
[04/21/2021 08:12:24 (167.055 MB)] finished calculating centroid
[04/21/2021 08:12:24 (167.055 MB)] calculating barcode similarity and p-value
[04/21/2021 08:12:24 (167.055 MB)] calculating barcode similarity and p-value
[04/21/2021 08:12:24 (167.055 MB)] reading bcd12 file: /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd12
[04/21/2021 08:13:51 (2.834 GB)] finished reading bcd12 files
[04/21/2021 08:14:16 (2.834 GB)] m1_mean, m1_std_r, m1_std_l: 92.000000, 118.000000, 53.000000
[04/21/2021 08:14:16 (2.834 GB)] m2_mean, m2_std_r, m2_std_l: 92.000000, 118.000000, 53.000000
[04/21/2021 08:14:16 (2.834 GB)] min_m1_value, min_m2_value, max_m1_value, max_m2_value: 31, 31, 276, 276
[04/21/2021 08:14:18 (2.167 GB)] reading bcd12 file: /scratch.global/jtmiller/target10x_results/results/results_linkedSV/targeted/LNCaP863P16S33P47/phased_possorted_bam.bam.bcd12
[04/21/2021 08:15:47 (2.651 GB)] finished reading bcd12 files
[04/21/2021 08:15:47 (2.651 GB)] number of windows passed filtering: 15021643 (47.88 %)
[04/21/2021 08:16:01 (3.132 GB)] fitting model
[04/21/2021 08:16:06 (3.759 GB)] finished fitting model
[04/21/2021 08:16:06 (3.759 GB)] finished fitting model
[04/21/2021 08:16:45 (4.255 GB)] Y_mean, Y_std_r, Y_std_l: -0.035697, 0.954604, 0.860889
[04/21/2021 08:16:49 (3.771 GB)] calculating expected overlap barcode count
Traceback (most recent call last):
  File "./linkedsv.py", line 313, in <module>
    main()
  File "./linkedsv.py", line 50, in main
    detect_decreased_barcode_overlap(args, dbo_args, endpoint_args)
  File "./linkedsv.py", line 164, in detect_decreased_barcode_overlap
    cal_expected_overlap_value.cal_expected_overlap_bcd_cnt(dbo_args.bcd12_file, dbo_args.bcd13_file, is_wgs)
  File "/panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/cal_expected_overlap_value.py", line 124, in cal_expected_overlap_bcd_cnt
    predict_overlap_bcd_cnt(bcd12_file, out_file, a, b, alpha, logn, min_m1_value, min_m2_value, max_m1_value, max_m2_value, Y_percentile_array, bk_list, c, Y_mean, Y_std_r, Y_std_l, n_window_pairs)
  File "/panfs/roc/groups/0/lmnp/jtmiller/programs/LinkedSV/scripts/cal_expected_overlap_value.py", line 172, in predict_overlap_bcd_cnt
    Y = math.log(predicted_n_ovl_bcd / float(m0), 2)
ValueError: math domain error
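
A note on the second error: math.log raises "ValueError: math domain error" whenever its argument is zero or negative, so the failing line implies that predicted_n_ovl_bcd / float(m0) was not positive for at least one window pair. The sketch below only illustrates that behaviour and one possible guard; the helper names raises_domain_error and safe_log2_ratio are hypothetical, and returning None for such inputs is an assumption, not necessarily how LinkedSV handles it.

```python
import math


def raises_domain_error(x):
    """Return True if math.log(x, 2) raises ValueError for this input."""
    try:
        math.log(x, 2)
    except ValueError:
        return True
    return False


def safe_log2_ratio(predicted, m0):
    """Return log2(predicted / m0), or None when the ratio is not positive.

    Hypothetical guard: the caller decides what to do with None
    (skip the window pair, substitute a floor value, and so on).
    """
    if m0 <= 0 or predicted <= 0:
        return None
    return math.log(predicted / float(m0), 2)
```

With a guard like this, a zero or negative predicted count yields None instead of aborting the run, at the cost of the caller having to handle the missing value explicitly.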

jthmiller commented 3 years ago

Should I close this issue and open another under python2 with the math domain error? Thanks,

fangli80 commented 3 years ago

No. I will check it today. Thanks!

jthmiller commented 3 years ago

Hello, I'd like to follow up on this. We've been able to get LinkedSV to run well on our WGS data, but not yet on our targeted sequencing data. We want to use the same caller on both datasets so that we can compare them. If it would help, I can provide a sample where the error occurs. Otherwise, if the error has been resolved, I'll rerun LinkedSV. Thanks,

fangli80 commented 3 years ago

Yes. Please provide a sample. You can email me the link to download. Thanks!

fangli80 commented 3 years ago

I think the bug is fixed. Please clone the latest version.

jthmiller commented 3 years ago

Thanks! LinkedSV ran to completion in targeted mode. I'll confirm output soon and follow up if needed. Thanks again.