Closed abhijitcbio closed 1 year ago
Thank you! As mentioned in the README, this file is generated based on examining the SNParray-based copy number calls for TCGA tumors and deciding on the probable boundaries of telomere and centromere for each chromosome arm. The underlying logic is that we expect higher breakpoint density in telomeric and centromeric regions and we used that to determine approximate start and end of p
and q
arms of each chromosome.
If you are trying to make another version of this file, I believe chromosome_info
, p_start
, p_end
, q_start
, and q_end
are the columns needed for main calculations. The remaining columns that are used for making some plots after the main peak finding calculations are q_start1
, and q_end1
, offset
, and middle_of_arm
. They are basically converting the coordinates as if chromosome are stacked one after the other from chromosome 1 to 22 (chromosome 2 starts where chromosome 1 ends etc.). I believe the rest of columns are not used at all anywhere in the code.
And for chromosome X, it's not looked at by the method. Sorry for the confusion, I should remove that. And the reason is that getting good copy number calls for chromosome X is difficult due to sex differences (and probably other reasons) and that can interfere with the analysis on the rest of the chromosomes when you build the background distributions. If you believe you have good copy number data for chromosome X, you need to modify the code so it includes chromosome X segments in the analysis.
Hi, Thanks for your quick reply. I have created the version you suggested
chromosome_info p_start p_end q_start q_end
1 0 121700000 125100000 248956422
10 0 38000000 41600000 133797422
11 0 51000000 55800000 135086622
12 0 33200000 37800000 133275309
13 0 16500000 18900000 114364328
14 0 16100000 18200000 107043718
15 0 17500000 20500000 101991189
16 0 35300000 38400000 90338345
17 0 22700000 27400000 83257441
and successfully ran the python3.10 BISCUT_preprocessing.py
part. But, I am getting the following error while running the Rscript BISCUT_peak_finding.R
code.
[1] 8
[1][1] "2p Cancer amp tel"
"3p Cancer amp tel"
[1] "1p Cancer amp tel"
[1][1] "4p Cancer amp tel"
"5p Cancer amp tel"
[1] "7p Cancer amp tel"
[1] "6p Cancer amp tel"
[1] "9p Cancer amp tel"
[1] "10p Cancer amp tel"
[1] "11p Cancer amp tel"
[1] "8p Cancer amp tel"
[1] "16p Cancer amp cent"
[1] "12p Cancer amp tel"
[1] "18p Cancer amp tel"
[1] "16p Cancer amp tel"
[1] "17p Cancer amp tel"
[1] "13 Cancer amp cent"
[1] "14 Cancer amp cent"
[1] "20p Cancer amp tel"
[1] "19p Cancer amp tel"
[1] "15 Cancer amp cent"
[1] "one-sample ks p-value is 0.717160548429896, ks stat is 0.3"
[1] "1q Cancer amp cent"
[1] "one-sample ks p-value is 0.876160990712074, ks stat is 0.266666666666667"
[1] "21 Cancer amp cent"
[1] "22 Cancer amp cent"
[1] "3q Cancer amp cent"
[1] "4q Cancer amp cent"
[1] "2q Cancer amp cent"
[1] "1q Cancer amp tel"
[1] "21 Cancer amp tel"
[1][1] "11q Cancer amp cent"
"12q Cancer amp cent"
[1] "one-sample ks p-value is 0.549508366396408, ks stat is 0.383333333333333"
[1] "9q Cancer amp cent"
[1] "8q Cancer amp cent"
[1] "10q Cancer amp cent"
[1] "12q Cancer amp tel"
[1] "11q Cancer amp tel"
[1] "8q Cancer amp tel"
[1] "10q Cancer amp tel"
[1] "20q Cancer amp cent"
[1] "19q Cancer amp cent"
[1] "19q Cancer amp tel"
[1] "5q Cancer amp cent"
[1] "6q Cancer amp cent"
[1] "16q Cancer amp cent"
[1] "17q Cancer amp cent"
[1] "7q Cancer amp cent"
[1] "16q Cancer amp tel"
[1] "18q Cancer amp cent"
[1] "18q Cancer amp tel"
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
...
[[38]]
NULL
[[39]]
NULL
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
ValueError: No objects to concatenate
Run `reticulate::py_last_error()` for details.
Calls: all_processing -> py_call_impl
Execution halted
Any guess why it broke? Thanks for your help.
Do you get error when you use the version of chromosome coordinates provided by us, or only when you use what you created?
My data is mapped on hg38, so can't run with the coordinates provided by you. This was generated by me using the hg38 Cytoband file https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz
It's okay, you can still use our coordinates. Again as explained in the README, it's less important which reference genome you are using than a conservative approach to define telomeric and centromeric region. For example in your file there is no telomere so all short events in the chromosome arm ends are included in the calculation and will result in false signal. That's why we have chosen those boundaries from inspecting copy number segments first to make sure those regions are masked. And, there shouldn't be significantly large differences between hg19 and hg38 coordinates anyway so I would still suggest using our coordinates. When I ran BISCUT on hg38-based data, I only modified the gene list file hg19_genes_refseq_telcentfiltered_020518.txt
to one based on hg38. This file is basically used to generate the list of genes that fall within the identified selection peaks for the output and is not a fundamental part of the algorithm as oppose to coordinates that can change the breakpoints and the results.
The reason I asked to use our coordinates is to make sure the error is not from that because two other users reported similar errors and I need to understand the reason to be able to better help you. Also, is the coordinate file you are using complete? The part you pasted above doesn't have all chromosomes from 1 to 22.
So, as you suggested I have kept the coordinate file the same and used the hg38 gene list, this the peak finding code -
thedate = '2023_08_04'
tumor_type <- 'Cancer'
ci <- 0.95 # confidence interval threshold around peaks, larger -> wider peaks
qval_thres <- 0.05 # q-value threshold for calling peaks after correcting for multiple hypothesis testing
telcent_thres <- 10^-3 # filtering events too close to tel/cent or whole arm
USE_BACKGROUND <- FALSE # Whether to use background distribution from TCGA samples, or build one from the current data
n <- 1000 # number of samples for ci distribution
bgdist <- 'beta'
n_cores <- 8 # numbers of CPU cores for parallelization over chromosome arms
set.seed(123456789) #random seed
genelocs_file <- 'docs/hg38_genes_refseq_telcentfiltered.txt'
genelocs <- read.csv(genelocs_file,sep='\t',header=T)
abslocs_file <- 'docs/SNP6_hg19_chromosome_locs_200605.txt'
abslocs <- read.table(abslocs_file,sep='\t',header=T)
tel_background_file <- 'docs/Cancer_tels.txt'
cent_background_file <- 'docs/Cancer_cents.txt'
But I am still getting this error -
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
ValueError: No objects to concatenate
Run `reticulate::py_last_error()` for details.
Calls: all_processing -> py_call_impl
Execution halted
Also this is the full hg38 coordinate file I was using previously -
chromosome_info p_start p_end q_start q_end
1 0 121700000 125100000 248956422
10 0 38000000 41600000 133797422
11 0 51000000 55800000 135086622
12 0 33200000 37800000 133275309
13 0 16500000 18900000 114364328
14 0 16100000 18200000 107043718
15 0 17500000 20500000 101991189
16 0 35300000 38400000 90338345
17 0 22700000 27400000 83257441
18 0 15400000 21500000 80373285
19 0 24200000 28100000 58617616
2 0 91800000 96000000 242193529
20 0 25700000 30400000 64444167
21 0 10900000 13000000 46709983
22 0 13700000 17400000 50818468
3 0 87800000 94000000 198295559
4 0 48200000 51800000 190214555
5 0 46100000 51400000 181538259
6 0 58500000 62600000 170805979
7 0 58100000 62100000 159345973
8 0 43200000 47200000 145138636
9 0 42200000 45500000 138394717
So you get the error when running all_processing()
function which tries to read files and concatenate them. Do you have any .txt
file immediately under your tumor_type
directory which seems to be named Cancer
?
Also, what is the header row of your gene list? They should be Chr Start End Cytoband Gene RefSeqName
So this is what I have under the results folder
results_2023_08_04_0.95/
├── Cancer
│ ├── Cancer_21_amp_cent_p_0.95_iter1.pdf
│ └── summary
├── KS_pvalues_2023_08_04_0.95.txt
└── stats
├── Cancer_21_amp_cent_p_1.rds
└── summary
So, no .txt file under Cancer, and my gene list looks like this
Chr Start End Cytoband Gene RefSeqName
1 65419 71585 p36.33 OR4F5 ENSG00000186092.6
1 450703 451697 p36.33 OR4F29 ENSG00000284733.1
1 685679 686673 p36.33 OR4F16 ENSG00000284662.1
I have used the Ensembl ids instead of ref-seq ones.
If the peak finding step works you should get many .txt
files which summarize the identified peaks. It could be there are errors that are caught by tryCatch
. Can you change the line 639 from:
bar <- tryCatch(do_arm_gistic(a,tumor_type,d,tc,mode='overlap'),error=errorfunc)
to:
bar <- do_arm_gistic(a,tumor_type,d,tc,mode='overlap')
and see if you get any error when running the for loop at line 635? If not, try to do the same with lines 616 and 617 and run it again. This might help to see where the error is.
So, I changed the line 639 as you suggested and got this error -
[1] 8
[1][1][1] "13 Cancer amp cent"
"2q Cancer amp cent" "1q Cancer amp cent"
[1][1][1][1][1] "21 Cancer amp cent" "15 Cancer amp cent" "22 Cancer amp cent" "3q Cancer amp cent"
"14 Cancer amp cent"
[1] "one-sample ks p-value is 0.00466200466200217, ks stat is 0.833333333333333"
[1][1][1][1] "4q Cancer amp cent" "5q Cancer amp cent"[1] "6q Cancer amp cent"
"8q Cancer amp cent"
"9q Cancer amp cent"
[1] "11q Cancer amp cent"
[1] "10q Cancer amp cent"
[1][1][1] "19q Cancer amp cent" "12q Cancer amp cent" "17q Cancer amp cent"
[1] "16q Cancer amp cent"
[1] "20q Cancer amp cent"
[1] "original peak 0.970035372074872"
[1] "7q Cancer amp cent"
[1] "18q Cancer amp cent"
Error in { : task 1 failed - "$ operator is invalid for atomic vectors"
Calls: %dopar% -> <Anonymous>
Execution halted
I think the parallel for loop is masking the informative error message. Can you change line 635 from:
foreach(i=1:length(all_arms)) %dopar% {
to:
for(i in 1:length(all_arms)) {
and re-run it?
It gives you the following error
Error in tabprobes$Var1 : $ operator is invalid for atomic vectors
Calls: do_arm_gistic
In addition: Warning messages:
1: In ifelse(is.na(as.integer(arm)), as.integer(substr(arm, 1, nchar(arm) - :
NAs introduced by coercion
2: In ifelse(is.na(as.integer(arm)), substr(arm, nchar(arm), nchar(arm) + :
NAs introduced by coercion
Execution halted
I tried to print this -
tabprobes <- tablify(probes,chromosome,pq,telcent,F)
print (tabprobes)
It gives you the following result
integer(0)
Okay thank you for trying. I don't think the error we are looking for is the one above. I need to take a closer look at this, do you want to meet over zoom?
Here is my email abhijit@lji.org, we can talk more about that.
Hi Abhijit,
I think I found the problem, it's so silly! It's the print statement at line 594 print(left_boundary, right_boundary)
which for some reason gives error in the latest version of R! I had to upgrade to R 4.3.1 to be able to reproduce the error, it was fine with the R version I had tested BISCUT with and recommended to use. I just made a quick fix by commenting that line out. Please pull and let me know if the error is resolved on TCGA data. If it worked on TCGA data but failed on your data, you can open another issue. Thank you for your help and patience.
Hi,
Thanks for developing this tool. How did you generate the SNP6_hg19_chromosome_locs_200605.txt file? Is there any script? Could you please explain the columns of the file? especially for chrX with so few columns?