beroukhim-lab / BISCUT-py3

9 stars 5 forks source link

How to create chromosome_coordinates file #2

Closed abhijitcbio closed 1 year ago

abhijitcbio commented 1 year ago

Hi,

Thanks for developing this tool. How did you generate the SNP6_hg19_chromosome_locs_200605.txt file? Is there any script? Could you please explain the columns of the file? especially for chrX with so few columns?

chromosome_info size    q_start1        q_end1  offset  centromere      cent_mod        p_start p_end   q_start q_end   middle_of_arm
1       249250621       125000000       249250621       0       125000000       0       3218610 120527361       149879545       247813706       124625310.5
2       243199373       342550621       492449994       249250621       93300000        0       484222  89130753        95341700        242476062       370850307.5
3       198022430       583449994       690472424       492449994       91000000        0       2212571 90485962        93734671        197538677       591461209
4       191154276       740872424       881626700       690472424       50400000        0       1053934 48068826        52698757        188763651       786049562
....
23                                                      3157107 58317986        62172719        154905589
shahab-sarmashghi commented 1 year ago

Thank you! As mentioned in the README, this file is generated based on examining the SNParray-based copy number calls for TCGA tumors and deciding on the probable boundaries of telomere and centromere for each chromosome arm. The underlying logic is that we expect higher breakpoint density in telomeric and centromeric regions and we used that to determine approximate start and end of p and q arms of each chromosome.

If you are trying to make another version of this file, I believe chromosome_info, p_start, p_end, q_start, and q_end are the columns needed for main calculations. The remaining columns that are used for making some plots after the main peak finding calculations are q_start1, and q_end1, offset, and middle_of_arm. They are basically converting the coordinates as if chromosome are stacked one after the other from chromosome 1 to 22 (chromosome 2 starts where chromosome 1 ends etc.). I believe the rest of columns are not used at all anywhere in the code.

And for chromosome X, it's not looked at by the method. Sorry for the confusion, I should remove that. And the reason is that getting good copy number calls for chromosome X is difficult due to sex differences (and probably other reasons) and that can interfere with the analysis on the rest of the chromosomes when you build the background distributions. If you believe you have good copy number data for chromosome X, you need to modify the code so it includes chromosome X segments in the analysis.

abhijitcbio commented 1 year ago

Hi, Thanks for your quick reply. I have created the version you suggested

chromosome_info p_start p_end   q_start q_end
1       0       121700000       125100000       248956422
10      0       38000000        41600000        133797422
11      0       51000000        55800000        135086622
12      0       33200000        37800000        133275309
13      0       16500000        18900000        114364328
14      0       16100000        18200000        107043718
15      0       17500000        20500000        101991189
16      0       35300000        38400000        90338345
17      0       22700000        27400000        83257441

and successfully ran the python3.10 BISCUT_preprocessing.py part. But, I am getting the following error while running the Rscript BISCUT_peak_finding.R code.

[1] 8
[1][1] "2p Cancer amp tel"
 "3p Cancer amp tel"
[1] "1p Cancer amp tel"
[1][1] "4p Cancer amp tel"
 "5p Cancer amp tel"
[1] "7p Cancer amp tel"
[1] "6p Cancer amp tel"
[1] "9p Cancer amp tel"
[1] "10p Cancer amp tel"
[1] "11p Cancer amp tel"
[1] "8p Cancer amp tel"
[1] "16p Cancer amp cent"
[1] "12p Cancer amp tel"
[1] "18p Cancer amp tel"
[1] "16p Cancer amp tel"
[1] "17p Cancer amp tel"
[1] "13 Cancer amp cent"
[1] "14 Cancer amp cent"
[1] "20p Cancer amp tel"
[1] "19p Cancer amp tel"
[1] "15 Cancer amp cent"
[1] "one-sample ks p-value is 0.717160548429896, ks stat is 0.3"
[1] "1q Cancer amp cent"
[1] "one-sample ks p-value is 0.876160990712074, ks stat is 0.266666666666667"
[1] "21 Cancer amp cent"
[1] "22 Cancer amp cent"
[1] "3q Cancer amp cent"
[1] "4q Cancer amp cent"
[1] "2q Cancer amp cent"
[1] "1q Cancer amp tel"
[1] "21 Cancer amp tel"
[1][1] "11q Cancer amp cent"
 "12q Cancer amp cent"
[1] "one-sample ks p-value is 0.549508366396408, ks stat is 0.383333333333333"
[1] "9q Cancer amp cent"
[1] "8q Cancer amp cent"
[1] "10q Cancer amp cent"
[1] "12q Cancer amp tel"
[1] "11q Cancer amp tel"
[1] "8q Cancer amp tel"
[1] "10q Cancer amp tel"
[1] "20q Cancer amp cent"
[1] "19q Cancer amp cent"
[1] "19q Cancer amp tel"
[1] "5q Cancer amp cent"
[1] "6q Cancer amp cent"
[1] "16q Cancer amp cent"
[1] "17q Cancer amp cent"
[1] "7q Cancer amp cent"
[1] "16q Cancer amp tel"
[1] "18q Cancer amp cent"
[1] "18q Cancer amp tel"
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL
... 
[[38]]
NULL

[[39]]
NULL

Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
  ValueError: No objects to concatenate
Run `reticulate::py_last_error()` for details.
Calls: all_processing -> py_call_impl
Execution halted

Any guess why it broke? Thanks for your help.

shahab-sarmashghi commented 1 year ago

Do you get error when you use the version of chromosome coordinates provided by us, or only when you use what you created?

abhijitcbio commented 1 year ago

My data is mapped on hg38, so can't run with the coordinates provided by you. This was generated by me using the hg38 Cytoband file https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz

shahab-sarmashghi commented 1 year ago

It's okay, you can still use our coordinates. Again as explained in the README, it's less important which reference genome you are using than a conservative approach to define telomeric and centromeric region. For example in your file there is no telomere so all short events in the chromosome arm ends are included in the calculation and will result in false signal. That's why we have chosen those boundaries from inspecting copy number segments first to make sure those regions are masked. And, there shouldn't be significantly large differences between hg19 and hg38 coordinates anyway so I would still suggest using our coordinates. When I ran BISCUT on hg38-based data, I only modified the gene list file hg19_genes_refseq_telcentfiltered_020518.txt to one based on hg38. This file is basically used to generate the list of genes that fall within the identified selection peaks for the output and is not a fundamental part of the algorithm as oppose to coordinates that can change the breakpoints and the results.

The reason I asked to use our coordinates is to make sure the error is not from that because two other users reported similar errors and I need to understand the reason to be able to better help you. Also, is the coordinate file you are using complete? The part you pasted above doesn't have all chromosomes from 1 to 22.

abhijitcbio commented 1 year ago

So, as you suggested I have kept the coordinate file the same and used the hg38 gene list, this the peak finding code -

thedate = '2023_08_04'
tumor_type <- 'Cancer'
ci <- 0.95 # confidence interval threshold around peaks, larger -> wider peaks
qval_thres <- 0.05 # q-value threshold for calling peaks after correcting for multiple hypothesis testing
telcent_thres <- 10^-3 # filtering events too close to tel/cent or whole arm
USE_BACKGROUND <- FALSE # Whether to use background distribution from TCGA samples, or build one from the current data
n <- 1000 # number of samples for ci distribution
bgdist <- 'beta'
n_cores <- 8 # numbers of CPU cores for parallelization over chromosome arms
set.seed(123456789) #random seed

genelocs_file <- 'docs/hg38_genes_refseq_telcentfiltered.txt'
genelocs <- read.csv(genelocs_file,sep='\t',header=T)
abslocs_file <- 'docs/SNP6_hg19_chromosome_locs_200605.txt'
abslocs <- read.table(abslocs_file,sep='\t',header=T)
tel_background_file <- 'docs/Cancer_tels.txt'
cent_background_file <- 'docs/Cancer_cents.txt'

But I am still getting this error -

Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
  ValueError: No objects to concatenate
Run `reticulate::py_last_error()` for details.
Calls: all_processing -> py_call_impl
Execution halted

Also this is the full hg38 coordinate file I was using previously -

chromosome_info p_start p_end   q_start q_end
1       0       121700000       125100000       248956422
10      0       38000000        41600000        133797422
11      0       51000000        55800000        135086622
12      0       33200000        37800000        133275309
13      0       16500000        18900000        114364328
14      0       16100000        18200000        107043718
15      0       17500000        20500000        101991189
16      0       35300000        38400000        90338345
17      0       22700000        27400000        83257441
18      0       15400000        21500000        80373285
19      0       24200000        28100000        58617616
2       0       91800000        96000000        242193529
20      0       25700000        30400000        64444167
21      0       10900000        13000000        46709983
22      0       13700000        17400000        50818468
3       0       87800000        94000000        198295559
4       0       48200000        51800000        190214555
5       0       46100000        51400000        181538259
6       0       58500000        62600000        170805979
7       0       58100000        62100000        159345973
8       0       43200000        47200000        145138636
9       0       42200000        45500000        138394717
shahab-sarmashghi commented 1 year ago

So you get the error when running all_processing() function which tries to read files and concatenate them. Do you have any .txt file immediately under your tumor_type directory which seems to be named Cancer?

shahab-sarmashghi commented 1 year ago

Also, what is the header row of your gene list? They should be Chr Start End Cytoband Gene RefSeqName

abhijitcbio commented 1 year ago

So this is what I have under the results folder

results_2023_08_04_0.95/
├── Cancer
│   ├── Cancer_21_amp_cent_p_0.95_iter1.pdf
│   └── summary
├── KS_pvalues_2023_08_04_0.95.txt
└── stats
    ├── Cancer_21_amp_cent_p_1.rds
    └── summary

So, no .txt file under Cancer, and my gene list looks like this

Chr     Start   End     Cytoband        Gene    RefSeqName
1       65419   71585   p36.33  OR4F5   ENSG00000186092.6
1       450703  451697  p36.33  OR4F29  ENSG00000284733.1
1       685679  686673  p36.33  OR4F16  ENSG00000284662.1

I have used the Ensembl ids instead of ref-seq ones.

shahab-sarmashghi commented 1 year ago

If the peak finding step works you should get many .txt files which summarize the identified peaks. It could be there are errors that are caught by tryCatch. Can you change the line 639 from: bar <- tryCatch(do_arm_gistic(a,tumor_type,d,tc,mode='overlap'),error=errorfunc) to: bar <- do_arm_gistic(a,tumor_type,d,tc,mode='overlap') and see if you get any error when running the for loop at line 635? If not, try to do the same with lines 616 and 617 and run it again. This might help to see where the error is.

abhijitcbio commented 1 year ago

So, I changed the line 639 as you suggested and got this error -


[1] 8
[1][1][1] "13 Cancer amp cent"
 "2q Cancer amp cent" "1q Cancer amp cent"

[1][1][1][1][1] "21 Cancer amp cent" "15 Cancer amp cent" "22 Cancer amp cent" "3q Cancer amp cent"

 "14 Cancer amp cent"

[1] "one-sample ks p-value is 0.00466200466200217, ks stat is 0.833333333333333"
[1][1][1][1] "4q Cancer amp cent" "5q Cancer amp cent"[1] "6q Cancer amp cent"

 "8q Cancer amp cent"
 "9q Cancer amp cent"

[1] "11q Cancer amp cent"
[1] "10q Cancer amp cent"
[1][1][1] "19q Cancer amp cent" "12q Cancer amp cent" "17q Cancer amp cent"

[1] "16q Cancer amp cent"
[1] "20q Cancer amp cent"
[1] "original peak 0.970035372074872"
[1] "7q Cancer amp cent"
[1] "18q Cancer amp cent"
Error in { : task 1 failed - "$ operator is invalid for atomic vectors"
Calls: %dopar% -> <Anonymous>
Execution halted
shahab-sarmashghi commented 1 year ago

I think the parallel for loop is masking the informative error message. Can you change line 635 from: foreach(i=1:length(all_arms)) %dopar% { to: for(i in 1:length(all_arms)) { and re-run it?

abhijitcbio commented 1 year ago

It gives you the following error

Error in tabprobes$Var1 : $ operator is invalid for atomic vectors
Calls: do_arm_gistic
In addition: Warning messages:
1: In ifelse(is.na(as.integer(arm)), as.integer(substr(arm, 1, nchar(arm) -  :
  NAs introduced by coercion
2: In ifelse(is.na(as.integer(arm)), substr(arm, nchar(arm), nchar(arm) +  :
  NAs introduced by coercion
Execution halted

I tried to print this -

tabprobes <- tablify(probes,chromosome,pq,telcent,F)
print (tabprobes)

It gives you the following result

integer(0)
shahab-sarmashghi commented 1 year ago

Okay thank you for trying. I don't think the error we are looking for is the one above. I need to take a closer look at this, do you want to meet over zoom?

abhijitcbio commented 1 year ago

Here is my email abhijit@lji.org, we can talk more about that.

shahab-sarmashghi commented 1 year ago

Hi Abhijit,

I think I found the problem, it's so silly! It's the print statement at line 594 print(left_boundary, right_boundary) which for some reason gives error in the latest version of R! I had to upgrade to R 4.3.1 to be able to reproduce the error, it was fine with the R version I had tested BISCUT with and recommended to use. I just made a quick fix by commenting that line out. Please pull and let me know if the error is resolved on TCGA data. If it worked on TCGA data but failed on your data, you can open another issue. Thank you for your help and patience.