eonurk / cinaR

A differential and enrichment analyses pipeline for bulk ATAC-seq (and RNA-seq)
https://eonurk.github.io/cinaR/
12 stars 3 forks source link

Error in .get_data_frame_col_as_numeric(df, granges_cols[["start"]]) : some values in the "Start" column cannot be turned into numeric values #6

Closed sngholee closed 1 year ago

sngholee commented 1 year ago

Hello,

I am attempting to run cinaR with this code. I have 20 samples, composed of 2 replicates of 10. results <- cinaR(consensus_matrix, contrasts, reference.genome = "mm10", additional.covariates = c(rep("H", 5), rep("L", 5), rep("H", 5), rep("L", 5)), batch.correction = T, batch.information = c(rep(0, 10), rep(1,10)))

I have given this as contrasts: contrasts<- c("H0", "H2", "H24", "H48", "H72", "L0", "L2", "L24", "L48", "L72","H0", "H2", "H24", "H48", "H72", "L0", "L2", "L24", "L48", "L72")

head(consensus_matrix) shows the below (only the fifrst 3 samples shown)

   chr   start     end Hi_0_REP1.mLb.clN.sorted.bam Hi_2_REP1.mLb.clN.sorted.bam Hi_24_REP1.mLb.clN.sorted.bam
1 chr1 3008684 3009119                            8                           12                             1
2 chr1 3012311 3012785                            8                            3                             1
3 chr1 3037464 3037989                            7                           20                             0
4 chr1 3046437 3046652                            6                            4                             0
5 chr1 3049581 3049922                            8                            4                             0
6 chr1 3053849 3054004                            0                           10                             0

dim(consensus_matrix) [1] 312759 23

The traceback of the code shows:

Error in .get_data_frame_col_as_numeric(df, granges_cols[["start"]]) : some values in the "Start" column cannot be turned into numeric values
8. stop(wmsg("some values in the ", "\"", names(df)[[col]], "\" ", "column cannot be turned into numeric values"))
7. .get_data_frame_col_as_numeric(df, granges_cols[["start"]])
6. makeGRangesFromDataFrame(from, keep.extra.columns = TRUE)
5. asMethod(object)
4. as(seqnames, "GRanges")
3. GenomicRanges::GRanges(bed)
2. annotatePeaks(cp.filtered, reference.genome = reference.genome, show.annotation.pie = show.annotation.pie, verbose = verbose)
1. cinaR(consensus_matrix, contrasts, reference.genome = "mm10", additional.covariates = c(rep("H", 5), rep("L", 5), rep("H", 5), rep("L", 5)), batch.correction = T, batch.information = c(rep(0, 10), rep(1, 10)))

I've checked that the start and end columns contain only numeric values and no NAs. I'd appreciate any advice.

Thank you

eonurk commented 1 year ago

Can you check whether you can run the example script?

data(atac_seq_consensus_bm) # calls 'bed'

# a vector for comparing the examples
contrasts <- sapply(strsplit(colnames(bed), split = "-", fixed = TRUE),
                    function(x){x[1]})[4:25]

results <- cinaR(bed, contrasts, reference.genome = "mm10")
sngholee commented 1 year ago

I get this error when running that on the example dataset you provided:

>> Experiment type: ATAC-Seq
>> Matrix is filtered!
Package "TxDb.Mmusculus.UCSC.mm10.knownGene" needed for this
             function to work. Please install it.FALSE
Error in abs(cp.filtered.annotated$distanceToTSS) : 
  non-numeric argument to mathematical function
sngholee commented 1 year ago

It seems I was missing the "TxDb.Mmusculus.UCSC.mm10.knownGene" package, so I installed it, and the example code worked. However, I'm still getting the same error on my own dataset.

eonurk commented 1 year ago

Can you run str function on your consensus_peak data and also on the example bed data?

sngholee commented 1 year ago

Yes.

#The example bed file
> str(bed)
'data.frame':   1000 obs. of  25 variables:
 $ Chr                        : chr  "chr5" "chr17" "chr8" "chr4" ...
 $ Start                      : num  2.48e+07 8.16e+06 4.06e+07 1.45e+08 1.81e+08 ...
 $ End                        : num  2.48e+07 8.16e+06 4.06e+07 1.45e+08 1.81e+08 ...
 $ B6-18mo-M-BM-47-GT18-01783 : num  1592 109 72 110 2452 ...
 $ B6-18mo-F-BM-42-GT18-01780 : num  1261 79 54 85 1825 ...
 $ B6-18mo-M-BM-45-GT18-01781 : num  830 90 44 65 1196 ...
 $ B6-18mo-F-BM-40-GT18-01778 : num  2127 99 90 146 2976 ...
 $ B6-18mo-F-BM-41-GT18-01779 : num  1870 115 42 97 2691 ...
 $ NZO-18mo-F-BM-35-GT18-03804: num  961 184 56 102 1507 ...
 $ NZO-18mo-F-BM-36-GT18-03805: num  1021 201 59 89 1502 ...
 $ NZO-18mo-F-BM-42-GT18-03806: num  698 202 43 74 1284 ...
 $ NZO-18mo-M-BM-38-GT18-03807: num  724 201 37 99 1130 61 108 115 113 28 ...
 $ NZO-18mo-M-BM-39-GT18-03808: num  578 147 47 95 958 39 74 125 117 39 ...
 $ NZO-18mo-M-BM-40-GT18-03809: num  529 150 24 74 825 34 53 55 68 15 ...
 $ B6-3mo-F-BM-4-GT18-04678   : num  1648 141 70 120 2673 ...
 $ B6-3mo-F-BM-5-GT18-04679   : num  633 113 34 69 967 36 54 78 132 11 ...
 $ B6-3mo-M-BM-1-GT18-04680   : num  2174 203 102 137 3194 ...
 $ B6-3mo-M-BM-2-GT18-04681   : num  646 107 42 74 1155 ...
 $ B6-3mo-M-BM-3-GT18-04682   : num  787 171 35 103 1421 ...
 $ NZO-3mo-F-BM-245-GT18-10918: num  763 149 29 74 1438 ...
 $ NZO-3mo-F-BM-243-GT18-10916: num  775 156 35 47 1215 ...
 $ NZO-3mo-M-BM-251-GT18-10919: num  737 175 16 51 1257 ...
 $ NZO-3mo-M-BM-253-GT18-10921: num  898 163 32 68 1515 ...
 $ NZO-3mo-F-BM-244-GT18-10917: num  702 173 26 70 1263 ...
 $ NZO-3mo-M-BM-252-GT18-10920: num  705 222 44 82 1370 65 84 130 136 15 ...
#My own consensus_matrix data 

> str(consensus_matrix)
'data.frame':   312759 obs. of  23 variables:
 $ chr                          : chr  "chr1" "chr1" "chr1" "chr1" ...
 $ start                        : int  3008684 3012311 3037464 3046437 3049581 3053849 3062522 3066449 3083336 3118951 ...
 $ end                          : int  3009119 3012785 3037989 3046652 3049922 3054004 3063633 3066692 3083661 3119290 ...
 $ Hi_0_REP1.mLb.clN.sorted.bam : int  8 8 7 6 8 0 40 8 6 4 ...
 $ Hi_2_REP1.mLb.clN.sorted.bam : int  12 3 20 4 4 10 49 7 7 3 ...
 $ Hi_24_REP1.mLb.clN.sorted.bam: int  1 1 0 0 0 0 5 2 1 1 ...
 $ Hi_48_REP1.mLb.clN.sorted.bam: int  7 9 10 10 2 2 52 6 9 6 ...
 $ Hi_72_REP1.mLb.clN.sorted.bam: int  8 6 17 7 4 0 62 8 12 8 ...
 $ Lo_0_REP1.mLb.clN.sorted.bam : int  7 10 14 7 7 7 58 3 9 12 ...
 $ Lo_2_REP1.mLb.clN.sorted.bam : int  11 10 11 3 4 2 73 7 13 11 ...
 $ Lo_24_REP1.mLb.clN.sorted.bam: int  6 4 6 5 2 1 62 10 4 4 ...
 $ Lo_48_REP1.mLb.clN.sorted.bam: int  4 7 8 5 3 2 68 6 3 7 ...
 $ Lo_72_REP1.mLb.clN.sorted.bam: int  13 4 8 6 5 0 51 11 10 5 ...
 $ Hi_0_REP2.mLb.clN.sorted.bam : int  8 7 12 12 3 0 103 16 4 7 ...
 $ Hi_2_REP2.mLb.clN.sorted.bam : int  12 21 13 12 3 8 153 16 2 16 ...
 $ Hi_24_REP2.mLb.clN.sorted.bam: int  7 16 8 4 8 2 142 11 5 10 ...
 $ Hi_48_REP2.mLb.clN.sorted.bam: int  11 24 8 13 12 7 164 16 8 9 ...
 $ Hi_72_REP2.mLb.clN.sorted.bam: int  4 12 6 17 6 3 110 13 6 6 ...
 $ Lo_0_REP2.mLb.clN.sorted.bam : int  12 10 6 12 13 4 77 9 5 6 ...
 $ Lo_2_REP2.mLb.clN.sorted.bam : int  5 9 14 12 6 4 98 13 8 14 ...
 $ Lo_24_REP2.mLb.clN.sorted.bam: int  3 12 8 7 4 1 103 6 3 10 ...
 $ Lo_48_REP2.mLb.clN.sorted.bam: int  4 18 13 8 6 2 123 13 5 9 ...
 $ Lo_72_REP2.mLb.clN.sorted.bam: int  1 10 14 4 7 2 109 7 5 7 ...
eonurk commented 1 year ago

Can you try following:

consensus_matrix[2:nrow(consensus_matrix)] <- lapply(consensus_matrix[2:nrow(consensus_matrix)], as.numeric)

This should make all integers numeric, I don't see how that would affect the code but better if we give it a shot. If this does not work, I think best way would be me try to replicate your error with your consensus matrix, you can mail it to me if you can.

sngholee commented 1 year ago

I tried this with ncol rather than nrow.

consensus_matrix[2:ncol(consensus_matrix)] <- lapply(consensus_matrix[2:ncol(consensus_matrix)], as.numeric) But I am still getting the same error. I will mail you the matrix. Thank you

eonurk commented 1 year ago

Hi again,

Sorry for the late reply, busy weeks.

Looking at your data, I noticed you have unusual chromosome annotations, so you need to filter these out:

> unique(cp$chr)[grepl("random|Un", unique(cp$chr))]
 [1] "chr1_GL456210_random" "chr1_GL456211_random" "chr1_GL456212_random" "chr1_GL456213_random"
 [5] "chr1_GL456221_random" "chr4_GL456216_random" "chr4_GL456350_random" "chr4_JH584293_random"
 [9] "chr4_JH584294_random" "chr4_JH584295_random" "chr5_GL456354_random" "chr5_JH584296_random"
[13] "chr5_JH584297_random" "chr5_JH584298_random" "chr5_JH584299_random" "chr7_GL456219_random"
[17] "chrUn_GL456239"       "chrUn_GL456359"       "chrUn_GL456360"       "chrUn_GL456366"      
[21] "chrUn_GL456367"       "chrUn_GL456368"       "chrUn_GL456370"       "chrUn_GL456372"      
[25] "chrUn_GL456378"       "chrUn_GL456379"       "chrUn_GL456382"       "chrUn_GL456385"      
[29] "chrUn_GL456387"       "chrUn_GL456389"       "chrUn_GL456392"       "chrUn_GL456393"      
[33] "chrUn_JH584304"       "chrX_GL456233_random" "chrY_JH584300_random" "chrY_JH584301_random"
[37] "chrY_JH584302_random" "chrY_JH584303_random"

I did that with cp.filtered <- cp[!grepl("random|Un|chrM", cp$chr),]. Then, ran cinaR without the additional.batch information as you already have that info within your contrasts. Also, I did not run the enrichment analysis as there are some DA results which are empty. I need to update the pipeline to take care of this small error, but there is a way around it, first run DA.

results <- cinaR::cinaR(cp.filtered, contrasts, reference.genome = "mm10", batch.correction = T, batch.information = c(rep(0, 10), rep(1,10)), run.enrichment = F)

and then remove empty DA results: results$DA.peaks <- Filter(length, results$DA.peaks)

finally run enrichment again: results.enrichment <- cinaR::run_enrichment(results, reference.genome = "mm10")

Hope these helps!

Best, Onur