RegulatoryGenomicsGroup / chicdiff

A differential caller for capture Hi-C data
4 stars 3 forks source link

Error in argument Type #20

Closed RomeroMatt closed 8 months ago

RomeroMatt commented 8 months ago

Hello, This is my first time using Chicdiff and I am running into some problems when using my own data. I was able to run the test dataset correctly.

I set up my files exactly as the test dataset (to the best of my knowledge), but when running the chicdiffPipeline I get two errors:

1) *** Running getRegionUniverse

Error in readAndFilterPeakMatrix(peakFiles = peakFiles, score = score, : All specified targetColumns must be present in the peak file(s)

I thought this was related to my targetColumns not having the correct names in chicdiff.settings so I changed them via this command in order to correctly name the targetColumns (the previous targetColumns names lacked the "_peaks": chicdiff.settings[["targetColumns"]] <- c("h9_pchic1_peaks", "h9_pchic2_peaks", "fetal_pchic1_peaks", "fetal_pchic2_peaks")

2) after running the above command I receive this error message when trying to run chicdiffpipeline:

*** Running getRegionUniverse

Error in !is.na(.SD[, cols, with = FALSE]) : invalid argument type

I'm not exactly sure what is wrong. It could be my target columns or the NA's in my dataset?

Any help/insight would be great!

Thanks for the tools! -Matt

mspivakov commented 8 months ago

Hi Matt

It is a bit difficult to say what is happening from this description. May suggest that you run debug(getRegionUniverse) and execute this function step by step with your input data? This may make it clearer what exactly the problem is.

On Thu, 11 Jan 2024, 02:04 RomeroMatt, @.***> wrote:

Hello, This is my first time using Chicdiff and I am running into some problems when using my own data. I was able to run the test dataset correctly.

I set up my files exactly as the test dataset (to the best of my knowledge), but when running the chicdiffPipeline I get two errors:

  1. *** Running getRegionUniverse

Error in readAndFilterPeakMatrix(peakFiles = peakFiles, score = score, : All specified targetColumns must be present in the peak file(s)

I thought this was related to my targetColumns not having the correct names in chicdiff.settings so I changed them via this command in order to correctly name the targetColumns (the previous targetColumns names lacked the "_peaks": chicdiff.settings[["targetColumns"]] <- c("h9_pchic1_peaks", "h9_pchic2_peaks", "fetal_pchic1_peaks", "fetal_pchic2_peaks")

  1. after running the above command I receive this error message when trying to run chicdiffpipeline:

*** Running getRegionUniverse

Error in !is.na(.SD[, cols, with = FALSE]) : invalid argument type

I'm not exactly sure what is wrong. It could be my target columns or the NA's in my dataset?

Any help/insight would be great!

Thanks for the tools! -Matt

— Reply to this email directly, view it on GitHub https://github.com/RegulatoryGenomicsGroup/chicdiff/issues/20, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMVB2YG3X7PGWZ4C53C33DYN5CETAVCNFSM6AAAAABBVXNHP6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGA3TKNJWGY2TENQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

RomeroMatt commented 8 months ago

Thanks @mspivakov! I ran debug() and received this message: Error in !(all.equal(names(chicagoData), names(countData))) : invalid argument type In addition: Warning message: In is.na(chicagoData) : is.na() applied to non-(list or vector) of type 'closure'

Does this help? The only thing I can think of is that my sample names may not match? I ran chicago with one file name, but for ease I changed the .chinput, .Rds outputs to another file name. Could the original file name be annotated somewhere in the .chinput or .Rds file that could be causing this error? I noticed in the .chinput it had the original file name in the header. However, I don't see the old file name in the .Rds file.

I hope this makes sense. Thanks! -Matt

mspivakov commented 8 months ago

My feeling is there is something wrong with the chicagoData vector. Can you check what it is?

On Fri, 12 Jan 2024, 22:39 RomeroMatt, @.***> wrote:

Thanks @mspivakov https://github.com/mspivakov! I ran debug() and received this message: Error in !(all.equal(names(chicagoData), names(countData))) : invalid argument type In addition: Warning message: In is.na(chicagoData) : is.na() applied to non-(list or vector) of type 'closure'

Does this help? The only thing I can think of is that my sample names may not match? I ran chicago with one file name, but for ease I changed the .chinput, .Rds outputs to another file name. Could the original file name be annotated somewhere in the .chinput or .Rds file that could be causing this error? I noticed in the .chinput it had the original file name in the header. However, I don't see the old file name in the .Rds file.

I hope this makes sense. Thanks! -Matt

— Reply to this email directly, view it on GitHub https://github.com/RegulatoryGenomicsGroup/chicdiff/issues/20#issuecomment-1890075163, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMVB24I2JSIHQ74LTBNGRTYOG3THAVCNFSM6AAAAABBVXNHP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJQGA3TKMJWGM . You are receiving this because you were mentioned.Message ID: @.***>

RomeroMatt commented 8 months ago

Sure thing. When I look at the file using readRDS() I receive the output:

readRDS(file = "rds/h9_pchic1.Rds") An object of class "chicagoData" Slot 'x': baitID otherEndID distbin s_j N otherEndLen distSign isBait2bait refBinMean s_i NNb NNboe tlb tblb Tmean 1: 150 654 0.2522839 2 5070 2668071 FALSE NA 0.9379224 8 9 (19,24] [ 86, 112) 0.0007526919 2: 150 791 0.2522839 1 5133 3375482 FALSE NA 1.0509822 4 4 [0,5] [ 86, 112) 0.0002732269 3: 150 833 0.2522839 1 5159 3591608 FALSE NA 1.0467986 4 4 (7,9] [ 86, 112) 0.0004597047 4: 150 1036 0.2522839 1 5447 4642873 TRUE NA 1.1856523 4 3 (24,32]B2B [ 86, 112) 0.0010240360 5: 150 1118 0.2522839 1 5039 5066841 TRUE NA 1.4224122 4 3 (80,107]B2B [ 86, 112) 0.0023579282


14536707: 561087 560756 (1.7e+06,1.72e+06] 1.4868552 1 5126 -1711883 FALSE 0.1717899 0.9737189 1 1 (16,19] [394, 432) 0.0013842182 14536708: 561087 560734 (1.8e+06,1.82e+06] 1.4868552 1 5015 -1824811 FALSE 0.1684363 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536709: 561087 560736 (1.8e+06,1.82e+06] 1.4868552 1 5371 -1814584 FALSE 0.1684363 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536710: 561087 560733 (1.82e+06,1.85e+06] 1.4868552 1 5127 -1829883 FALSE 0.1684970 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536711: 561087 560717 (1.9e+06,1.92e+06] 1.4868552 1 5079 -1913529 FALSE 0.1660765 1.0138742 1 1 (11,13] [394, 432) 0.0010937541 Bmean log.p log.w log.q score 1: 0.03407789 -6.816363 3.434031 -10.250394 2.641663 2: 0.03497283 -3.379290 2.834952 -6.214242 0.000000 3: 0.03403500 -3.399949 2.676528 -6.076477 0.000000 4: 0.03502288 -3.356894 2.021421 -5.378315 0.000000 5: 0.04066656 -3.185377 1.799005 -4.984382 0.000000


14536707: 0.24738431 -1.609595 4.547645 -6.157240 0.000000 14536708: 0.22110701 -1.696661 4.389699 -6.086360 0.000000 14536709: 0.22162264 -1.694772 4.403638 -6.098410 0.000000 14536710: 0.22085397 -1.697590 4.382812 -6.080402 0.000000 14536711: 0.24590371 -1.615559 4.271655 -5.887213 0.000000 Slot "params": distFunParams, dispersion.samples, dispersion. Slot "settings": length 30. Non-defaults are: rmapfile baitmapfile "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/hg38_chicago_input_5kb.rmap" "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/hg38_chicago_input_5kb.baitmap" nperbinfile nbaitsperbinfile "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.npb" "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.nbpb" proxOEfile maxLBrownEst "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.poe" "2e+06" minFragLen maxFragLen "4000" "6000" binsize "25000"

but I don't see anything immediately wrong other than "NAs" that are present. Do NA's throw an error? I noticed in my peak files there are NA's as well. However, the peakfiles in chicdiffData do not have NA's. Could this be an issue?

mspivakov commented 8 months ago

I think at this stage the easiest would be if you could share with me a reproducible example with me and the input data (feel free to subset to a small chunk as long as it reproduces the exact issue), and I'll have a look at my end.

On Fri, 12 Jan 2024, 23:14 RomeroMatt, @.***> wrote:

Sure thing. When I look at the file using readRDS() I receive the output:

readRDS(file = "rds/h9_pchic1.Rds") An object of class "chicagoData" Slot 'x': baitID otherEndID distbin s_j N otherEndLen distSign isBait2bait refBinMean s_i NNb NNboe tlb tblb Tmean 1: 150 654 0.2522839 2 5070 2668071 FALSE NA 0.9379224 8 9 (19,24] [ 86, 112) 0.0007526919 2: 150 791 0.2522839 1 5133 3375482 FALSE NA 1.0509822 4 4 [0,5] [ 86, 112) 0.0002732269 3: 150 833 0.2522839 1 5159 3591608 FALSE NA 1.0467986 4 4 (7,9] [ 86, 112) 0.0004597047 4: 150 1036 0.2522839 1 5447 4642873 TRUE NA 1.1856523 4 3 (24,32]B2B [ 86, 112) 0.0010240360 5: 150 1118 0.2522839 1 5039 5066841 TRUE NA 1.4224122 4 3 (80,107]B2B [ 86, 112) 0.0023579282

14536707: 561087 560756 (1.7e+06,1.72e+06] 1.4868552 1 5126 -1711883 FALSE 0.1717899 0.9737189 1 1 (16,19] [394, 432) 0.0013842182 14536708: 561087 560734 (1.8e+06,1.82e+06] 1.4868552 1 5015 -1824811 FALSE 0.1684363 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536709: 561087 560736 (1.8e+06,1.82e+06] 1.4868552 1 5371 -1814584 FALSE 0.1684363 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536710: 561087 560733 (1.82e+06,1.85e+06] 1.4868552 1 5127 -1829883 FALSE 0.1684970 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536711: 561087 560717 (1.9e+06,1.92e+06] 1.4868552 1 5079 -1913529 FALSE 0.1660765 1.0138742 1 1 (11,13] [394, 432) 0.0010937541 Bmean log.p log.w log.q score 1: 0.03407789 -6.816363 3.434031 -10.250394 2.641663 2: 0.03497283 -3.379290 2.834952 -6.214242 0.000000 3: 0.03403500 -3.399949 2.676528 -6.076477 0.000000 4: 0.03502288 -3.356894 2.021421 -5.378315 0.000000 5: 0.04066656 -3.185377 1.799005 -4.984382 0.000000

14536707: 0.24738431 -1.609595 4.547645 -6.157240 0.000000 14536708: 0.22110701 -1.696661 4.389699 -6.086360 0.000000 14536709: 0.22162264 -1.694772 4.403638 -6.098410 0.000000 14536710: 0.22085397 -1.697590 4.382812 -6.080402 0.000000 14536711: 0.24590371 -1.615559 4.271655 -5.887213 0.000000 Slot "params": distFunParams, dispersion.samples, dispersion. Slot "settings": length 30. Non-defaults are: rmapfile baitmapfile "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/hg38_chicago_input_5kb.rmap" "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/hg38_chicago_input_5kb.baitmap" nperbinfile nbaitsperbinfile "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.npb" "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.nbpb" proxOEfile maxLBrownEst "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.poe" "2e+06" minFragLen maxFragLen "4000" "6000" binsize "25000"

but I don't see anything immediately wrong other than "NAs" that are present. Do NA's throw an error? I noticed in my peak files there are NA's as well. However, the peakfiles in chicdiffData do not have NA's. Could this be an issue?

— Reply to this email directly, view it on GitHub https://github.com/RegulatoryGenomicsGroup/chicdiff/issues/20#issuecomment-1890130516, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMVB24QOXVPWJKSGQBMHPTYOG7WPAVCNFSM6AAAAABBVXNHP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJQGEZTANJRGY . You are receiving this because you were mentioned.Message ID: @.***>

mspivakov commented 8 months ago

... the only thing is - am I right that in the first set of rows printed, unlike in the last set, distbin is missing?

On Sat, 13 Jan 2024, 07:59 Mikhail Spivakov, @.***> wrote:

I think at this stage the easiest would be if you could share with me a reproducible example with me and the input data (feel free to subset to a small chunk as long as it reproduces the exact issue), and I'll have a look at my end.

On Fri, 12 Jan 2024, 23:14 RomeroMatt, @.***> wrote:

Sure thing. When I look at the file using readRDS() I receive the output:

readRDS(file = "rds/h9_pchic1.Rds") An object of class "chicagoData" Slot 'x': baitID otherEndID distbin s_j N otherEndLen distSign isBait2bait refBinMean s_i NNb NNboe tlb tblb Tmean 1: 150 654 0.2522839 2 5070 2668071 FALSE NA 0.9379224 8 9 (19,24] [ 86, 112) 0.0007526919 2: 150 791 0.2522839 1 5133 3375482 FALSE NA 1.0509822 4 4 [0,5] [ 86, 112) 0.0002732269 3: 150 833 0.2522839 1 5159 3591608 FALSE NA 1.0467986 4 4 (7,9] [ 86, 112) 0.0004597047 4: 150 1036 0.2522839 1 5447 4642873 TRUE NA 1.1856523 4 3 (24,32]B2B [ 86, 112) 0.0010240360 5: 150 1118 0.2522839 1 5039 5066841 TRUE NA 1.4224122 4 3 (80,107]B2B [ 86, 112) 0.0023579282

14536707: 561087 560756 (1.7e+06,1.72e+06] 1.4868552 1 5126 -1711883 FALSE 0.1717899 0.9737189 1 1 (16,19] [394, 432) 0.0013842182 14536708: 561087 560734 (1.8e+06,1.82e+06] 1.4868552 1 5015 -1824811 FALSE 0.1684363 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536709: 561087 560736 (1.8e+06,1.82e+06] 1.4868552 1 5371 -1814584 FALSE 0.1684363 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536710: 561087 560733 (1.82e+06,1.85e+06] 1.4868552 1 5127 -1829883 FALSE 0.1684970 0.8943896 1 1 (24,202] [394, 432) 0.0021671042 14536711: 561087 560717 (1.9e+06,1.92e+06] 1.4868552 1 5079 -1913529 FALSE 0.1660765 1.0138742 1 1 (11,13] [394, 432) 0.0010937541 Bmean log.p log.w log.q score 1: 0.03407789 -6.816363 3.434031 -10.250394 2.641663 2: 0.03497283 -3.379290 2.834952 -6.214242 0.000000 3: 0.03403500 -3.399949 2.676528 -6.076477 0.000000 4: 0.03502288 -3.356894 2.021421 -5.378315 0.000000 5: 0.04066656 -3.185377 1.799005 -4.984382 0.000000

14536707: 0.24738431 -1.609595 4.547645 -6.157240 0.000000 14536708: 0.22110701 -1.696661 4.389699 -6.086360 0.000000 14536709: 0.22162264 -1.694772 4.403638 -6.098410 0.000000 14536710: 0.22085397 -1.697590 4.382812 -6.080402 0.000000 14536711: 0.24590371 -1.615559 4.271655 -5.887213 0.000000 Slot "params": distFunParams, dispersion.samples, dispersion. Slot "settings": length 30. Non-defaults are: rmapfile baitmapfile "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/hg38_chicago_input_5kb.rmap" "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/hg38_chicago_input_5kb.baitmap" nperbinfile nbaitsperbinfile "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.npb" "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.nbpb" proxOEfile maxLBrownEst "/home/mromero/CHiC/chicagoTools/hg38_ref/5kb/GW_PCHIC_hg38_5kb.poe" "2e+06" minFragLen maxFragLen "4000" "6000" binsize "25000"

but I don't see anything immediately wrong other than "NAs" that are present. Do NA's throw an error? I noticed in my peak files there are NA's as well. However, the peakfiles in chicdiffData do not have NA's. Could this be an issue?

— Reply to this email directly, view it on GitHub https://github.com/RegulatoryGenomicsGroup/chicdiff/issues/20#issuecomment-1890130516, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMVB24QOXVPWJKSGQBMHPTYOG7WPAVCNFSM6AAAAABBVXNHP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJQGEZTANJRGY . You are receiving this because you were mentioned.Message ID: @.***>

RomeroMatt commented 8 months ago

Hi @mspivakov , thanks for the help! I have subsetted data and attached here. It produces the same error on my end. I've also included my script to help. I get two error messages. one is related to column names: `*** Running getRegionUniverse

Error in readAndFilterPeakMatrix(peakFiles = peakFiles, score = score, : All specified targetColumns must be present in the peak file(s)`

and the other is after I add the column names with the below code: chicdiff.settings[["targetColumns"]] <- c("h9_pchic1_peaks", "h9_pchic2_peaks", "fetal_pchic1_peaks", "fetal_pchic2_peaks")

Thanks for the help. Please let me know if I'm missing anything. [Uploading chicDiff_subsetData_RomeroMatt.tar.gz…]()

mspivakov commented 8 months ago

Hi Matt, I started looking at your data, but realised that I don't have the peak file(s). Can you please share them as well for the subsetted data, as well as your code to run chicdiff? Thanks!

On Tue, 16 Jan 2024 at 21:39, RomeroMatt @.***> wrote:

Hi @mspivakov https://github.com/mspivakov , thanks for the help! I have subsetted data and attached here. It produces the same error on my end. I've also included my script to help. I get two error messages. one is related to column names: `*** Running getRegionUniverse

Error in readAndFilterPeakMatrix(peakFiles = peakFiles, score = score, : All specified targetColumns must be present in the peak file(s)`

and the other is after I add the column names with the below code: chicdiff.settings[["targetColumns"]] <- c("h9_pchic1_peaks", "h9_pchic2_peaks", "fetal_pchic1_peaks", "fetal_pchic2_peaks")

Thanks for the help. Please let me know if I'm missing anything. chicDiff_subsetData_RomeroMatt.tar.gz https://github.com/RegulatoryGenomicsGroup/chicdiff/files/13956017/chicDiff_subsetData_RomeroMatt.tar.gz

— Reply to this email directly, view it on GitHub https://github.com/RegulatoryGenomicsGroup/chicdiff/issues/20#issuecomment-1894557389, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMVB2Z5T22IWHIEK2QLWSDYO3XRFAVCNFSM6AAAAABBVXNHP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOJUGU2TOMZYHE . You are receiving this because you were mentioned.Message ID: @.***>

RomeroMatt commented 8 months ago

Sorry @mspivakov ! Thanks for checking and for the help.

I have included the peaks file and the code that I used to run chicdiff, "subset_Chicdiff.R" subsetData_chicdiff_RomeroMatt.tar.gz

Thanks again!

mspivakov commented 8 months ago

Fortunately, this particular error was straightforward to fix. The names of entries in the countData/chicagoData lists should be the same as the respective column names in the peakFile(s), which was not the case.

Amending the definitions of these lists as follows has got me past that error:

countData <- list(
  h9 = c(h9_pchic1_peaks = "subset_h9_pchic1.chinput",
         h9_pchic2_peaks = "subset_h9_pchic2.chinput"
  ),

  fetal = c(fetal_pchic1_peaks = "subset_fetal_pchic1.chinput",
            fetal_pchic2_peaks = "subset_fetal_pchic2.chinput"
  )
)

# setting up rds files

# setting up chicagoData
chicagoData <- list(
  h9 = c(h9_pchic1_peaks = "subset_h9_pchic1.Rds",
         h9_pchic2_peaks = "subset_h9_pchic2.Rds"
  ),

  fetal = c(fetal_pchic1_peaks = "subset_fetal_pchic1.Rds",
            fetal_pchic2_peaks = "subset_fetal_pchic2.Rds"
  )
)

I then did get another error (as follows), but I suspect it could be due to the tiny subset of the data it is run on.

*** Running getControlRegionUniverse

Error in setnames(x, value) : 
  Can't assign 3 names to a 2 column data.table

Let me know if the problem persists on the full dataset.

RomeroMatt commented 8 months ago

Oh my!!! What a mistake on my side of things! I changed my names based on your suggestions and it worked! However, it didn't take, but a few mins (longer than the test data set, but shorter than the estimated 30mins in the vignette). I'll look through the data and see if the QC looks ok. These datasets were not large to begin with as our samples are rare so it makes sense that it wouldn't take that long.

Thanks so much!!! -Matt

mspivakov commented 8 months ago

Glad it worked!

On Fri, 19 Jan 2024, 19:49 RomeroMatt, @.***> wrote:

Oh my!!! What a mistake on my side of things! I changed my names based on your suggestions and it worked! However, it didn't take, but a few mins (longer than the test data set, but shorter than the estimated 30mins in the vignette). I'll look through the data and see if the QC looks ok. These datasets were not large to begin with as our samples are rare so it makes sense that it wouldn't take that long.

Thanks so much!!! -Matt

— Reply to this email directly, view it on GitHub https://github.com/RegulatoryGenomicsGroup/chicdiff/issues/20#issuecomment-1901014543, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMVB277JXL4RWP57Q7VAHTYPLE5LAVCNFSM6AAAAABBVXNHP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBRGAYTINJUGM . You are receiving this because you were mentioned.Message ID: @.***>