YevhenAkimov / DEBRA

DEBRA - DESeq-based Barcode Representation Analysis
2 stars 0 forks source link

[1] KS tests completed Error in order(res$mean) : argument 1 is not a vector #2

Open Owuorgpo opened 3 years ago

Owuorgpo commented 3 years ago

Hey, I Am trying to figure out why I get the above error every time I do not set the beta value. According to your description, threshold beta is a lower count limit for an independent filtering step (above which the NBD assumption is made). There is a defined process of estimating the beta threshold based on the barcode counts. So what is the effect of specifying a beta threshold (say I set it at (beta = 100) which seems to work for my dataset?

Thanks

YevhenAkimov commented 3 years ago

Hi, Barcodes with a low (actual) representation are subjected to the sampling effect, and their sequencing counts do not follow the NB distribution. For instance, if you have 10 copies of a particular barcode in your sample it will suffer a severe sampling error (after subsampling to, for instance, 4 subsamples the resulting numbers of this barcode in each sample could be e.g.: 1, 4, 3, 2, correspondingly, which is a pretty large error in terms of FC). Once you did the sequencing, these numbers will turn into, for example, 111,398,343,178 read counts, correspondingly, if your seq depth is about 100 reads per barcode. The beta value defines a read count threshold - barcodes with lower counts will be discarded. In the above example, I think, the beta value should be well above 1000.

If you know the number of input physical copies of barcodes (N) and the total number of barcode reads (R) you could define the beta threshold as (R/N)*C, where C is ~20.

Could you provide more information on the error you are getting (e.g. reproducible example)?

Sorry for the delayed response. Best, Yevhen

Owuorgpo commented 3 years ago

@YevhenAkimov barcode reads.txt This is the dataset I am trying to run through the default Debra, with modified and trended set as TRUE and method = "DESeq" used. Unless I set a beta threshold to any value, I keep getting the same error. [1] KS tests completed Error in order(res$mean) : argument 1 is not a vector

YevhenAkimov commented 3 years ago

@Owuorgpo Could you please attach your code, preferably, minimal reproducible example? Thanks.

Owuorgpo commented 3 years ago

Sure, here is the script as used. The error process stops at the first step of running DEBRA if the beta is not defined.

setwd() mat <- as.matrix(read.table("barcode.reads.txt", header = T)) mat <- mat[rowSums(mat < 50) <=2 , , drop = FALSE] control_names = c("Control1", "Control2", "Control3", "Control4") condition_names = c("case1", "case2", "case3", "case4") x <- DEBRA(mat, control_names=control_names, condition_names=condition_names, method="DESeq", modified = T, trended = T) drb <- testDRB(x) results=resultsDRB(drb)

YevhenAkimov commented 3 years ago

Hi, Thanks, I will check it asap.

YevhenAkimov commented 3 years ago

Hi, apparently it is a specific combination of sparsity and counts size which makes standard params fail. I would need to debug this.

For now, you can just use an extended script with, for example, max_window=50000.


drb=DEBRADataSet( counts =  mat,
                    control_names  = control_names,
                    condition_names = condition_names,
                    method=c("DESeq2(Wald)"),
                    trended=T)

drb= estimateBeta(drb,max_window=50000)
KS_plot(drb)
drb=testDRB(drb)
drb=independentFilteringDRB(drb)
res_debra=resultsDRB(drb)

This parameter max_window was used to reduce calculation times so that DEBRA does not assess the NB fitting for larger-sized read counts. Interestingly, that in your dataset you have the divergence from the NB model at read count sizes > 10000. Indeed, manual checking seems to agree with that, at least, read counts between replicas are too different. E.g.: BC57 | 161656 | 56389 | 163964 | 131756 Luckily you don't have a lot of these.

Are these actually replicas? If so, I think this could be due to unequal PCR conditions, like different input numbers of barcode copies per reaction or different cycling conditions.