WeiqiangZhou / BIRD

Big data Regression for predicting DNase I hypersensitivity

File not found when building model #3

Closed ShanSabri closed 4 years ago

ShanSabri commented 4 years ago

par_file.txt (all files exist)

NumLoci 307157
NumGene 25700
NumCluster  1500
NumBin  200
NumVar  7
DHCluterNum1    1000
DHCluterNum2    2000
GeneQuantile    /Users/shansabri/Dropbox/ErnstLab/decon-proj/gene_quantile.txt
GeneMean    /Users/shansabri/Dropbox/ErnstLab/decon-proj/gene_mean.txt
GeneSD  /Users/shansabri/Dropbox/ErnstLab/decon-proj/gene_sd.txt
ClusterIndex    /Users/shansabri/Dropbox/ErnstLab/decon-proj/cluster_idx.txt
DNaseMean   /Users/shansabri/Dropbox/ErnstLab/decon-proj/DNase_mean.txt
DNaseSD /Users/shansabri/Dropbox/ErnstLab/decon-proj/DNase_sd.txt
RegressionCoef  /Users/shansabri/Dropbox/ErnstLab/decon-proj/regress_coef.txt
RegressionPredictor /Users/shansabri/Dropbox/ErnstLab/decon-proj/regress_predictor.txt
GenomicLoci /Users/shansabri/Dropbox/ErnstLab/decon-proj/genomic_loci.txt
GeneName    /Users/shansabri/Dropbox/ErnstLab/decon-proj/gene_name.txt
DistanceMatrix  /Users/shansabri/Dropbox/ErnstLab/decon-proj/distance_matrix.txt
DHCluster   /Users/shansabri/Dropbox/ErnstLab/decon-proj/DH_cluster_matrix.txt
DHClusterCoef1  /Users/shansabri/Dropbox/ErnstLab/decon-proj/DH_cluster1000_coef.txt
DHClusterCoef2  /Users/shansabri/Dropbox/ErnstLab/decon-proj/DH_cluster2000_coef.txt
DHClusterPredictor1 /Users/shansabri/Dropbox/ErnstLab/decon-proj/DH_cluster1000_predictor.txt
DHClusterPredictor2 /Users/shansabri/Dropbox/ErnstLab/decon-proj/DH_cluster2000_predictor.txt

Building library

Reading ../../par_file.txt...
NumLoci=307157
NumGene=25700
NumCluster=1500
NumBin=200
NumVar=7
DHCluterNum1=1000
DHCluterNum2=2000
Sucessfully read in parameter file.
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/gene_quantile.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/gene_mean.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/gene_sd.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/cluster_idx.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/DNase_mean.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/DNase_sd.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/regress_coef.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/regress_predictor.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/genomic_loci.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/gene_name.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/distance_matrix.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/DH_cluster_matrix.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/DH_cluster1000_coef.txt...
Reading /Users/shansabri/Dropbox/ErnstLab/decon-proj/DH_cluster2000_coef.txt...
Error! File  not found!

There seems to be a hang-up when trying to read DH_cluster1000_predictor.txt. This file does exist at the path given in the parameter file. The head of the file looks like this:

652 281 1267    1181    310 149 921
552 1060    709 749 358 1142    139
227 660 168 990 599 59  1435
1008    628 505 192 1186    1350    857
1242    1125    1080    1107    1356    156 750
1356    885 926 143 156 1079    198
1356    143 885 420 264 926 1075
113 227 503 119 1008    968 599
1324    1441    998 88  815 596 819
1129    287 600 1315    420 1037    151

Any thoughts?

EDIT: Digging into the source, it seems as though BIRD expects three coef files and three corresponding predictor files. Is there a particular reason why?

WeiqiangZhou commented 4 years ago

Could you try adding one more level for the DH cluster model (e.g., "DHCluterNum3 5000") and the required files (e.g., "DHClusterCoef3 DH_cluster5000_coef.txt" and "DHClusterPredictor3 DH_cluster5000_predictor.txt")? The current build-library function requires three levels of DH clusters.
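For anyone hitting the same error, a quick pre-flight check of the parameter file can show which of the three DH cluster levels is missing before running the build. The sketch below is not part of BIRD: the helper names (`parse_par_file`, `missing_cluster_keys`, `missing_files`) are hypothetical, and the key names are copied from the par file and the suggestion in this thread. The blank file name in "Error! File  not found!" is consistent with a third coefficient path that was never set, though that reading is an inference from the log and has not been checked against the source.

```python
import os

# Number of DH cluster levels the build step requires, per the maintainer's reply.
REQUIRED_LEVELS = 3

# Key prefixes as spelled in the par file in this thread
# (note "DHCluterNum", which is how the parameter file spells it).
LEVEL_PREFIXES = ("DHCluterNum", "DHClusterCoef", "DHClusterPredictor")


def parse_par_file(text):
    """Parse whitespace-separated 'key value' lines into a dict."""
    params = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of spaces or tabs
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params


def missing_cluster_keys(params, levels=REQUIRED_LEVELS):
    """Return the DH cluster keys that are absent for levels 1..levels."""
    missing = []
    for level in range(1, levels + 1):
        for prefix in LEVEL_PREFIXES:
            key = f"{prefix}{level}"
            if key not in params:
                missing.append(key)
    return missing


def missing_files(params):
    """Return keys whose non-numeric value is not an existing file."""
    return [
        key
        for key, value in params.items()
        if not value.isdigit() and not os.path.isfile(value)
    ]


# A par file with only two DH cluster levels, as in the original report
# (file names shortened for illustration).
two_level_par = """\
DHCluterNum1    1000
DHCluterNum2    2000
DHClusterCoef1  DH_cluster1000_coef.txt
DHClusterCoef2  DH_cluster2000_coef.txt
DHClusterPredictor1 DH_cluster1000_predictor.txt
DHClusterPredictor2 DH_cluster2000_predictor.txt
"""

print(missing_cluster_keys(parse_par_file(two_level_par)))
# ['DHCluterNum3', 'DHClusterCoef3', 'DHClusterPredictor3']
```

To check a real parameter file, read it with `open(path).read()`, pass the text to `parse_par_file`, and then run both `missing_cluster_keys` and `missing_files` on the result. Relative paths are resolved against the current working directory, so run the check from the same directory as the build.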

ShanSabri commented 4 years ago

Yes, the model was built properly using 3 cluster levels. I'm curious to know why three cluster levels are required. How exactly does BIRD use this information?

WeiqiangZhou commented 4 years ago

BIRD is designed to combine DH cluster-level prediction with locus-level prediction, based on our observation that DH cluster activities are easy to predict. If you would like to know the details of how BIRD combines these predictions, you can check out the Methods section "Model aggregation" in our first BIRD paper: Zhou W, Sherwood B, Ji Z, Xue Y, Du F, Bai J, Ying M, Ji H. Genome-wide Prediction of DNase I Hypersensitivity Using Gene Expression. Nature Communications 8, 1038 (2017). The three cluster levels are an initial setting in the BIRD model that represents the typical levels of clusters for the DH data. This setting is hard-coded in the current version of the build model function. I will update the code to make this more flexible in a future version. Thanks for pointing this out.

ShanSabri commented 4 years ago

Great, thank you for the clarification.