michael-kotliar opened this issue 4 years ago
This `avg.bigwig` file can be downloaded from this folder: DNase-seq
Basically, it is the average signal from all the cell types under consideration. In our case, it is the average from 13 cell types from the above directory.
If you want to run Leopard on your own data (e.g. cell types outside the 13 provided cell types), I suggest that you quantile normalize your data to the average first before running predictions.
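For intuition, here is a minimal rank-based sketch of what quantile normalization to a reference does (`quantile_normalize_to_ref` and the numbers are made up for illustration, not Leopard code; the `anchor` function shared later in this thread additionally handles ties and out-of-range values):

```python
import numpy as np

def quantile_normalize_to_ref(x, ref):
    # hypothetical helper: replace each value in x with the reference
    # value at the same rank (classic quantile normalization)
    order = np.argsort(x)
    out = np.empty_like(x, dtype=float)
    out[order] = np.sort(ref)
    return out

x = np.array([5.0, 1.0, 3.0])       # your signal
ref = np.array([10.0, 30.0, 20.0])  # reference signal (e.g. from avg.bigwig)
print(quantile_normalize_to_ref(x, ref))  # [30. 10. 20.]
```

After normalization the values of `x` follow the reference distribution while preserving their ranks: the largest input value (5.0) maps to the largest reference value (30.0), and so on.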
In your case for DNase-seq data, I think you only need to download the `avg.bigwig` and use your own DNase-seq data of interest.
The DNA sequence files are always needed.
The ChIP-seq data are optional. You only need them if you want to re-train/adapt our models or compare predictions with experimental observations.
Thanks for these comments! I've updated the GitHub README file to clarify this.
Thank you for the detailed answer. What is the best way to quantile normalize my bigWig? Are there any tools available for that? I use `bedtools genomecov` to convert my BAM files to bigWigs with a scaling factor equal to 1000000/mapped_reads_number. Do I still need to do quantile normalization after that?
Yes, I think you still need to quantile normalize the data. I have some in-house quantile normalization code for your reference. Specifically, there are two steps: (i) subsample a subset of signals from both your input and the reference (avg.bigwig is the reference in this case). (ii) adapt and run the following function:
```python
import numpy as np

def anchor(input, sample, ref):  # input: 1d array
    # note: sample and ref are sorted in place to build the quantile mapping
    sample.sort()
    ref.sort()
    # 0. create the mapping function
    index = np.array(np.where(np.diff(sample) != 0)) + 1
    index = index.flatten()
    x = np.concatenate((np.zeros(1), sample[index]))  # domain
    y = np.zeros(len(x))  # codomain
    for i in np.arange(0, len(index) - 1, 1):
        start = index[i]
        end = index[i + 1]
        y[i + 1] = np.mean(ref[start:end])
    # handle the last segment of the reference
    i += 1
    start = index[i]
    end = len(ref)
    y[i + 1] = np.mean(ref[start:end])
    # 1. interpolate within the observed range
    output = np.interp(input, x, y)
    # 2. extrapolate beyond the observed range with a linear fit
    degree = 1  # degree of the fitting polynomial
    num = 10  # number of positions used for extrapolation
    f1 = np.poly1d(np.polyfit(sample[-num:], ref[-num:], degree))
    f2 = np.poly1d(np.polyfit(sample[:num], ref[:num], degree))
    output[input > sample[-1]] = f1(input[input > sample[-1]])
    output[input < sample[0]] = f2(input[input < sample[0]])
    return output
```
Here the `input` is the original signal of e.g. an entire chromosome. The `sample` and `ref` are the subsampled subsets used to estimate the overall distribution of genome-wide signals.
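A minimal sketch of the subsampling in step (i), with synthetic NumPy arrays standing in for real bigWig values (the array sizes, distributions, and variable names are illustrative assumptions, not Leopard defaults):

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical stand-ins for genome-wide signal tracks; in practice these
# would be values read from your bigWig and from the reference avg.bigwig
input_signal = rng.gamma(2.0, 1.0, size=1_000_000)
ref_signal = rng.gamma(2.0, 2.0, size=1_000_000)

# step (i): subsample the same number of positions from each track
n = 100_000
sample = input_signal[rng.choice(input_signal.size, n, replace=False)]
ref = ref_signal[rng.choice(ref_signal.size, n, replace=False)]

# step (ii) would then be: anchor(input_signal, sample, ref)
print(sample.shape, ref.shape)
```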
Hope this code is helpful for now. I will upgrade Leopard and add this function for quantile normalization when I have time later.
Cool, thanks, I'll try to use the function you provided
Hi,
May I ask how you generated the avg.bigwig file initially? Was there any preprocessing of the cell line bigwig files before computing the average?
Thanks.
I'm currently working on the data preprocessing part, and I'll update the GitHub repo next week with some new functions for quantile normalization and for creating the reference. Thank you for your feedback!
Hi @michael-kotliar @zj-liu ,
I've added the code for quantile normalization and for calculating the average. The quantile normalization is always necessary, since DNase-seq data from different cell lines can differ substantially due to e.g. sequencing biases and read depth. But the average is quite robust, and in general you don't need to re-calculate it; you can directly download the "avg.bigwig" from our website. The detailed instructions can be found here: quantile normalization. Let me know if you have any questions, suggestions, or find any bugs. Thanks again!
Best, Hongyang
Thanks!
Hello, if I want to run Leopard with my own data, what should I place in `Leopard/data/`? Also, I couldn't find anything about the `avg.bigwig` file, which is hardcoded in `predict.py`. Thanks