GuanLab / Leopard


Not clear what data is mandatory and what can be deleted #3

Open michael-kotliar opened 4 years ago

michael-kotliar commented 4 years ago

Hello, if I want to run Leopard with my own data, what should I place in Leopard/data/? Also, I couldn't find anything about the avg.bigwig file, which is hardcoded in predict.py.

Thanks

Hongyang449 commented 4 years ago

The avg.bigwig file can be downloaded from the DNase-seq folder. Basically, it is the average signal from all the cell types under consideration; in our case, it is the average over the 13 cell types in that directory. If you want to run Leopard on your own data (e.g. cell types outside the 13 provided ones), I suggest quantile normalizing your data to this average before running predictions.
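
For context, the average itself is conceptually just the per-base mean of the individual DNase-seq tracks. A rough pyBigWig sketch of that idea (not the actual code we used; the file names and the window below are placeholders):

import numpy as np
import pyBigWig

# hypothetical per-cell-type DNase-seq tracks; the real avg.bigwig is the
# average over the 13 cell types in the DNase-seq folder
paths = ["cellA.dnase.bigwig", "cellB.dnase.bigwig", "cellC.dnase.bigwig"]

chrom, start, end = "chr1", 0, 1000000 # example window only

# collect per-base signal from each cell type, then take the mean
signals = []
for p in paths:
    bw = pyBigWig.open(p)
    signals.append(np.nan_to_num(np.array(bw.values(chrom, start, end))))
    bw.close()

avg_signal = np.mean(signals, axis=0) # per-base average across cell types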

Hongyang449 commented 4 years ago

In your case, for DNase-seq data, I think you only need to download avg.bigwig and use your own DNase-seq data of interest. The DNA sequence files are always needed. The ChIP-seq data are optional; you only need them if you want to re-train/adapt our models or compare predictions with experimental observations.

Hongyang449 commented 4 years ago

Thanks for these comments! I've updated the GitHub README to clarify this.

michael-kotliar commented 4 years ago

Thank you for the detailed answer. What is the best way to quantile normalize my bigWig? Are there any tools available for that? I use bedtools genomecov to convert my BAM files to bigWigs with a scaling factor equal to 1000000/mapped_reads_number. Do I still need quantile normalization after that?

Hongyang449 commented 4 years ago

Yes, I think you still need to quantile normalize the data. I have some in-house quantile normalization code for your reference. Specifically, there are two steps: (i) subsample a subset of signals from both your input and the reference (avg.bigwig is the reference in this case). (ii) adapt and run the following function:

import numpy as np

def anchor(input, sample, ref): # input: 1d array of signal to normalize
    # note: sample and ref are sorted in place
    sample.sort()
    ref.sort()
    # 0. create the mapping function from sample values to reference values
    index=np.array(np.where(np.diff(sample)!=0))+1
    index=index.flatten()
    x=np.concatenate((np.zeros(1),sample[index])) # domain
    y=np.zeros(len(x)) # codomain
    for i in np.arange(0,len(index)-1,1):
        start=index[i]
        end=index[i+1]
        y[i+1]=np.mean(ref[start:end])
    # the last block of tied values runs to the end of ref
    i+=1
    start=index[i]
    end=len(ref)
    y[i+1]=np.mean(ref[start:end])
    # 1. interpolate within the observed range of sample
    output=np.interp(input, x, y)
    # 2. extrapolate linearly beyond the observed range
    degree=1 # degree of the fitting polynomial
    num=10 # number of positions used for extrapolation
    f1=np.poly1d(np.polyfit(sample[-num:],ref[-num:],degree))
    f2=np.poly1d(np.polyfit(sample[:num],ref[:num],degree))
    output[input>sample[-1]]=f1(input[input>sample[-1]])
    output[input<sample[0]]=f2(input[input<sample[0]])
    return output

Here input is the original signal of, e.g., an entire chromosome. sample and ref are the subsampled subsets used to estimate the overall distributions of the genome-wide signals. Hope this code is helpful for now; I will upgrade Leopard and add this function for quantile normalization when I have time.
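
For example, a rough sketch of how the function could be called with pyBigWig (not part of Leopard; the file names, chromosome, and subsample size below are placeholders):

import numpy as np
import pyBigWig

# placeholders - point these at your own DNase-seq track and the reference
bw_in = pyBigWig.open("my_dnase.bigwig")
bw_ref = pyBigWig.open("avg.bigwig")

chrom = "chr21" # example chromosome
length = bw_in.chroms(chrom)

# full per-base signal for the chromosome (this is "input")
signal = np.nan_to_num(np.array(bw_in.values(chrom, 0, length)))
ref_full = np.nan_to_num(np.array(bw_ref.values(chrom, 0, length)))

# subsample the same positions from both tracks to estimate the
# overall distributions ("sample" and "ref"); the size is arbitrary here
rng = np.random.default_rng(0)
idx = rng.choice(length, size=100000, replace=False)
sample = signal[idx].copy()
ref = ref_full[idx].copy()

normalized = anchor(signal, sample, ref) # quantile-normalized signal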

michael-kotliar commented 4 years ago

Cool, thanks, I'll try to use the function you provided

zj-liu commented 4 years ago

Hi,

May I ask how you generated the avg.bigwig file initially? Was there any preprocessing of the cell line bigwig files before computing the average?

Thanks.

Hongyang449 commented 4 years ago

I'm currently working on the data preprocessing part, and I'll update the GitHub repo next week with some new functions for quantile normalization and for creating the reference. Thank you for your feedback!

Hongyang449 commented 4 years ago

Hi @michael-kotliar @zj-liu ,

I've added the code for quantile normalization and for calculating the average. Quantile normalization is always necessary, since DNase-seq data from different cell lines can differ a lot due to, e.g., sequencing biases and read depth. The average, however, is quite robust, and in general you don't need to re-calculate it; you can directly download avg.bigwig from our website. The detailed instructions can be found here: quantile normalization. Let me know if you have any questions or suggestions, or if you find any bugs. Thanks again!
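
If it helps, once you have the normalized array for a chromosome, one way to write it back out as a bigwig is with pyBigWig (just a minimal sketch, not the code in the repo; I'm assuming here that you want a normalized bigwig per cell type as input, and the file name, chromosome, and length are placeholders):

import numpy as np
import pyBigWig

chrom, length = "chr21", 1000000 # placeholder chromosome and length
normalized = np.random.rand(length) # stand-in for the anchor() output

# write the per-base normalized signal back out as a bigwig
out = pyBigWig.open("my_dnase.norm.bigwig", "w")
out.addHeader([(chrom, length)])
out.addEntries(chrom, 0, values=normalized.astype(np.float64).tolist(), span=1, step=1)
out.close()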

Best, Hongyang

michael-kotliar commented 4 years ago

Thanks!