WeiqiangZhou / BIRD

Big data Regression for predicting DNase I hypersensitivity

differential DHSs #10

Closed gianfilippo closed 3 years ago

gianfilippo commented 3 years ago

Hi,

Could you please suggest how to do a differential DHS analysis?

I can see that the output is a matrix of DHS levels, gene by sample. I understand the "level" is normalized with a default maximum.

I processed case vs. control samples. Should I test each 200 bp region for a difference in means, say, using a t-test?

Tools that process experimental DHS data seem to rely on read counts, which are not available here.

Thanks Gianfilippo

WeiqiangZhou commented 3 years ago

You could try the limma R package, which was originally designed for differential testing of log-transformed gene expression data. It should also work for the prediction output from BIRD.
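For reference, the per-region comparison asked about above can be sketched in plain Python. This is not limma, which additionally moderates the per-region variances. It is an ordinary Welch t-statistic, and the region names and values are made up for illustration.

```python
import math
from statistics import mean, variance


def welch_t(case, control):
    """Return (mean difference, Welch t statistic, approximate degrees of freedom)."""
    n1, n2 = len(case), len(control)
    # Squared standard error of each group mean.
    v1 = variance(case) / n1
    v2 = variance(control) / n2
    diff = mean(case) - mean(control)
    se2 = v1 + v2
    if se2 == 0:
        # No variation in either group: the statistic is undefined.
        return diff, float("nan"), float("nan")
    t = diff / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return diff, t, df


# Hypothetical predicted levels: region -> (case samples, control samples).
regions = {
    "region_1": ([3.1, 3.3, 3.2], [2.0, 2.2, 2.1]),
    "region_2": ([1.0, 1.4, 1.2], [1.1, 1.3, 1.2]),
}

for name, (case, control) in regions.items():
    diff, t, df = welch_t(case, control)
    print(f"{name}: diff={diff:.2f} t={t:.2f} df={df:.1f}")
```

With only a few samples per group, the per-region variance estimates are noisy, which is the situation limma's moderated statistics are meant to help with.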

gianfilippo commented 3 years ago

Hi, OK, I will give it a try. Meanwhile, I found something a bit unexpected. I have RNA-seq from K562 controls and after GATA1 knockdown. The resulting predicted DHS profiles show a correlation coefficient of 0.995 and up. Is this expected? Could you please comment on it?

WeiqiangZhou commented 3 years ago

Is the RNA-seq data also highly correlated? One possibility is that the difference between the control and knockdown samples is very small and the training data of the BIRD model didn't capture such a difference. I would suggest you take a look at the differential peaks and see if the prediction contains useful information.

gianfilippo commented 3 years ago

Hi,

I am looking at differential peaks. I normalized the DHS levels as log(DHS + 1) and tested for differences.

One concern I have is how to filter out bins with a low DHS level. The raw levels range from 0 to ~9.3. What would you consider a low DHS level?

Thanks Gianfilippo

WeiqiangZhou commented 3 years ago

Have you explored the distribution of the DHS levels? I would try using 2 or the average value from all loci as the cutoff. By the way, the prediction output is supposed to be on the log scale, so you don't have to log-transform it.
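A minimal sketch of the two cutoffs suggested here, assuming the predictions are held as a mapping from region to per-sample values. Comparing each region's mean across samples against the cutoff is an assumption of this sketch (one could equally use the maximum), and the region names and numbers are hypothetical.

```python
from statistics import mean


def filter_regions(levels, cutoff=None):
    """Keep regions whose mean predicted level across samples exceeds the cutoff.

    levels: dict mapping region -> list of predicted values (already log scale).
    cutoff: fixed threshold; if None, the average over all loci and samples is used.
    """
    if cutoff is None:
        cutoff = mean(v for values in levels.values() for v in values)
    return {r: values for r, values in levels.items() if mean(values) > cutoff}


# Hypothetical predicted levels for three regions across four samples.
levels = {
    "region_a": [0.1, 0.2, 0.0, 0.1],
    "region_b": [2.5, 2.7, 3.4, 3.6],
    "region_c": [1.9, 2.0, 1.8, 2.1],
}

print(sorted(filter_regions(levels, cutoff=2)))  # fixed cutoff of 2
print(sorted(filter_regions(levels)))            # average of all values as cutoff
```

Because the filter uses the mean over all samples rather than the group labels, it does not look at which samples are cases and which are controls.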

gianfilippo commented 3 years ago

Hi,

I tried a couple of cutoffs to reduce the number of tests. Nevertheless, this does not much improve the number of significant DHS level differences after correcting the p-values. Beyond what is significant, what I see is a mild DHS level ratio between (GATA1) perturbed and control samples, and a high level of correlation between the DHS tracks. Yet the GATA1 perturbed vs. control comparison gives me a good number of differentially expressed genes. This is why I am a bit puzzled.

I am wondering whether I am supposed to filter the input RNA-seq data by removing lowly expressed genes, sample by sample, or whether BIRD handles that.

Thanks Gianfilippo
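The thread does not say which p-value correction was used. As an illustration only, here is a sketch of the Benjamini-Hochberg adjustment, a common choice for this kind of per-region testing, applied to made-up p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four regions.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw p-value is scaled by the number of tests divided by its rank, so dropping low-level regions before testing lowers the multiplier, though it cannot rescue regions whose raw p-values are large to begin with.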

WeiqiangZhou commented 3 years ago

I think this may be caused by the limitation of the training samples. The difference between the perturbed samples and control samples in your dataset is not well represented in the current training dataset. Therefore, the prediction output will be similar and represent the major characteristics of the cell type. BIRD will not exclude the lowly expressed genes since they also contain information about the sample. What's the Spearman correlation between the perturbed samples and the control samples? If that is high, the prediction output from BIRD will be very similar.
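The Spearman correlation asked about here can be computed from two expression vectors as follows. This is a standard-library sketch (rank both vectors with average ranks for ties, then take the Pearson correlation of the ranks), and the expression values are invented for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values for five genes in two samples.
control = [5.0, 1.2, 0.0, 3.3, 7.1]
knockdown = [4.8, 1.0, 0.1, 3.9, 2.0]
print(round(spearman(control, knockdown), 3))
```

Because it works on ranks, this measure is insensitive to monotone transformations such as a log transform of the expression values.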

WeiqiangZhou commented 3 years ago

Have you checked the enriched motifs in the top differential regions? In my experience, there may still be some useful information even if the fold change is mild.

gianfilippo commented 3 years ago

RNA-seq-based correlation levels are also high, but the number of differentially expressed genes I get is high (more than 3000 genes in general and more than 1000 with high FC), suggesting substantial transcriptome perturbation. Perhaps, as you are suggesting, BIRD captures the difference only partially. The MDS using the predicted DHSs is reasonable, in spite of the similarity.

Best Gianfilippo

gianfilippo commented 3 years ago

Not yet, because there are too few differential regions: 3 if I do not filter, 11 with filtering.

WeiqiangZhou commented 3 years ago

In such a case, I would try using a loose cutoff or just use the top differential regions based on FC.
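Ranking by FC could look like the sketch below. Since the predictions are on the log scale, the fold change is taken here as the difference between the group means. The function name, the region names, and the values are all hypothetical.

```python
from statistics import mean


def top_regions_by_fc(case, control, k=100):
    """Return the k regions with the largest absolute log-scale fold change.

    case, control: dicts mapping region -> list of predicted log-scale levels.
    Returns a list of (region, log fold change) pairs, largest |change| first.
    """
    diffs = {r: mean(case[r]) - mean(control[r]) for r in case}
    ranked = sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical predicted levels for three regions, two samples per group.
case = {"r1": [3.0, 3.2], "r2": [1.0, 1.0], "r3": [2.0, 2.4]}
control = {"r1": [2.0, 2.2], "r2": [2.5, 2.5], "r3": [2.1, 2.3]}

for region, lfc in top_regions_by_fc(case, control, k=2):
    print(f"{region}: {lfc:+.2f}")
```

The resulting list of top regions, split by the sign of the change, could then be passed to a motif enrichment tool as suggested earlier in the thread.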