bm2-lab / DeepCRISPR

Apache License 2.0

Encoding for off-target prediction #1

Closed JudoWill closed 5 years ago

JudoWill commented 5 years ago

I'm wondering how we're supposed to encode epigenetic information for passing into the "off-target" version of the pipeline. The model wants both the sgRNA and the target as 8-channel vectors that include epigenetic information. What should we put in those channels for the sgRNA? Should they be zeros? Should they match what is in the target?

I'm trying to run the off-target scoring algorithm and I'm not getting any values that make sense. Even perfect matches between the sgRNA and the target are giving extremely low values.

off_target_model_dir = '../DeepCRISPR/trained_models/offtar_pt_cnn/'
is_reg = False
dcmodel_off = DCModelOfftar(sess, off_target_model_dir, is_reg)
seq = 'GTTAGACCAGATCTGAGCCTTGG'
encoded = encoder([seq])  # one-hot encodes to the correct shape; all epigenetic channels are zero
res = dcmodel_off.offtar_predict(encoded, encoded)
# res = 1.4075727e-35

I've tried all combinations of epi factors and the scores are still very low. Since the sgRNA and target are exact matches I would expect a 100% score, or at least something reasonably high for at least one combination of epi factors.

Am I encoding things wrong? Am I misunderstanding your paper and it is actually low values that are indicative of off-target activity?

I've run this across a large collection of sgRNAs and targets to compare it with the CFD score. [Plot: ds_cfd] Here I've plotted the maximum DeepCRISPR score across all of the different epi conditions for each sgRNA/target pair against the CFD score for the same pair. No DeepCRISPR score is above 1E-7. Something is clearly going wrong.

Can you help me understand what I'm doing wrong here? I am able to get sensible results for these sgRNAs from the on-target pipeline.

JudoWill commented 5 years ago

I also checked the off-target regression model. It shows more variation but correlates negatively with the CFD score, which is also unexpected. [Plot: ds_cfd]

MichaelChuai commented 5 years ago

> what should we put in those channels on the sgRNA? Should they be zeros? Should they match what is in the target?

The epi channels should be exactly the channels we presented, in the same order. Each epi channel value should be 1 or 0: 1 where a signal is detected, otherwise 0. The chromosome coordinates should be the same as those of the sequence.

If something is wrong with the encoding process, the results will not be good.
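
A minimal sketch of a sanity check for the rule above, assuming batches are NumPy arrays shaped [N, 8, 1, 23] as mentioned elsewhere in this thread. The helper name `check_encoding` is hypothetical and is not part of DeepCRISPR; it only confirms that values are binary and that no position has two nucleotides set. It cannot tell whether the epi channels are in the order the model expects.

```python
import numpy as np

N_CHANNELS = 8   # 4 nucleotide channels followed by 4 epigenetic channels
SEQ_LEN = 23     # 20-nt protospacer plus 3-nt PAM


def check_encoding(x):
    """Return a list of problems found in a batch shaped [N, 8, 1, 23]."""
    x = np.asarray(x)
    if x.ndim != 4 or x.shape[1:] != (N_CHANNELS, 1, SEQ_LEN):
        return ["expected shape [N, 8, 1, 23], got %s" % (x.shape,)]
    problems = []
    if not np.isin(x, (0, 1)).all():
        problems.append("values other than 0 and 1 present")
    # Each position should have at most one nucleotide channel set.
    if (x[:, :4].sum(axis=1) > 1).any():
        problems.append("more than one nucleotide set at a position")
    return problems
```

An empty list means the batch passed these basic checks, nothing more.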

JudoWill commented 5 years ago

Hmm, maybe I'm not explaining my question well.

Say we look at the EXM1 sgRNA (which is in your training dataset). It is intended to hit a location on Chr2 but GUIDESeq (and your dataset) show that it has off-target hits across Chr1, 3, 5, 7, 10, 13, 15, and X.

It has an sgRNA sequence of: GAGTCCGAGCAGAACAAGAA(NGG)

I can get the sequence and epi factors for the location of Chr1:

GAGTCCGAGCAGAACAAGAACGG
# encoded
01000001001011011011000 #A
00001100010000100000100 #C
10100010100100000100011 #G
00010000000000000000000 #T
00000000000000000000000
11111111111111111111111
00000000000000000000000
11111111111111111111111

and at Chr3:

GAGTCTGAGCAGAACAAGAATGG
# encoded
01000001001011011011000 #A
00001000010000100000000 #C
10100010100100000100011 #G
00010100000000000000100 #T
11111111111111111111111
00000000000000000000000
00000000000000000000000
11111111111111111111111

So, that's how to encode the x_ot_off_target (after reshaping to [2,8,1,23]) you describe in the README.md. I'm unclear how to encode the x_sg_off_target that pairs with these. The sgRNA is a synthetic construct; it has no epigenetic factors. And how do I encode the NGG PAM part of the sgRNA? Do I leave it blank?
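
For reference, the target-side encoding shown above can be sketched as follows. This is only an illustration under stated assumptions: `encode_site` and `to_batch` are hypothetical helper names, the nucleotide row order A, C, G, T is taken from the examples above, the epi rows are passed in whatever order the caller supplies, and leaving `N` as an all-zero column is a guess rather than documented behaviour.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # row order used in the examples above


def encode_site(seq, epi_rows):
    """Encode one site as an 8 x len(seq) array: 4 one-hot rows, then 4 epi rows.

    epi_rows is a list of four strings of '0'/'1', one character per position.
    Their order must match the channel order the model was trained with,
    which this sketch does not know or check.
    """
    if len(epi_rows) != 4 or any(len(r) != len(seq) for r in epi_rows):
        raise ValueError("need four epi rows, each as long as the sequence")
    x = np.zeros((8, len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in NUCLEOTIDES:  # 'N' and other symbols stay all-zero (assumption)
            x[NUCLEOTIDES.index(base), pos] = 1.0
    for row, bits in enumerate(epi_rows):
        x[4 + row] = [float(b) for b in bits]
    return x


def to_batch(sites):
    """Stack encoded sites and reshape to [N, 8, 1, L]."""
    return np.stack(sites)[:, :, np.newaxis, :]


# The Chr1 and Chr3 sites from this comment.
chr1 = encode_site("GAGTCCGAGCAGAACAAGAACGG", ["0" * 23, "1" * 23, "0" * 23, "1" * 23])
chr3 = encode_site("GAGTCTGAGCAGAACAAGAATGG", ["1" * 23, "0" * 23, "0" * 23, "1" * 23])
x_ot = to_batch([chr1, chr3])
print(x_ot.shape)  # (2, 8, 1, 23)
```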

So, would I encode the chr1 pair like:

GAGTCCGAGCAGAACAAGAANGG
# encoded
01000001001011011011000 #A
00001100010000100000000 #C
10100010100100000100011 #G
00010000000000000000000 #T
00000000000000000000000
00000000000000000000000
00000000000000000000000
00000000000000000000000

That gives me a score of 0 using the offtar_pt_cnn model. Or do I use the epi data from the chr1 target location, along with the C in the PAM from the target?

GAGTCCGAGCAGAACAAGAACGG
# encoded
01000001001011011011000 #A
00001100010000100000100 #C
10100010100100000100011 #G
00010000000000000000000 #T
00000000000000000000000
11111111111111111111111
00000000000000000000000
11111111111111111111111

That also gives me a score of 0.

I have programmatically tried all possible combinations of epi factors in both the x_ot_off_target and x_sg_off_target arrays and I can't get a value that isn't 0. These are exact matches; I don't see why they aren't getting higher scores. I've even tried introducing some mismatches to see if the model just performed poorly on exact hits, but those also gave predictions of 0.

MichaelChuai commented 5 years ago

OK, I see the problem.

I suggest scanning the whole genome to find potential off-target sites that resemble the provided sequence. That way, you can use their coordinates to get the epi features from ENCODE.

If you don't provide the epigenetic features for both the on-target gRNA and the off-target gRNA, the model cannot function properly.

JudoWill commented 5 years ago

No, that's not my question. I understand how to find the target sites; that's how I got these. How do I encode the sgRNA portion of the pair? By "on-target gRNA" do you mean the epi factors at the intended target site? The chr2 location in my example?

Even with that, why do none of the encoded sequences I describe give any positive results? I've encoded all possible epi factors and NONE give a value greater than zero. One of those (by definition) must be the "right" encoding.

Can you make a gist that encodes a pair from your dataset and provides a positive result? Maybe by seeing a working example I can understand what you mean.

MichaelChuai commented 5 years ago

Well, positive off-target sites are usually far fewer than negative sites. Also, the full combination of epi features is too large to enumerate (2^92).

Still, I suggest using sequences that actually occur in a genome.

JudoWill commented 5 years ago

> Well, positive off-target sites are usually far fewer than negative sites.

But they're the most important. I also know (from wet-lab data) that transfecting with this sgRNA truly induces mutations at these locations with those epi factors. That's why I was questioning the negative result.

> Also, the full combination of epi features is too large to enumerate (2^92).

I know the true epi factors at these locations. I just couldn't know whether DNase hypersensitivity was coded as a 1 or 0, whether RRBS peaks were coded as 1 or 0, etc. That is not described anywhere. So it is only 16 combinations. I was able to figure this out from the training data.
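
To illustrate why that is only 16 combinations: with four epi channels and an unknown 1-or-0 convention per channel, there are 2^4 ways to flip whole channels. The sketch below is one way such candidates could be enumerated; it is not the code actually used in this thread, and `polarity_variants` is a hypothetical name.

```python
from itertools import product


def polarity_variants(epi_rows):
    """Yield (mask, rows) for all 2**4 = 16 ways of flipping whole epi channels.

    epi_rows: four equal-length strings of '0'/'1'. A mask bit of 1 means that
    channel is inverted, i.e. "signal present" is written as 0 instead of 1.
    """
    flip = str.maketrans("01", "10")
    for mask in product((0, 1), repeat=len(epi_rows)):
        rows = [r.translate(flip) if m else r for r, m in zip(epi_rows, mask)]
        yield mask, rows


# Epi rows of the Chr1 site as written earlier in the thread.
observed = ["0" * 23, "1" * 23, "0" * 23, "1" * 23]
variants = list(polarity_variants(observed))
print(len(variants))  # 16
```

Each variant could then be encoded and scored, keeping the mask that reproduces known labels from the training data.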

You're still not understanding my question ... but I was able to figure it out looking at your training data. I'm assuming this is a terminology barrier between biology and comp-sci. Here's how my computational biologist brain thinks about it. There aren't "on-target sgRNAs" and "off-target sgRNAs". We make an sgRNA. The sgRNA itself is synthetic, it has no epigenetic factors. It is designed to cleave a specific genomic location, which has epigenetic factors (what your on-target model uses). It may also bind/cleave to other off-target sites on the genome, which have their own epi factors.

So, the sgRNA off-target prediction problem is a one-to-many relationship. In your offtar classification dataset you only have 30 unique sgRNA sequences across 153233 lines. I wanted to know, for a specific sgRNA paired with (potentially many) different targets, whether the sgRNA epi factors (the x_sg_off_target variable) changed as the genomic target (the x_ot_off_target variable) changed, or whether they remained the same.

Looking through the classification data in the repo (hek293t.epiotrt), there are 132914 lines but only 18 unique sequences in the first sequence column. Every instance of each of the 18 sequences has the same epi factors in the x_sg_off_target columns. Only the x_ot_off_target columns change. There are some duplicates in the x_ot_off_target columns, but I assume that is because different off-target sites may have the same sequence and epi factors. Since you don't include the mapping locations, it is hard for me to confirm. From the paper and the description on the tool's website, it looks like you take the epi factors from the intended target of the sgRNA, reuse them in the x_sg_off_target columns for every potential off-target site, and vary only x_ot_off_target with the epi factors at each of those locations.
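
The pairing inferred above can be sketched like this, assuming each site has already been encoded as an 8 x 23 array. `build_offtarget_batch` is a hypothetical helper, not part of DeepCRISPR, and the layout reflects my reading of the training data rather than anything the authors confirmed.

```python
import numpy as np


def build_offtarget_batch(on_target, off_targets):
    """Pair one on-target encoding with many off-target encodings.

    on_target:   array [8, 23] for the intended site (sequence plus its epi rows).
    off_targets: list of arrays [8, 23], one per candidate off-target site.
    Returns (x_sg, x_ot), both shaped [N, 8, 1, 23]. x_sg repeats the same
    on-target encoding in every row; only x_ot varies between rows.
    """
    on_target = np.asarray(on_target)
    x_ot = np.stack(off_targets)[:, :, np.newaxis, :]
    x_sg = np.repeat(on_target[np.newaxis, :, np.newaxis, :], len(off_targets), axis=0)
    return x_sg, x_ot
```

The two arrays would then be passed as the sgRNA and target inputs of the off-target model, in the same way as the `offtar_predict(encoded, encoded)` call in the first comment.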

I still get about a 60% false negative rate (ROC: 0.57) in my own dataset but now that I understand your encoding I can fine-tune the model using my own data. Since I'm working in a non-human system a little adjustment is expected.
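
For anyone wanting to compute comparable figures on their own data, here is a rough sketch of both metrics in plain Python. It assumes the ROC figure quoted above is an area under the curve and that a score cutoff of 0.5 defines a negative call; neither is stated in this thread. The labels and scores below are made-up toy values, not results from DeepCRISPR.

```python
def false_negative_rate(labels, scores, threshold=0.5):
    """Fraction of true positives (label 1) scored below the threshold."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    if not positives:
        raise ValueError("no positive examples")
    return sum(s < threshold for s in positives) / len(positives)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(P * N), which is fine
    for small validation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.3, 0.1]
fnr = false_negative_rate(labels, scores)
auc = roc_auc(labels, scores)
```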