YichaoOU / easy_prime

prime editor gRNA design tool based on ensemble learning
MIT License
3 stars 2 forks source link

Raw data of PE_Data_collection #3

Open mathinic opened 3 years ago

mathinic commented 3 years ago

Dear Yichao,

Thanks for sharing your repo with that much useful data/scripts! :) Would you by chance also have the raw-data of your the .csv files stored under https://github.com/YichaoOU/easy_prime/tree/master/PE_data_collection? As I understand, these are processed files based on the different datasets but initially there must have been more raw-format data such as original vs edited sequence?

Of course I could also try to dig them out from the original papers (such as Anzalone et al.) but if you already have them in a nice format, it would save a lot of time for me. :)

Best, Nicolas

YichaoOU commented 3 years ago

Hi Nicolas,

No problem!

1. Anzalone 2019.

It took me more than a month to get this original PE data for ML analysis mainly because of the poor description. They only provided raw fastq and all the "sequences". However, it is very difficult to get the target mutation and the observed editing efficiency. Also, sometimes pegRNA-ngRNA are "mis-paired" in their supplementary file.

As we stated in our easy-prime paper, we made a bioinformatics pipeline to identify all this missing information.

You can find the "raw-format" I used (the file before going into the ML analysis): https://github.com/YichaoOU/easy_prime/blob/master/figure_rep/raw_table_XY_for_training.csv

I also have the notebook showing that I can replicate their figures: https://github.com/YichaoOU/easy_prime/blob/master/figure_rep/rep.ipynb

I also uploaded the rawX file for this data: https://github.com/YichaoOU/easy_prime/blob/master/figure_rep/rawX_Y.csv

rawX format specification is provided in the main README.

2. DeepPE and PrimeDesign data

They have provided well-annotated data in their supplementary tables. For your convenience, I uploaded their raw format in the PE_data_collection folder. Please see the README file in that folder.

If you used these processed "raw format" in your projects, we kindly request you to cite our paper even if you didn't directly use easy-prime.

Thanks, Yichao

mathinic commented 3 years ago

Hi Yichao

This is exactly what I was looking for, thanks! I especially feared having to go through all the Anzalone fastq data myself, but with the very nicely annotated raw_table_XY_for_training.csv table on hand this is not necessary anymore. :star_struck:

Thanks so much, I'm sure this will also be of help for many others interested in well-annotated prime-editing data.

Best, Nicolas

mathinic commented 3 years ago

One more question: Can you explain what the difference between the columns amplicon_seq, original_seqand reference_seqare in the Anzalone .csv file? Is reference_seq the correct one to use together with target_mutation_correction?

YichaoOU commented 3 years ago

Yes, using reference_amplicon and target_mutation_correction, you can get original sequence and edited sequence.

amplicon_seq and original_seq should be some intermediate sequences. Because their paper didn't provide amplicon sequence, we first mapped fastq to get possible amplicon sequence (original_seq), but as you can see, this will not be accurate. We used the original seq and ran crispresso multiple times, then if it failed or with an "incorrect editting efficiency", then that means the given amplicon sequence is not correct. Some manual tuning here to get amplicon_seq. Lastly, some mistakes were fixed, then we got reference_amplicon.

mathinic commented 2 years ago

Hi Yichao,

I have a follow-up question for this issue.

I'm trying to perform some predictions with your PE2 model (therefore can't use the easy-prime.cc model since it only gives a score for PE3). In the notebook PrimeDesign_sites_featurization-clean.ipynb I could find the code for achieving this, but couldn't get to the same SPROUT values which are in PrimeDesign.20sites.feature_matrix.csv.

Could you explain how you calculated these values (I assume based on https://github.com/amirmohan/SPROUT)?

Example: How did you get a SPROUT value for the first row (ATP7B, delG) of 51.30936? When I run the sequence "CCAGCAATACCTTTTTCTGCGGG" with the SPROUT code, I get the following output which does not include this value:

image

Thanks a lot for your help! :raised_hands:

mathinic commented 2 years ago

When digging a bit deeper, I think I realized that the SPROUT score is not needed for PE2 prediction which solves the problem from above.

However, when I look at Anzalone_2019.feature_matrix.csv, I only see 10 columns for RNAfolding but there are 15 columns based on PrimeDesign_sites_featurization-clean.ipynb. Are the 10 columns in Anzalone_2019.feature_matrix.csv just the first 10 columns from the PrimeDesign Juptyter Notebook, or how can I get the exact input that I need to predict PE2 efficiencies with your PE2 Prediction Jupyter Notebook? :)

YichaoOU commented 2 years ago

Hi Nicolas,

For the SPROUT question, we used the insertion%+deletion% as the SPROUT score. In the original code, there is a deletion frac frac_total_del function but was not printed to the user. You can find my modified code here: https://gist.github.com/YichaoOU/df288a1f80c1ea659678773380a8361b

But you are right, in the final stage, we didn't use SPROUT, mainly because we used a lot of data from the DeepPE paper, and the DeepSpCas9 (this algorithm was developed from the same lab) score is so good for DeepPE.

For the RNAfold question, we initially generated RNA folding disruption score for 16 positions (the 15 columns you were saying is probably 0 to 15, so it is 16). In the model training stage, we only used the first 10 positions, so the exact input to use the PE2 model would be just the first 10 columns.

Let me know if you need anything else.

Thanks, Yichao

mathinic commented 2 years ago

Thanks a lot!

I have one last question regarding the definition of the Target_end_flank in Anzalone_2019.feature_matrix.csv. As I understand this value corresponds to the length of the RTT which comes after the edit, correct?

Somehow I cannot reproduce this for insertion and deletion edits in the Anzalone feature_matrix, because I always get 1 bp higher value (when checking the individual sequences in raw_table_XY_for_training.csv). Below is one example for the CTT insertion which has the value 6 in the feature_matrix but 7 when I check the sequence (image below).

To be consistent with the training data/model, do you recommend to use the length as in the image below or substract 1bp for insertions/deletions (as in the feature_matrix file)?

Thanks & best wishes, Nicolas

image

YichaoOU commented 2 years ago

Target end flank is the number of nucleotides to the end of RTT sequence, not including the target mutation. I guess in this example, it is T to CTT? so that's why it is 6?

mathinic commented 2 years ago

I think actually that in this example ('EDFIG10_HEK293T_HEK3_1CTTINS_PE3') it is a CTT insertion (not a replacement of T to CTT), or am I missing something? This is the corresponding RTT sequence from the raw .csv: GCCATCAAAG

I also have another example with a GTA insertion at position 1 (EDFIG10_HEK293T_RNF2_1GTAINS_PE3, image below) where I count an overhang of 14 but see the value 13 in the feature_matrix.csv. Here, the RTT sequence is: AACGAACACCTCAGTAC

If this is indeed the case for all insertions/deletions, it wouldn't take anything away from the easy-prime model/performance since everything is still consistent when using the usual input via easy-prime.cc website. I'm only asking because I'm preparing my data manually (to get PE2 predictions) and would like to keep it consistent with your training dataset. :)

image

YichaoOU commented 2 years ago

Thanks for the more detailed information!

In all these insertion examples, I think your numbers are correct and easy-prime code has "miscalculated" them. For example, for this CTT insertion example, GCCATCAAAG, I treat GCCATCAaag, this A as the target mutation (reference allele in a vcf format), and the Aaag as the alternative allele. That's why my calculation is 6. But I agree, the correct number is 7.