Open mathinic opened 3 years ago
Hi Nicolas,
No problem!
It took me more than a month to get this original PE data for ML analysis mainly because of the poor description. They only provided raw fastq and all the "sequences". However, it is very difficult to get the target mutation and the observed editing efficiency. Also, sometimes pegRNA-ngRNA are "mis-paired" in their supplementary file.
As we stated in our easy-prime paper, we made a bioinformatics pipeline to identify all this missing information.
You can find the "raw-format" I used (the file before going into the ML analysis): https://github.com/YichaoOU/easy_prime/blob/master/figure_rep/raw_table_XY_for_training.csv
I also have the notebook showing that I can replicate their figures: https://github.com/YichaoOU/easy_prime/blob/master/figure_rep/rep.ipynb
I also uploaded the rawX
file for this data: https://github.com/YichaoOU/easy_prime/blob/master/figure_rep/rawX_Y.csv
rawX
format specification is provided in the main README.
They have provided well-annotated data in their supplementary tables. For your convenience, I uploaded their raw format in the PE_data_collection
folder. Please see the README file in that folder.
If you used these processed "raw format" in your projects, we kindly request you to cite our paper even if you didn't directly use easy-prime.
Thanks, Yichao
Hi Yichao
This is exactly what I was looking for, thanks!
I especially feared having to go through all the Anzalone fastq data myself, but with the very nicely annotated raw_table_XY_for_training.csv
table on hand this is not necessary anymore. :star_struck:
Thanks so much, I'm sure this will also be of help for many others interested in well-annotated prime-editing data.
Best, Nicolas
One more question:
Can you explain what the difference between the columns amplicon_seq
, original_seq
and reference_seq
are in the Anzalone .csv file? Is reference_seq
the correct one to use together with target_mutation_correction
?
Yes, using reference_amplicon
and target_mutation_correction
, you can get original sequence and edited sequence.
amplicon_seq
and original_seq
should be some intermediate sequences. Because their paper didn't provide amplicon sequence, we first mapped fastq to get possible amplicon sequence (original_seq
), but as you can see, this will not be accurate. We used the original seq and ran crispresso
multiple times, then if it failed or with an "incorrect editting efficiency", then that means the given amplicon sequence is not correct. Some manual tuning here to get amplicon_seq
. Lastly, some mistakes were fixed, then we got reference_amplicon
.
Hi Yichao,
I have a follow-up question for this issue.
I'm trying to perform some predictions with your PE2 model (therefore can't use the easy-prime.cc model since it only gives a score for PE3). In the notebook PrimeDesign_sites_featurization-clean.ipynb I could find the code for achieving this, but couldn't get to the same SPROUT values which are in PrimeDesign.20sites.feature_matrix.csv.
Could you explain how you calculated these values (I assume based on https://github.com/amirmohan/SPROUT)?
Example: How did you get a SPROUT value for the first row (ATP7B, delG) of 51.30936? When I run the sequence "CCAGCAATACCTTTTTCTGCGGG" with the SPROUT code, I get the following output which does not include this value:
Thanks a lot for your help! :raised_hands:
When digging a bit deeper, I think I realized that the SPROUT score is not needed for PE2 prediction which solves the problem from above.
However, when I look at Anzalone_2019.feature_matrix.csv, I only see 10 columns for RNAfolding but there are 15 columns based on PrimeDesign_sites_featurization-clean.ipynb. Are the 10 columns in Anzalone_2019.feature_matrix.csv just the first 10 columns from the PrimeDesign Juptyter Notebook, or how can I get the exact input that I need to predict PE2 efficiencies with your PE2 Prediction Jupyter Notebook? :)
Hi Nicolas,
For the SPROUT question, we used the insertion%+deletion% as the SPROUT score. In the original code, there is a deletion frac frac_total_del
function but was not printed to the user. You can find my modified code here: https://gist.github.com/YichaoOU/df288a1f80c1ea659678773380a8361b
But you are right, in the final stage, we didn't use SPROUT, mainly because we used a lot of data from the DeepPE paper, and the DeepSpCas9 (this algorithm was developed from the same lab) score is so good for DeepPE.
For the RNAfold question, we initially generated RNA folding disruption score
for 16 positions (the 15 columns
you were saying is probably 0 to 15, so it is 16). In the model training stage, we only used the first 10 positions, so the exact input to use the PE2 model
would be just the first 10 columns.
Let me know if you need anything else.
Thanks, Yichao
Thanks a lot!
I have one last question regarding the definition of the Target_end_flank in Anzalone_2019.feature_matrix.csv. As I understand this value corresponds to the length of the RTT which comes after the edit, correct?
Somehow I cannot reproduce this for insertion and deletion edits in the Anzalone feature_matrix, because I always get 1 bp higher value (when checking the individual sequences in raw_table_XY_for_training.csv). Below is one example for the CTT insertion which has the value 6 in the feature_matrix but 7 when I check the sequence (image below).
To be consistent with the training data/model, do you recommend to use the length as in the image below or substract 1bp for insertions/deletions (as in the feature_matrix file)?
Thanks & best wishes, Nicolas
Target end flank is the number of nucleotides to the end of RTT sequence, not including the target mutation. I guess in this example, it is T to CTT? so that's why it is 6?
I think actually that in this example ('EDFIG10_HEK293T_HEK3_1CTTINS_PE3') it is a CTT insertion (not a replacement of T to CTT), or am I missing something? This is the corresponding RTT sequence from the raw .csv: GCCATCAAAG
I also have another example with a GTA insertion at position 1 (EDFIG10_HEK293T_RNF2_1GTAINS_PE3, image below) where I count an overhang of 14 but see the value 13 in the feature_matrix.csv. Here, the RTT sequence is: AACGAACACCTCAGTAC
If this is indeed the case for all insertions/deletions, it wouldn't take anything away from the easy-prime model/performance since everything is still consistent when using the usual input via easy-prime.cc website. I'm only asking because I'm preparing my data manually (to get PE2 predictions) and would like to keep it consistent with your training dataset. :)
Thanks for the more detailed information!
In all these insertion examples, I think your numbers are correct and easy-prime code has "miscalculated" them. For example, for this CTT insertion example, GCCATCAAAG
, I treat GCCATCA
aag, this A
as the target mutation (reference allele in a vcf format), and the Aaag
as the alternative allele. That's why my calculation is 6. But I agree, the correct number is 7.
Dear Yichao,
Thanks for sharing your repo with that much useful data/scripts! :) Would you by chance also have the raw-data of your the .csv files stored under
https://github.com/YichaoOU/easy_prime/tree/master/PE_data_collection
? As I understand, these are processed files based on the different datasets but initially there must have been more raw-format data such asoriginal vs edited sequence
?Of course I could also try to dig them out from the original papers (such as Anzalone et al.) but if you already have them in a nice format, it would save a lot of time for me. :)
Best, Nicolas