Upon reviewing the model testing code in the repository, I've noticed that the output files seem to be named with the suffixes wt_prediction.npy and mt_prediction.npy. However, I'm currently trying to understand whether it is possible to extract information about specific variants from these files.
From what I can observe, there appears to be no coordinate information within the files, which leaves me uncertain about how to associate the predictions with specific genetic variants. Could you please provide some insight on how to relate the contents of these .npy files to particular variants? Is there an additional step or a method within the codebase that I might have overlooked for this purpose?
Thank you for your time and help.
Thank you for your interest in our work. I hope the following example makes the pipeline clearer.
Step 1. You have a list of variants, stored in Variant Call Format (.vcf) or Browser Extensible Data (.bed) files. For each variant, you need to extract the following information: the chromosome, the position (coordinate), the reference allele (WT), and the alternative allele (MT). Four numpy arrays are then generated to store this information as [chr_1, chr_2, ..., chr_i, ...], [pos_1, pos_2, ..., pos_i, ...], [ref_1, ref_2, ..., ref_i, ...], [alt_1, alt_2, ..., alt_i, ...], where the subscript i stands for the ith variant you have. These are the 4 variant-related files needed by "prepare_seq.py". Please note that you may need to shift the coordinates by 1 bp, especially for some indels. To confirm that the extracted REF sequence and the variant's REF are the same, I have added the check "if(not np.array_equal(chr_seq[coor_list[eqtl_i] + bp_i, :], embedding_dict[wt_list[eqtl_i][bp_i]][0]))" in "prepare_seq.py". You will get a "sequence alteration wrong" warning if there is a mismatch. A minimal sketch of building those arrays is shown below.
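For illustration only, something like the following could assemble the 4 arrays from a simple vcf; the input/output file names, the column layout, and the 1-bp shift are placeholders, not part of the repository, so adapt them to your own data:

```python
import numpy as np

chroms, positions, refs, alts = [], [], [], []
with open("variants.vcf") as vcf:              # hypothetical input file
    for line in vcf:
        if line.startswith("#"):               # skip VCF header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chroms.append(fields[0])               # CHROM
        positions.append(int(fields[1]) - 1)   # POS; VCF is 1-based, shift if needed
        refs.append(fields[3])                 # REF (WT)
        alts.append(fields[4])                 # ALT (MT)

# illustrative file names for the 4 variant-related arrays
np.save("chr_list.npy", np.array(chroms))
np.save("coor_list.npy", np.array(positions))
np.save("wt_list.npy", np.array(refs))
np.save("mt_list.npy", np.array(alts))
```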
Step 2. You have collected the variant information and you have the preprocessed 3D chromatin structure, e.g. the one provided on our zenodo. By running 'prepare_seq.py', you get all the input files needed by the model: "wt_seq", "mt_seq" and "structure_matching", which tells the model which node to look at (i.e. where the mutation is located). Please note that the order of the numpy arrays does not change: the ith element in "wt_seq", "mt_seq" and "structure_matching" represents the ith variant in your original input, e.g. your vcf file. A quick sanity check of this alignment is sketched below.
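As a rough sanity check (file names here are assumptions, adapt them to the actual outputs of your run), you can load the three outputs and confirm that row i of each one corresponds to the ith variant in your original vcf/bed:

```python
import numpy as np

wt_seq = np.load("wt_seq.npy")
mt_seq = np.load("mt_seq.npy")
structure_matching = np.load("structure_matching.npy")

i = 0  # index of the first variant in your input list
print(wt_seq.shape, mt_seq.shape, structure_matching.shape)
print(structure_matching[i])  # which structure node the ith mutation falls in
```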
Step 3. You feed the variant information into a model and get "wt_prediction" and "mt_prediction". The [i, j] element in those 2 arrays is the prediction for the ith variant and the jth epigenetic event, for the Ref and Alt allele respectively. You can find the names of the epigenetic events in our supplementary table "All models testset AUROC and AUPRC.xlsx" on zenodo. An example of loading and indexing the predictions is shown below.
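For example, something along these lines reads the two prediction matrices and matches column j to an event name; the exact layout of the supplementary Excel sheet is an assumption here, so adjust the column lookup as needed:

```python
import numpy as np
import pandas as pd

wt_pred = np.load("wt_prediction.npy")   # shape: (n_variants, n_events)
mt_pred = np.load("mt_prediction.npy")

# event names from the supplementary table (column layout assumed)
events = pd.read_excel("All models testset AUROC and AUPRC.xlsx")

i, j = 0, 5  # ith variant in your vcf, jth epigenetic event
print(f"variant {i}, event '{events.iloc[j, 0]}': "
      f"REF={wt_pred[i, j]:.4f}, ALT={mt_pred[i, j]:.4f}")
```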
Step 4. You can then use the predicted epigenetic profile changes, i.e. the differences between f(wt) and f(mt), for downstream tasks, such as our few-shot learning. A small sketch of computing the change is given below.
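A minimal sketch, again assuming the prediction file names from the previous step; the rows keep the original vcf order:

```python
import numpy as np

wt_pred = np.load("wt_prediction.npy")
mt_pred = np.load("mt_prediction.npy")

# per-variant epigenetic profile change: f(mt) - f(wt)
delta = mt_pred - wt_pred                        # shape: (n_variants, n_events)

top_events = np.argsort(-np.abs(delta[0]))[:10]  # 10 most affected events for variant 0
print(top_events, delta[0, top_events])
```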
Please let me know if there is still any confusion or anything else I can help with.
Best wishes.
The step-by-step breakdown has clarified the process significantly, and I now understand how to proceed with the identification and analysis of the variants using your pipeline.
I appreciate the time you took to elucidate each stage, from extracting the necessary variant information to interpreting the predictions for the epigenetic events. The example and specific instructions on handling potential mismatches are especially helpful.
If I encounter any further uncertainties or require additional assistance as I work through the data, I will reach out.
Thank you once again for your support and prompt response.
Hello,
I've successfully executed the code as per the documentation provided in the repository. However, I've encountered an issue with the large number of output files generated, and I'm having difficulty identifying the specific files that contain the epigenetic profile data.
Could you please provide some guidance on where I might find the epigenetic profile? Specifically, is the profile stored in a .npy file format? If so, could you indicate in which directory I should look for it? For instance, is it within the model_prediction directory or located somewhere else? I appreciate your assistance in navigating the outputs.
Thank you!