FunctionLab / sei-framework

code to run sei and obtain sei and sequence class predictions
Other
86 stars 6 forks source link

From seqclass score to seqclass prediction #17

Open JohnsonStev opened 1 year ago

JohnsonStev commented 1 year ago

Dear developers of sei,

First, thank you for developing this amazing tool. I am doing variant effect prediction with sei and come up with this question. Apart from the seqclass that is being affected the most, I would also want to know what the original seqclass of the reference allele is. If I understand correctly, these two can be different class, right? In order to get the seqclass of reference allele, Igenerated seqclass score for the ref allele via 2_raw_sc_score.sh. However, I don't really understand how to interpret these raw scores. I am hoping that you can help me with this. Thanks.

jzthree commented 1 year ago

Hi,

Thanks! You can get sequence class assignments from bed file here https://zenodo.org/record/7113989. Yes the most affected sequence class does not have to be the assigned sequence class, even though it's usually the same.

For the raw scores, it is explained here:

Raw sequence class scores: For sequences only. Raw sequence class scores are projection scores of chromatin profile predictions projected on the unit-length vectors representing each sequence class. This is an intermediate score originally developed for variant score prediction and is made available for use for developing downstream analyses or applications, such as using them as a sequence representation. Note our manuscript uses the Louvain community clustering, whole-genome sequence class annotation of the human genome whenever we apply sequence classes to reference genome sequences, and we encourage the use of these annotations over the raw sequence class scores when possible. Sequence class annotations for hg38 and hg19 (lifted over from hg38) are available for download from this Zenodo record.

JohnsonStev commented 1 year ago

Thank you Jian for the information. The problem is that some of the variants I am analyzing were on the alternative contigs in the hg38 genome, which were not covered by the annotation bed file.

jzthree commented 1 year ago

Got it. Then an option can be training a classifier model (e.g. with https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, which supports multiclass='multinomial' ) to predict sequence class as assigned in the bed files from the raw sequence class scores for corresponding positions, and then apply the classifiers to the raw scores on your alternative contigs. We may be able to release a classifier to do this task in the future but we don't have an "official" model for now.