FunctionLab / selene

a framework for training sequence-level deep learning networks
BSD 3-Clause Clear License

Variant effect prediction REF mismatch #187

Closed: okurman closed 7 months ago

okurman commented 2 years ago

Dear Selene/Sei developers, thank you for the colossal work you've undertaken.

I have a question regarding the variant effect prediction functionality of the Sei model. I am using the model to calculate variant effects for gnomAD variants, and there seem to be many mismatches between the REF alleles of the gnomAD variants and the GRCh38 assembly FASTA file. In these cases, does Selene use the REF from the given VCF file, or does it use the corresponding nucleotide from the GRCh38 FASTA file?

kathyxchen commented 1 year ago

Hi @okurman, sorry for the late response. In these cases, Selene will use the REF from the given VCF file, but the remainder of the 4096 bp sequence will be retrieved from the FASTA file.
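
To make that concrete, here is a minimal sketch of the idea in plain Python. It is not Selene's actual code: the function name `window_with_vcf_ref`, the choice to place the variant at index `window // 2`, and the `N` padding at chromosome ends are all assumptions made for illustration. Only the 4096 bp window length and the "REF from the VCF, everything else from the FASTA" behavior come from the description above.

```python
def window_with_vcf_ref(chrom_seq: str, pos: int, ref: str, window: int = 4096) -> str:
    """Return a `window`-long sequence taken from the reference chromosome,
    with the bases at the variant position overwritten by the VCF's REF.

    Hypothetical helper for illustration only; centering and padding
    conventions are assumptions, not Selene's implementation.
    """
    start = pos - 1                    # VCF positions are 1-based
    win_start = start - window // 2    # assumed centering convention
    win_end = win_start + window

    # Pad with N wherever the window runs past either end of the chromosome.
    left_pad = "N" * max(0, -win_start)
    right_pad = "N" * max(0, win_end - len(chrom_seq))
    seq = left_pad + chrom_seq[max(0, win_start):min(len(chrom_seq), win_end)] + right_pad

    # Overwrite the reference bases at the variant position with the VCF REF,
    # regardless of what the FASTA contains there.
    offset = start - win_start
    return (seq[:offset] + ref + seq[offset + len(ref):])[:window]


# Toy example: the FASTA has "C" at position 6, but the VCF claims REF = "T".
toy_chrom = "ACGTACGTACGT"
toy_window = window_with_vcf_ref(toy_chrom, pos=6, ref="T", window=8)
```

In the toy example, every base except the one at the variant position comes straight from `toy_chrom`.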

Selene outputs a warning whenever this situation occurs, so it is worth reviewing those warnings if you have a chance. What we often see is that ref and alt are swapped in the VCF file (e.g. alt actually matches the GRCh38 assembly). Could that be what is happening here?
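
If it helps, a quick way to check for swapped alleles outside of Selene is to compare each record against the reference directly. The sketch below is a hypothetical helper (`classify_ref` is not part of Selene), and it assumes you already have the chromosome sequence loaded as a Python string and that `pos` is the 1-based VCF position.

```python
def classify_ref(chrom_seq: str, pos: int, ref: str, alt: str) -> str:
    """Classify a VCF record against the reference sequence.

    Returns "match" if REF agrees with the reference, "swapped" if ALT
    agrees with the reference instead, and "mismatch" otherwise.
    Hypothetical helper for illustration only.
    """
    start = pos - 1  # VCF positions are 1-based
    if chrom_seq[start:start + len(ref)].upper() == ref.upper():
        return "match"
    if chrom_seq[start:start + len(alt)].upper() == alt.upper():
        return "swapped"
    return "mismatch"


# Toy reference with "C" at position 2.
toy_chrom = "ACGTACGT"
results = [
    classify_ref(toy_chrom, 2, "C", "T"),  # REF agrees with the reference
    classify_ref(toy_chrom, 2, "T", "C"),  # ALT agrees with the reference
    classify_ref(toy_chrom, 2, "G", "A"),  # neither allele agrees
]
```

Tallying these labels over the gnomAD records would show whether most of the warnings come from swapped ref/alt or from some other cause, such as a different assembly or coordinate offset.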