FunctionLab / ExPecto

predicting expression effects of human genome variants ab initio from sequence

Sequences < 1,000bp #24

Open imk1 opened 2 years ago

imk1 commented 2 years ago

How does Beluga handle sequences < 1,000bp? Does it center on the input sequence and pad it with N's, or does it do something else? Thanks!

jzthree commented 2 years ago

Beluga requires a 2kb sequence. Padding with Ns is not guaranteed to give meaningful results. If your sequence has flanking sequence in its genomic context, you can add that to both sides.
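One way to act on the flanking-sequence suggestion is to compute a 2kb window centered on the region of interest and fetch that interval from the reference instead of the shorter one. The sketch below is only an illustration: `centered_window` is a hypothetical helper, not part of ExPecto, and it assumes 0-based, half-open coordinates.

```python
def centered_window(start, end, length=2000, chrom_len=None):
    """Return (new_start, new_end): a `length`-bp window centered on [start, end).

    Hypothetical helper. Coordinates are assumed to be 0-based and half-open.
    """
    if end - start > length:
        raise ValueError("region is longer than the requested window")
    if chrom_len is not None and chrom_len < length:
        raise ValueError("chromosome is shorter than the requested window")

    mid = (start + end) // 2
    new_start = mid - length // 2

    # Shift the window back inside the chromosome if it runs off either end.
    if new_start < 0:
        new_start = 0
    if chrom_len is not None and new_start + length > chrom_len:
        new_start = chrom_len - length

    return new_start, new_start + length


# A 1kb region gains 500bp of genomic flank on each side.
print(centered_window(10_000, 11_000))  # (9500, 11500)
```

Near a chromosome end the window is shifted rather than truncated, so the region is no longer exactly centered but the input still contains only real sequence.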

imk1 commented 2 years ago

I ran Beluga (using this site: https://humanbase.flatironinstitute.org/deepsea/) using sequences < 2kb, and Beluga ran to completion. Do you know how Beluga modified the sequences to convert them into 2kb sequences? Thanks!

jzthree commented 2 years ago

Thanks for letting us know. It should actually only allow sequences >2kb - we are looking into this and will update here once it's fixed.

imk1 commented 2 years ago

Thanks in advance for keeping me posted!

imk1 commented 2 years ago

I was wondering if you have an update on this. Thanks!

jzthree commented 2 years ago

Sorry for the late update. Currently, if the input is shorter than 2kb, it will be padded with "N"s. I don't recommend using FASTA input shorter than 2kb unless it is very close to 2kb, say only a few bp off. I would recommend adding any flanking sequence to your sequence of interest. We should update the website's input length instructions (Beluga uses 2000bp, Sei uses 4096bp, and SeqWeaver uses 1000bp).
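Given the input lengths quoted above, a quick pre-submission check can report how many bases a sequence falls short for a given model. This is a minimal sketch: the dictionary and the function `bases_short` are hypothetical names, and only the three lengths come from this thread.

```python
# Input lengths as stated in this thread (in bp).
MODEL_INPUT_BP = {"Beluga": 2000, "Sei": 4096, "SeqWeaver": 1000}


def bases_short(seq, model):
    """Return how many bases `seq` is short of the model's input length (0 if none)."""
    return max(0, MODEL_INPUT_BP[model] - len(seq))


seq = "ACGT" * 250  # a 1,000bp example sequence
for model in MODEL_INPUT_BP:
    print(model, bases_short(seq, model))
```

A nonzero result means the server would have to fill that many positions with Ns, which is the situation the comment above advises against unless the shortfall is only a few bp.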

imk1 commented 2 years ago

Thanks! If I were to input, say, a 1kb sequence into Beluga, would it get padded with 500 Ns on either side, or would the input be the sequence I inputted followed by 1,000 Ns? Thanks!

jzthree commented 2 years ago

It will be padded with 500 Ns on either side. How the Ns will affect the model prediction is largely untested, so this is not recommended (in training, Ns appear only in assembly gaps and are very rare).
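To see what the model receives in this case, the symmetric padding can be reproduced locally. The sketch below is not the server's code: `pad_with_n` is a hypothetical function, and putting the extra N on the right when the shortfall is odd is an assumption, since the thread only confirms the 500/500 split for a 1kb input.

```python
def pad_with_n(seq, target=2000):
    """Center `seq` in a `target`-bp string by padding both sides with N."""
    missing = target - len(seq)
    if missing < 0:
        # How longer inputs are handled is not described in this thread.
        raise ValueError("sequence is longer than the target length")
    left = missing // 2
    right = missing - left  # assumption: any odd extra base goes on the right
    return "N" * left + seq + "N" * right


padded = pad_with_n("ACGT" * 250)          # 1,000bp input
print(len(padded))                         # 2000
print(padded.count("N") / len(padded))     # 0.5
```

Half of the padded input is N here, far more than the rare assembly-gap Ns seen in training, which is why predictions on such inputs should be treated with caution.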

imk1 commented 2 years ago

That makes sense. Thank you!