Open imk1 opened 2 years ago
Beluga requires a 2kb sequence. Padding with Ns is not guaranteed to give meaningful results. If your sequence has any flanking sequence in its genomic context, you can add that to both sides.
I ran Beluga (using this site: https://humanbase.flatironinstitute.org/deepsea/) using sequences < 2kb, and Beluga ran to completion. Do you know how Beluga modified the sequences to convert them into 2kb sequences? Thanks!
Thanks for letting us know. It should actually only allow sequences >2kb. We are looking into this and will update here once it's fixed.
Thanks in advance for keeping me posted!
I was wondering if you have an update on this. Thanks!
Sorry for the late update. Currently, if the input is shorter than 2kb, it will be padded with "N"s. I don't recommend using FASTA input shorter than 2kb unless it is very close to 2kb, say only a few bp off. Instead, I would recommend adding flanking sequence to your sequence of interest. We should update the input length instructions on the website (Beluga uses 2000bp, Sei uses 4096bp, and SeqWeaver uses 1000bp).
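For anyone who wants to extend a short sequence with flanking bases instead of relying on N padding, here is a minimal sketch of the arithmetic. The input lengths come from the comment above; the names `MODEL_INPUT_LENGTH` and `flank_needed` are hypothetical, and putting the extra base on the right when the shortfall is odd is my assumption, not something the server is known to do.

```python
# Input lengths as stated in this thread.
MODEL_INPUT_LENGTH = {"Beluga": 2000, "Sei": 4096, "SeqWeaver": 1000}


def flank_needed(seq_len, model):
    """Return (left, right) flanking bases needed to reach the model's input length.

    Hypothetical helper: splits the shortfall evenly, and gives the extra
    base to the right side when the shortfall is odd (an assumption).
    """
    target = MODEL_INPUT_LENGTH[model]
    missing = max(target - seq_len, 0)  # nothing to add if already long enough
    left = missing // 2
    right = missing - left
    return left, right


if __name__ == "__main__":
    # A 1kb region needs 500bp of genomic context on each side for Beluga.
    print(flank_needed(1000, "Beluga"))
```

With those two numbers you can widen the genomic coordinates of your region before extracting the FASTA, so the model sees real sequence rather than Ns.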
Thanks! If I were to input, say, a 1kb sequence into Beluga, would it get padded with 500 Ns on either side, or would the input be the sequence I inputted followed by 1,000 Ns? Thanks!
It will be padded with 500 Ns on either side. How the Ns affect the model prediction is largely untested, so this is not recommended (in the training data, Ns appear only in assembly gaps and are very rare).
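To make the centered padding described above concrete, here is a rough sketch. This is not the server's actual code: `pad_with_n` and `n_fraction` are hypothetical names, the handling of an odd shortfall (extra N on the right) is an assumption, and sequences at or above the target length are simply returned unchanged because this thread does not say how longer inputs are treated.

```python
def pad_with_n(seq, target_len=2000):
    """Center seq in a window of target_len by adding Ns to both sides.

    Illustration only. Odd shortfalls put the extra N on the right
    (an assumption); sequences already long enough are returned as-is.
    """
    missing = target_len - len(seq)
    if missing <= 0:
        return seq
    left = missing // 2
    right = missing - left
    return "N" * left + seq + "N" * right


def n_fraction(seq):
    """Fraction of positions in seq that are N (case-insensitive)."""
    return seq.upper().count("N") / len(seq)


if __name__ == "__main__":
    one_kb = "ACGT" * 250           # toy 1,000bp sequence
    padded = pad_with_n(one_kb)     # 500 Ns + sequence + 500 Ns
    print(len(padded), n_fraction(padded))
```

For a 1kb input, half of the padded window is N, which is far more than the rare assembly-gap Ns the model saw in training, so checking the N fraction before trusting a prediction seems like a reasonable precaution.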
That makes sense. Thank you!
How does Beluga handle sequences < 1,000bp? Does it center on the input sequence and pad it with N's, or does it do something else? Thanks!