frederikkemarin / BEND

Benchmarking DNA Language Models on Biologically Meaningful Tasks
BSD 3-Clause "New" or "Revised" License

Clarification Flanking #66

Closed: tdsone closed this issue 1 week ago

tdsone commented 1 month ago

Hey there!

I've been unsure about the use of the flank parameter in the fetch method of the Fasta class. I expected flank_left and flank_right from the datasets (e.g. for the gene_finding task [1]) to be passed to fetch via the flank parameter. But two things confuse me: 1) this is never done, and I can't find any other use of flank_left and flank_right in the code; 2) the flank parameter is applied symmetrically, while flank_left and flank_right are unequal values.

Thanks for your help :)

[1] Gene Finding Data with flank_left and flank_right

head gene_finding.bed
chromosome  start      end        transcript_id      strand  flank_left  flank_right  length  split
chr7        142924180  142934109  ENST00000442623.1  -       614         386          9929    train
chr7        142939447  142941832  ENST00000409607.5  +       36          964          2385    train

frederikkemarin commented 1 week ago

Hello, yes, I understand your confusion. The values in the start and end columns here already include the flanks. The flank_left and flank_right columns are provided only as extra information on how much flank was added on each side. When fetching, therefore, use only the start and end columns.
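As a rough illustration of the point above, the two rows quoted in the question can be checked directly: in both, end - start equals length, so the interval to fetch is simply (start, end), and the flank columns only tell you how much of that interval is flank. This is a minimal sketch and not code from BEND; the helper name `fetch_intervals`, the tab delimiter, and the derived `core_length` field are assumptions made for the example.

```python
import csv
import io

# The two rows quoted in the question, assumed here to be tab-separated.
BED_TEXT = (
    "chromosome\tstart\tend\ttranscript_id\tstrand\tflank_left\tflank_right\tlength\tsplit\n"
    "chr7\t142924180\t142934109\tENST00000442623.1\t-\t614\t386\t9929\ttrain\n"
    "chr7\t142939447\t142941832\tENST00000409607.5\t+\t36\t964\t2385\ttrain\n"
)


def fetch_intervals(bed_text):
    """Hypothetical helper: return the interval to fetch for each row.

    start and end already include the flanks, so they are used unchanged.
    flank_left and flank_right are only used to report how much of the
    interval is flank.
    """
    reader = csv.DictReader(io.StringIO(bed_text), delimiter="\t")
    intervals = []
    for row in reader:
        start, end = int(row["start"]), int(row["end"])
        flank_total = int(row["flank_left"]) + int(row["flank_right"])
        # Sanity check: the length column matches the flanked interval.
        if end - start != int(row["length"]):
            raise ValueError(f"length mismatch for {row['transcript_id']}")
        intervals.append(
            {
                "chromosome": row["chromosome"],
                "start": start,  # pass as-is when fetching
                "end": end,      # pass as-is when fetching
                "strand": row["strand"],
                # Derived for information only: interval size without flanks.
                "core_length": end - start - flank_total,
            }
        )
    return intervals


intervals = fetch_intervals(BED_TEXT)
for iv in intervals:
    print(iv["chromosome"], iv["start"], iv["end"], iv["strand"], iv["core_length"])
```

The sketch deliberately does not try to recover the un-flanked coordinates, since that would need to know which genomic side flank_left refers to on the minus strand, and the thread does not say.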

If you still have questions, please reach out.