MAGICS-LAB / DNABERT_2

[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
Apache License 2.0

How to specifically implement the task of Enhancer promoter interaction? #105

Closed yangzhao1230 closed 3 months ago

yangzhao1230 commented 3 months ago

This task requires two sequences as input, but neither the paper nor the code provides a detailed implementation method. Could you please provide a simple example of how to do this? For instance, concatenating two sequences together and feeding them into the model to predict the labels seems unreasonable. I would like to know the specific approach.

Zhihan1996 commented 3 months ago

See lines 131-132 of https://github.com/MAGICS-LAB/DNABERT_2/blob/main/finetune/train.py for details. As in BERT, we explicitly tell the tokenizer that the current input contains two sequences. It then concatenates the input_ids of both sequences and assigns distinct token_type_ids to each.
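
To make the reply above concrete, here is a minimal pure-Python sketch of BERT-style pair encoding: the two sequences are joined as `[CLS] A [SEP] B [SEP]`, and the segment ids mark which tokens belong to which sequence. The toy single-nucleotide vocabulary and the helpers `tokenize` and `encode_pair` are hypothetical and only for illustration. DNABERT-2 uses its own learned BPE vocabulary, and this is not the code in train.py.

```python
from typing import Dict, List

# Toy vocabulary (an assumption for illustration only; the real DNABERT-2
# tokenizer uses a learned BPE vocabulary, not single nucleotides).
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "A": 3, "C": 4, "G": 5, "T": 6}


def tokenize(seq: str) -> List[int]:
    """Map each nucleotide to a toy token id (hypothetical helper)."""
    return [VOCAB[base] for base in seq.upper()]


def encode_pair(seq_a: str, seq_b: str) -> Dict[str, List[int]]:
    """Encode two sequences the way BERT encodes a sentence pair.

    Layout:          [CLS] A ... [SEP] B ... [SEP]
    token_type_ids:    0   0 ...   0   1 ...   1
    """
    ids_a = tokenize(seq_a)
    ids_b = tokenize(seq_b)
    cls_id, sep_id = VOCAB["[CLS]"], VOCAB["[SEP]"]

    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A, and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)

    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Example: a short "enhancer" and "promoter" fragment (made-up sequences).
encoded = encode_pair("ACGT", "GGC")
print(encoded["input_ids"])       # [1, 3, 4, 5, 6, 2, 5, 5, 4, 2]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

With a Hugging Face tokenizer, the usual way to request this pair encoding is to pass the two sequences as separate arguments, for example `tokenizer(enhancer_seq, promoter_seq)`, rather than joining the strings yourself. The variable names here are placeholders; check the linked lines of train.py for how the repository actually does it.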