HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0

nucleotide finetuning #19

Closed · wawpaopao closed this issue 1 year ago

wawpaopao commented 1 year ago

When I ran

    python -m train wandb=null experiment=hg38/nucleotide_transformer dataset_name=enhancer dataset.max_length=500 model.layer.l_max=1026

something went wrong:

    Could not override 'dataset_name'.
    To append to your config use +dataset_name=enhancer
    Key 'dataset_name' is not in struct
        full_key: dataset_name
        object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace. Segmentation fault (core dumped)

wawpaopao commented 1 year ago

And what's the difference from the Hugging Face trainer provided in the Colab? When I used the Colab to fine-tune HyenaDNA on the Nucleotide Transformer datasets, the performance was a bit low...

exnx commented 1 year ago

In your first post, the correct flag is dataset.dataset_name; you left off the dataset prefix before dataset_name.
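
With that fix, the command from your post becomes:

    python -m train wandb=null experiment=hg38/nucleotide_transformer dataset.dataset_name=enhancer dataset.max_length=500 model.layer.l_max=1026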

Regarding your second post, the main differences are:

wawpaopao commented 1 year ago

Thanks! I used the Colab file to fine-tune on a DNABERT-2 (GUE) dataset, in the same way as the Nucleotide Transformer datasets, and I found the MCC was a bit low. I don't know what the error is... could you help me take a look?

import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# HyenaDNAPreTrainedModel, HyenaDNAModel, CharacterTokenizer, SupervisedDataset,
# and the train/test helpers are defined earlier (taken from the Colab)

def run_train():

    # experiment settings:
    num_epochs = 80  # ~100 seems fine
    max_length = 500  # max len of sequence of dataset (of what you want)
    use_padding = True
    data_path = './DNABERT_2/eval/GUE/EMP/H3'

    batch_size = 256
    learning_rate = 6e-4  # good default for Hyena
    rc_aug = True  # reverse complement augmentation
    add_eos = False  # add end of sentence token
    weight_decay = 0.1

    # for fine-tuning, only the 'tiny' model can fit on colab
    pretrained_model_name = 'hyenadna-tiny-1k256d-seqlen'  # use None if training from scratch

    # we need these for the decoder head, if using
    use_head = True
    n_classes = 1

    # you can override with your own backbone config here if you want,
    # otherwise we'll load the HF one by default
    backbone_cfg = None

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print("Using device:", device)

    # instantiate the model (pretrained here)
    if pretrained_model_name in ['hyenadna-tiny-1k256d-seqlen']:
        # use the pretrained Huggingface wrapper instead
        model = HyenaDNAPreTrainedModel.from_pretrained(
            './checkpoints',
            pretrained_model_name,
            download=False,
            config=backbone_cfg,
            device=device,
            use_head=use_head,
            n_classes=n_classes,
        )

    # from scratch
    else:
        model = HyenaDNAModel(**backbone_cfg, use_head=use_head, n_classes=n_classes)

    # create tokenizer
    tokenizer = CharacterTokenizer(
        characters=['A', 'C', 'G', 'T', 'N'],  # add DNA characters, N is uncertain
        model_max_length=max_length + 2,  # to account for special tokens, like EOS
        add_special_tokens=False,  # we handle special tokens elsewhere
        padding_side='left',  # since HyenaDNA is causal, we pad on the left
    )

    # create datasets
    ds_train = SupervisedDataset(tokenizer=tokenizer,
                                 data_path=os.path.join(data_path, 'train.csv'),
                                 kmer=-1)
    ds_test = SupervisedDataset(tokenizer=tokenizer,
                                data_path=os.path.join(data_path, 'test.csv'),
                                kmer=-1)
    train_loader = DataLoader(ds_train, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(ds_test, batch_size=batch_size, shuffle=False)

    # loss function
    loss_fn = nn.BCEWithLogitsLoss()
    # loss_fn = nn.MSELoss()

    # create optimizer
    optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

    model.to(device)

    for epoch in range(num_epochs):
        train(model, device, train_loader, optimizer, epoch, loss_fn)
        test(model, device, test_loader, loss_fn)

if __name__ == "__main__":
    run_train()
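
For context, the train and test helpers are basically the Colab's loop; roughly like this (a sketch, not the exact code; it assumes the loader yields (sequence, label) pairs, so adjust the unpacking if your Dataset returns a dict):

    # rough sketch of the helpers used above, adapted from the Colab-style loop
    from sklearn.metrics import matthews_corrcoef

    def train(model, device, train_loader, optimizer, epoch, loss_fn):
        model.train()
        for data, target in train_loader:  # epoch is only used for logging here
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            logits = model(data).squeeze(-1)        # head output (batch, 1) -> (batch,)
            loss = loss_fn(logits, target.float())  # BCEWithLogitsLoss wants float targets
            loss.backward()
            optimizer.step()

    def test(model, device, test_loader, loss_fn):
        model.eval()
        preds, labels = [], []
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                logits = model(data).squeeze(-1)
                preds.append((logits > 0).long().cpu())  # threshold logits at 0
                labels.append(target.cpu())
        preds, labels = torch.cat(preds), torch.cat(labels)
        print("test MCC:", matthews_corrcoef(labels.numpy(), preds.numpy()))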

exnx commented 1 year ago

As mentioned, the Colab is missing a lot of what's needed to get competitive results; it's mainly for educational purposes. To get good results, you'll need to use the main repo for fine-tuning. The hyperparameters also matter a lot, and those are something only you can find by running sweeps on your actual datasets.
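
For example, Hydra lets you sweep comma-separated values with --multirun; something along these lines (the config key names here are illustrative, so check the configs in the repo for the exact names):

    python -m train --multirun wandb=null experiment=hg38/nucleotide_transformer \
        dataset.dataset_name=enhancer dataset.max_length=500 model.layer.l_max=1026 \
        optimizer.lr=1e-4,3e-4,6e-4 dataset.batch_size=128,256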

Unfortunately, we're not able to support fine-tuning on your own datasets; we mainly support reproducing the results from the paper.

But maybe this new Docker image will help: it has the Nucleotide Transformer datasets and the exact launch commands and hyperparameters used to get the best performance. The environment, pretrained weights, datasets, and launch commands are all inside the image; you just need to pull and launch it. Perhaps you can "reverse engineer" those settings to find what works best on your own datasets.

docker pull hyenadna/hyena-dna-nt6:latest 
docker run --gpus all -it -p80:3000 hyenadna/hyena-dna-nt6 /bin/bash

This will land you inside /wdr, which has a file named launch_commands_nucleotide_transformer containing the launch commands for all 18 NT datasets.
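
For example, once inside the container you can view them with:

    cat /wdr/launch_commands_nucleotide_transformer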

wawpaopao commented 1 year ago

Thanks!