QAQ1551QAQ opened this issue 1 year ago
I have this issue too. I got the error `TypeError: dot() got an unexpected keyword argument 'trans_b'`, so I removed that argument from the code (not good practice, though). That yielded another error, this time about incompatible shapes in the dot product.
Still trying to figure it out; if any of the authors can point us to the cause, please do.
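For context on why simply deleting `trans_b=True` leads to a shape error, here is a tiny illustration in plain PyTorch (my own sketch, not the Triton kernel itself; shapes are made up):

```python
# q and k in the attention kernel are both (seq_len, head_dim); trans_b=True
# told the old tl.dot to multiply q against k transposed. Dropping the flag
# without transposing k leaves incompatible inner dimensions.
import torch

q = torch.randn(128, 64)
k = torch.randn(128, 64)
# torch.matmul(q, k)       # RuntimeError: mat1 and mat2 shapes cannot be multiplied
qk = torch.matmul(q, k.T)  # (128, 128) -- the result trans_b=True used to produce
print(qk.shape)
```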
Hey,
Can you please try to install triton from source with:
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build-time dependency
pip install -e .
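If the source build succeeds, a quick smoke test like the sketch below (my own, not from the DNABERT-2 repo; it needs a CUDA GPU) can confirm that the freshly built triton can actually compile and launch a kernel:

```python
# Minimal Triton smoke test: copy a vector on the GPU with a JIT-compiled kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(y_ptr + offsets, x, mask=mask)

x = torch.arange(1024, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 256),)
copy_kernel[grid](x, y, x.numel(), BLOCK_SIZE=256)
assert torch.equal(x, y)
print("triton", triton.__version__, "works")
```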
When I tried this it yielded an error:
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'TritonRelBuildWithAsserts', '-j64']' returned non-zero exit status 2.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building editable for triton
Failed to build triton
ERROR: Could not build wheels for triton, which is required to install pyproject.toml-based projects
The call stack is very long but this is a snippet of it.
********************************************************************************
An error happened while installing `triton` in editable mode.
The following steps are recommended to help debug this problem:
- Try to install the project normally, without using the editable mode.
Does the error still persist?
(If it does, try fixing the problem before attempting the editable mode).
- If you are using binary extensions, make sure you have all OS-level
dependencies installed (e.g. compilers, toolchains, binary libraries, ...).
- Try the latest version of setuptools (maybe the error was already fixed).
- If you (or your project dependencies) are using any setuptools extension
or customization, make sure they support the editable mode.
After following the steps above, if the problem still persists and
you think this is related to how setuptools handles editable installations,
please submit a reproducible example
(see https://stackoverflow.com/help/minimal-reproducible-example) to:
https://github.com/pypa/setuptools/issues
See https://setuptools.pypa.io/en/latest/userguide/development_mode.html for details.
Hi guys, I found the solution. I spent 5 hours on it...
So the problem is that the model "zhihan1996/DNABERT-2-117M", loaded with `AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)`, ships a script called flash_attn_triton.py (see https://huggingface.co/zhihan1996/DNABERT-2-117M/blob/main/flash_attn_triton.py) in which the line `qk += tl.dot(q, k, trans_b=True)` is no longer compatible with triton 2.0.1 (the version you get from `pip install triton`) or triton 2.1.0 (the version you get from `git clone https://github.com/openai/triton.git`): the function `tl.dot()` no longer accepts the `trans_b` parameter. See also: https://github.com/microsoft/DeepSpeed/issues/3491. The triton version that is compatible is 2.0.0.dev20221202 (it still has the `trans_b` parameter).
So you have to do: `pip install triton==2.0.0.dev20221202`. Unfortunately, this will install torch 1.13.1 instead of 2.x.
But it works fine after that :).
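To check which API a given triton installation actually exposes, a small probe like this sketch (mine, hedged; it assumes `tl.dot` keeps its wrapped signature visible to `inspect`) can save some guesswork:

```python
# Print the installed triton version and whether its tl.dot still accepts the
# trans_b keyword that flash_attn_triton.py uses.
import inspect
import triton
import triton.language as tl

print(triton.__version__)
print(inspect.signature(tl.dot))
print("accepts trans_b:", "trans_b" in inspect.signature(tl.dot).parameters)
```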
@raphaelmourad were you able to run the finetuning scripts after installing that version of Triton? I attempted to but it encountered a new error with Triton.
@Zhihan1996 I previously recommended installing directly from the Triton Github repo which I thought solved the issue but upon further inspection it did not. Have you been able to run the scripts from a fresh install?
@pjsample I could not run the scripts as other bugs kept appearing. I had to write my own script, and it worked after a lot of modifications.
The problem here is that we need the right versions of all the Python modules, but they were not specified. For instance, as the triton module gets updated it becomes incompatible with DNABERT2.
@raphaelmourad Hey, could you provide the script that has run successfully?
@GriffithLin I made my own jupyter notebook as follows (I had to fix a lot of bugs):
```python
import os
import sys
import time
from os import path
import gc
import json

import numpy as np
import pandas as pd

import torch
import triton
import transformers
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import TensorDataset, DataLoader

print(np.__version__)  # Be careful: numpy should be 1.19 (and not 1.2) for spektral to work!
print(triton.__version__)
print(torch.cuda.get_device_name(0))

os.chdir("/media/mourad/SSD2/DataAugmentDL")
print(os.getcwd())

sys.path.append("/media/mourad/SSD2/DataAugmentDL/DNABERT2/DNABERT_2-main/finetune/")
from train import *

model_args = ModelArguments()
data_args = DataArguments()
# Note: the TrainingArguments class object itself is used as args here and
# patched attribute by attribute below.
training_args = TrainingArguments

data_args.data_path = "/media/mourad/SSD2/DataAugmentDL/DNABERT2/GUE/EMP/H3K4me1/"
model_args.model_name_or_path = "/media/mourad/SSD2/DataAugmentDL/DNABERT2/DNABERT-2-117M/"

training_args.deepspeed_plugin = None
training_args.run_name = "DNABERT2_aug"
training_args.model_max_length = 20
training_args.per_device_train_batch_size = 32
training_args.per_device_eval_batch_size = 16
training_args.gradient_accumulation_steps = 1
training_args.learning_rate = 3e-5
training_args.num_train_epochs = 4
training_args.fp16 = False
training_args.save_steps = 400
training_args.output_dir = "results/DNABERT2/" + expe
training_args.evaluation_strategy = "steps"
training_args.eval_steps = 100
training_args.warmup_steps = 50
training_args.logging_steps = 100000
training_args.find_unused_parameters = False

training_args.device = torch.device('cuda:0')
training_args.report_to = ["tensorboard"]
training_args.world_size = 1
training_args.per_device_train_batch_size = 8
training_args.train_batch_size = 32
training_args.eval_batch_size = 32
training_args.test_batch_size = 32
training_args.batch_size = 32
training_args.num_training_steps = 100
training_args.n_gpu = 1
training_args.distributed_state = None
training_args.local_rank = -1

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side="right",
    use_fast=True,
    trust_remote_code=True,
)

if "InstaDeepAI" in model_args.model_name_or_path:
    tokenizer.eos_token = tokenizer.pad_token

train_dataset = SupervisedDataset(tokenizer=tokenizer,
                                  data_path=os.path.join(data_args.data_path, "train.csv"),
                                  kmer=data_args.kmer)
val_dataset = SupervisedDataset(tokenizer=tokenizer,
                                data_path=os.path.join(data_args.data_path, "dev.csv"),
                                kmer=data_args.kmer)
test_dataset = SupervisedDataset(tokenizer=tokenizer,
                                 data_path=os.path.join(data_args.data_path, "test.csv"),
                                 kmer=data_args.kmer)
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    num_labels=train_dataset.num_labels,
    trust_remote_code=True,
    output_hidden_states=False,
)

if model_args.use_lora:
    lora_config = LoraConfig(
        r=model_args.lora_r,
        lora_alpha=model_args.lora_alpha,
        target_modules=list(model_args.lora_target_modules.split(",")),
        lora_dropout=model_args.lora_dropout,
        bias="none",
        task_type="SEQ_CLS",
        inference_mode=False,
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

trainer = transformers.Trainer(model=model,
                               tokenizer=tokenizer,
                               args=training_args,
                               compute_metrics=compute_metrics,
                               train_dataset=train_dataset,
                               eval_dataset=val_dataset,
                               data_collator=data_collator)
trainer.local_rank = training_args.local_rank
trainer.train()

if training_args.eval_and_save_results:
    results_path = training_args.output_dir + "/" + augmentation + "/metrics"
    results = trainer.evaluate(eval_dataset=test_dataset)
    os.makedirs(results_path, exist_ok=True)
    with open(os.path.join(results_path, "test_results.json"), "w") as f:
        json.dump(results, f)
```
@raphaelmourad thanks! Another question about the environment. I have installed triton 2.0.0.dev20221202 and torch 1.13.1, but when I run the test code from the Quick Start, I get the error `RuntimeError: Triton requires CUDA 11.4+` (my CUDA version is 11.7, which should satisfy this).
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
File "", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0--2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-42648570729a4835b21c1c18cebedbfe-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('matrix', False, 64, False, False, True, 128, 128), (True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False), (False, False), (True, False), (True, False), (True, False), (False, False), (False, False), (False, False), (True, False), (True, False), (True, False), (True, False)))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 17, in
hidden_states = model(inputs)[0] # [1, sequence_length, 768]
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data3/linming/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py", line 608, in forward
encoder_outputs = self.encoder(
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data3/linming/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py", line 446, in forward
hidden_states = layer_module(hidden_states,
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data3/linming/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py", line 327, in forward
attention_output = self.attention(hidden_states, cu_seqlens, seqlen,
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data3/linming/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py", line 240, in forward
self_output = self.self(input_tensor, cu_seqlens, max_s, indices,
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data3/linming/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py", line 181, in forward
attention = flash_attn_qkvpacked_func(qkv, bias)
File "/data3/linming/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/flash_attn_triton.py", line 1021, in forward
o, lse, ctx.softmax_scale = _flash_attn_forward(
File "/data3/linming/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/flash_attn_triton.py", line 826, in _flash_attn_forward
_fwd_kernel[grid]( # type: ignore
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/triton/runtime/autotuner.py", line 86, in run
return self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/triton/runtime/autotuner.py", line 200, in run
return self.fn.run(*args, **kwargs)
File "", line 41, in _fwd_kernel
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/triton/compiler.py", line 1256, in compile
asm, shared, kernel_name = _compile(fn, signature, device, constants, configs[0], num_warps, num_stages,
File "/data3/linming/.conda/envs/dna/lib/python3.8/site-packages/triton/compiler.py", line 901, in _compile
name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: Triton requires CUDA 11.4+
# packages in environment at /data3/linming/.conda/envs/dna:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main defaults
_openmp_mutex 5.1 1_gnu defaults
accelerate 0.19.0 pypi_0 pypi
aiohttp 3.8.4 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
antlr4-python3-runtime 4.9.3 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
ca-certificates 2023.05.30 h06a4308_0 defaults
certifi 2023.5.7 pypi_0 pypi
charset-normalizer 3.1.0 pypi_0 pypi
cmake 3.26.3 pypi_0 pypi
datasets 2.12.0 pypi_0 pypi
dill 0.3.6 pypi_0 pypi
einops 0.6.1 pypi_0 pypi
evaluate 0.4.0 pypi_0 pypi
filelock 3.12.0 pypi_0 pypi
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.5.0 pypi_0 pypi
huggingface-hub 0.14.1 pypi_0 pypi
idna 3.4 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1 defaults
libffi 3.4.4 h6a678d5_0 defaults
libgcc-ng 11.2.0 h1234567_1 defaults
libgomp 11.2.0 h1234567_1 defaults
libstdcxx-ng 11.2.0 h1234567_1 defaults
lit 16.0.6 pypi_0 pypi
markupsafe 2.1.2 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
multiprocess 0.70.14 pypi_0 pypi
ncurses 6.4 h6a678d5_0 defaults
networkx 3.1 pypi_0 pypi
numpy 1.24.4 pypi_0 pypi
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
omegaconf 2.3.0 pypi_0 pypi
openssl 3.0.9 h7f8727e_0 defaults
packaging 23.1 pypi_0 pypi
pandas 2.0.3 pypi_0 pypi
peft 0.3.0 pypi_0 pypi
pillow 9.5.0 pypi_0 pypi
pip 23.1.2 py38h06a4308_0 defaults
psutil 5.9.5 pypi_0 pypi
pyarrow 12.0.0 pypi_0 pypi
python 3.8.17 h955ad1f_0 defaults
python-dateutil 2.8.2 pypi_0 pypi
pytz 2023.3 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0 defaults
regex 2023.5.5 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
responses 0.18.0 pypi_0 pypi
safetensors 0.3.1 pypi_0 pypi
setuptools 67.8.0 py38h06a4308_0 defaults
six 1.16.0 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0 defaults
sympy 1.12 pypi_0 pypi
tk 8.6.12 h1ccaba5_0 defaults
tokenizers 0.13.3 pypi_0 pypi
torch 1.13.1 pypi_0 pypi
torchaudio 2.0.2 pypi_0 pypi
torchvision 0.15.2 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
transformers 4.30.2 pypi_0 pypi
triton 2.0.0.dev20221202 pypi_0 pypi
typing-extensions 4.7.0 pypi_0 pypi
tzdata 2023.3 pypi_0 pypi
urllib3 2.0.3 pypi_0 pypi
wheel 0.38.4 py38h06a4308_0 defaults
xxhash 3.2.0 pypi_0 pypi
xz 5.4.2 h5eee18b_0 defaults
yarl 1.9.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_0 defaults
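Not a fix, but a few quick checks (a sketch of mine, nothing DNABERT-2-specific) that help narrow down where the `RuntimeError: Triton requires CUDA 11.4+` message comes from, since the driver version, the CUDA build of the torch wheel, and the toolkit on PATH can all differ:

```python
# Report the CUDA-related facts that triton's compiler can trip over.
import shutil
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
print("nvcc on PATH:", shutil.which("nvcc"))
```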
@GriffithLin I have installed CUDA 12.2. Driver: NVIDIA-SMI 535.54.03. GPU: RTX3090 24Gb. Now you know all my config ;).
Package Version
absl-py 1.4.0
accelerate 0.21.0
aiohttp 3.8.4
aiosignal 1.3.1
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
backports.functools-lru-cache 1.6.5
bio 1.5.9
biopython 1.81
biothings-client 0.3.0
Brotli 1.0.9
cachetools 5.3.1
certifi 2023.5.7
charset-normalizer 3.2.0
click 8.1.5
cmake 3.26.4
colorama 0.4.6
comm 0.1.3
contourpy 1.1.0
cycler 0.11.0
dataclasses 0.8
datasets 2.13.1
debugpy 1.6.7
decorator 5.1.1
dill 0.3.6
einops 0.6.1
executing 1.2.0
fairscale 0.4.13
filelock 3.12.2
fonttools 4.41.0
frozenlist 1.4.0
fsspec 2023.6.0
google-auth 2.22.0
google-auth-oauthlib 1.0.0
gprofiler-official 1.0.0
grpcio 1.56.0
h5py 3.9.0
huggingface-hub 0.16.4
idna 3.4
importlib-metadata 6.8.0
importlib-resources 6.0.0
IProgress 0.4
ipykernel 6.24.0
ipython 8.12.0
ipywidgets 8.0.7
jedi 0.18.2
Jinja2 3.1.2
joblib 1.3.0
jupyter_client 8.3.0
jupyter_core 4.12.0
jupyterlab-widgets 3.0.8
kiwisolver 1.4.4
lit 16.0.6
Markdown 3.4.3
MarkupSafe 2.1.3
matplotlib 3.7.2
matplotlib-inline 0.1.6
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
mygene 3.2.2
mypy-extensions 1.0.0
nest-asyncio 1.5.6
networkx 3.1
numpy 1.24.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.2
packaging 23.1
pandas 2.0.3
parso 0.8.3
peft 0.4.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.0
pip 23.2
platformdirs 3.9.1
pooch 1.7.0
progressbar 2.5
prompt-toolkit 3.0.39
protobuf 4.23.4
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 12.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
Pygments 2.15.1
pyparsing 3.0.9
pyre-extensions 0.0.30
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2023.3
PyYAML 6.0.1
pyzmq 25.1.0
regex 2023.6.3
requests 2.31.0
requests-oauthlib 1.3.1
responses 0.18.0
rsa 4.9
sacremoses 0.0.53
safetensors 0.3.1
scikit-learn 1.3.0
scipy 1.10.1
setuptools 68.0.0
six 1.16.0
stack-data 0.6.2
sympy 1.12
tensorboard 2.13.0
tensorboard-data-server 0.7.1
threadpoolctl 3.2.0
tokenizers 0.13.3
torch 1.13.1
torchbearer 0.5.3
torcheval 0.0.6
torchtnt 0.1.0
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.30.2
triton 2.0.0.dev20221202
typing_extensions 4.7.1
typing-inspect 0.9.0
tzdata 2023.3
urllib3 1.26.16
wcwidth 0.2.6
Werkzeug 2.3.6
wheel 0.40.0
widgetsnbextension 4.0.8
xxhash 0.0.0
yarl 1.9.2
zipp 3.16.2
@GriffithLin also, in the module train.py (folder "finetune"), I changed this (I modified the functions get_process_log_level() and get_warmup_steps(), as there were bugs):
```python
@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    run_name: str = field(default="run")
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(default=512, metadata={"help": "Maximum sequence length."})
    gradient_accumulation_steps: int = field(default=1)
    per_device_train_batch_size: int = field(default=1)
    per_device_eval_batch_size: int = field(default=1)
    num_train_epochs: int = field(default=1)
    fp16: bool = field(default=False)
    logging_steps: int = field(default=100)
    log_level: str = field(default="info")
    save_steps: int = field(default=100)
    eval_steps: int = field(default=100)
    evaluation_strategy: str = field(default="steps")
    warmup_steps: int = field(default=50)
    weight_decay: float = field(default=0.01)
    learning_rate: float = field(default=1e-4)
    save_total_limit: int = field(default=3)
    load_best_model_at_end: bool = field(default=True)
    output_dir: str = field(default="output")
    find_unused_parameters: bool = field(default=False)
    checkpointing: bool = field(default=False)
    dataloader_pin_memory: bool = field(default=False)
    eval_and_save_results: bool = field(default=True)
    save_model: bool = field(default=False)
    seed: int = field(default=42)

    # Overridden to return fixed values (note: in the notebook above the
    # TrainingArguments class object itself is passed as args, so these take
    # no `self`).
    def get_process_log_level():
        return 10

    def get_warmup_steps(num_training_steps):
        return 8
```
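For what it's worth, a rough sketch of why hard-coding these two methods is enough (my reading of the transformers 4.30.x sources, so treat it as an assumption): Trainer only calls `args.get_process_log_level()` to pick the per-process logging verbosity and `args.get_warmup_steps(num_training_steps)` to size the learning-rate warmup. The toy class below just mimics that call pattern:

```python
# Toy stand-in (hypothetical) for the patched TrainingArguments above.
import logging

class ToyArgs:
    @staticmethod
    def get_process_log_level():
        return 10  # 10 == logging.DEBUG

    @staticmethod
    def get_warmup_steps(num_training_steps):
        return 8   # fixed warmup, ignoring num_training_steps

print(logging.getLevelName(ToyArgs.get_process_log_level()))  # DEBUG
print(ToyArgs.get_warmup_steps(1000))                         # 8
```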
> @raphaelmourad thanks! Another question about the environment. I have installed triton 2.0.0.dev20221202 and torch 1.13.1, but when I run the test code from the Quick Start, I get the error `RuntimeError: Triton requires CUDA 11.4+` (my CUDA version is 11.7, which should satisfy this).

I have the same problem as you. Have you tackled it?
> I have the same problem as you. Have you tackled it?
@wzy-Sarah I have CUDA Version: 12.2.
> @wzy-Sarah I have CUDA Version: 12.2.
Can it work by reducing the torch version?
I think I solved it on my system. I have an NVIDIA A100; `nvidia-smi` reports `Driver Version: 535.104.05 CUDA Version: 12.2`. Same error about triton wanting CUDA 11+.
Made a new environment:
mamba create -n dna python=3.8
conda activate dna
Then I forced the torch CUDA version:
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
Then I installed the required packages via this requirements.txt (not pulling/installing triton from github):
triton==2.0.0.dev20221202
transformers==4.29.2
scikit-learn
peft
einops
Finally, I had to install a CUDA 11 nvcc in the conda environment; I believe triton gets confused by the system-wide CUDA 12 nvcc binary.
mamba install -c "nvidia/label/cuda-11.7.0" cuda-nvcc
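A quick optional check (my own sketch, not part of the original recipe) to confirm that the nvcc now found first on PATH is the CUDA 11.7 one from the conda environment rather than the system-wide CUDA 12 one:

```python
# Show which nvcc the environment will pick up and its reported version.
import shutil
import subprocess

nvcc = shutil.which("nvcc")
print(nvcc)  # expected to point inside the conda env, e.g. .../envs/dna/bin/nvcc
if nvcc:
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)
```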
At least the example data works now :)
Command:
```bash
export DATA_PATH=`pwd`/DNABERT_2/sample_data
export LR=3e-5
export MAX_LENGTH=100

python DNABERT_2/finetune/train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path ${DATA_PATH} \
    --kmer -1 \
    --run_name DNABERT2_${DATA_PATH} \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --fp16 \
    --save_steps 200 \
    --output_dir output/dnabert2 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --warmup_steps 50 \
    --logging_steps 100 \
    --overwrite_output_dir True \
    --log_level info \
    --find_unused_parameters False
```
`nvidia-smi` reports Python using the GPU.
WARNING:root:Perform single sequence classification... WARNING:root:Perform single sequence classification... WARNING:root:Perform single sequence classification... huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) Some weights of the model checkpoint at zhihan1996/DNABERT-2-117M were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias'] - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['bert.pooler.dense.weight', 'classifier.bias', 'bert.pooler.dense.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. Using cuda_amp half precision backend ***** Running training ***** Num examples = 15 Num Epochs = 5 Instantaneous batch size per device = 8 Total train batch size (w. parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 1 Total optimization steps = 10 Number of trainable parameters = 117,070,082 0%| | 0/10 [00:00, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) 90%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 9/10 [00:05<00:00, 3.53it/s] Training completed. Do not forget to share your model on huggingface.co/models =) {'train_runtime': 5.5699, 'train_samples_per_second': 13.465, 'train_steps_per_second': 1.795, 'train_loss': 0.6913905620574952, 'epoch': 5.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00, 1.80it/s] ***** Running Evaluation ***** Num examples = 15 Batch size = 16 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 107.90it/s]
> I think I solved it on my system. I have an NVIDIA A100; `nvidia-smi` reports `Driver Version: 535.104.05 CUDA Version: 12.2`. Same error about triton wanting CUDA 11+. [...] At least the example data works now :)
I tried this but hit the error `assert q.is_cuda and k.is_cuda and v.is_cuda`.
I also tried compiling from source and got the autotune problem: `module 'triton' has no attribute 'autotune'`.
Hi, I ran into the same error:
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'TritonRelBuildWithAsserts', '-j64']' returned non-zero exit status 1.
Are there any working solutions?
Epoch [1/3]
KeyError Traceback (most recent call last) File:21, in _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE, IS_CAUSAL, BLOCK_HEADDIM, EVEN_M, EVEN_N, EVEN_HEADDIM, BLOCK_M, BLOCK_N, grid, num_warps, num_stages, extern_libs, stream, warmup)
KeyError: ('2-.-0-.-0--d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-14de7de5c4da5794c8ca14e7e41a122d-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('matrix', False, 64, False, False, True, 128, 128), (True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False), (False, False), (True, False), (True, False), (True, False), (False, False), (False, False), (False, False), (True, False), (True, False), (False, False), (False, False)))
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last) File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:937, in build_triton_ir(fn, signature, specialization, constants) 936 try: --> 937 generator.visit(fn.parse()) 938 except Exception as e:
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:855, in CodeGenerator.visit(self, node) 854 warnings.simplefilter("ignore", PendingDeprecationWarning) # python 3.8 --> 855 return super().visit(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/ast.py:371, in NodeVisitor.visit(self, node) 370 visitor = getattr(self, method, self.generic_visit) --> 371 return visitor(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:183, in CodeGenerator.visit_Module(self, node) 182 def visit_Module(self, node): --> 183 ast.NodeVisitor.generic_visit(self, node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/ast.py:379, in NodeVisitor.generic_visit(self, node) 378 if isinstance(item, AST): --> 379 self.visit(item) 380 elif isinstance(value, AST):
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:855, in CodeGenerator.visit(self, node) 854 warnings.simplefilter("ignore", PendingDeprecationWarning) # python 3.8 --> 855 return super().visit(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/ast.py:371, in NodeVisitor.visit(self, node) 370 visitor = getattr(self, method, self.generic_visit) --> 371 return visitor(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:252, in CodeGenerator.visit_FunctionDef(self, node) 251 # visit function body --> 252 has_ret = self.visit_compound_statement(node.body) 253 # finalize function
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:177, in CodeGenerator.visit_compound_statement(self, stmts) 176 for stmt in stmts: --> 177 self.last_ret_type = self.visit(stmt) 178 if isinstance(stmt, ast.Return):
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:855, in CodeGenerator.visit(self, node) 854 warnings.simplefilter("ignore", PendingDeprecationWarning) # python 3.8 --> 855 return super().visit(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/ast.py:371, in NodeVisitor.visit(self, node) 370 visitor = getattr(self, method, self.generic_visit) --> 371 return visitor(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:678, in CodeGenerator.visit_For(self, node) 677 self.scf_stack.append(node) --> 678 self.visit_compound_statement(node.body) 679 self.scf_stack.pop()
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:177, in CodeGenerator.visit_compound_statement(self, stmts) 176 for stmt in stmts: --> 177 self.last_ret_type = self.visit(stmt) 178 if isinstance(stmt, ast.Return):
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:855, in CodeGenerator.visit(self, node) 854 warnings.simplefilter("ignore", PendingDeprecationWarning) # python 3.8 --> 855 return super().visit(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/ast.py:371, in NodeVisitor.visit(self, node) 370 visitor = getattr(self, method, self.generic_visit) --> 371 return visitor(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:319, in CodeGenerator.visit_AugAssign(self, node) 318 assign = ast.Assign(targets=[node.target], value=rhs) --> 319 self.visit(assign) 320 return self.get_value(name)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:855, in CodeGenerator.visit(self, node) 854 warnings.simplefilter("ignore", PendingDeprecationWarning) # python 3.8 --> 855 return super().visit(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/ast.py:371, in NodeVisitor.visit(self, node) 370 visitor = getattr(self, method, self.generic_visit) --> 371 return visitor(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:301, in CodeGenerator.visit_Assign(self, node) 300 names = _names[0] --> 301 values = self.visit(node.value) 302 if not isinstance(names, tuple):
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:855, in CodeGenerator.visit(self, node) 854 warnings.simplefilter("ignore", PendingDeprecationWarning) # python 3.8 --> 855 return super().visit(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/ast.py:371, in NodeVisitor.visit(self, node) 370 visitor = getattr(self, method, self.generic_visit) --> 371 return visitor(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:339, in CodeGenerator.visit_BinOp(self, node) 338 lhs = self.visit(node.left) --> 339 rhs = self.visit(node.right) 340 fn = { 341 ast.Add: 'add', 342 ast.Sub: 'sub', (...) 352 ast.BitXor: 'xor', 353 }[type(node.op)]
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:855, in CodeGenerator.visit(self, node) 854 warnings.simplefilter("ignore", PendingDeprecationWarning) # python 3.8 --> 855 return super().visit(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/ast.py:371, in NodeVisitor.visit(self, node) 370 visitor = getattr(self, method, self.generic_visit) --> 371 return visitor(node)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:797, in CodeGenerator.visit_Call(self, node) 795 if (hasattr(fn, 'self') and self.is_triton_tensor(fn.self)) \ 796 or impl.is_builtin(fn): --> 797 return fn(*args, _builder=self.builder, **kws) 798 if fn in self.builtins.values():
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/impl/base.py:22, in builtin.<locals>.wrapper(*args, **kwargs)
     18     raise ValueError(
     19         "Did you forget to add @triton.jit ? "
     20         "(`_builder` argument must be provided outside of JIT functions.)"
     21     )
---> 22     return fn(*args, **kwargs)

TypeError: dot() got an unexpected keyword argument 'trans_b'
The above exception was the direct cause of the following exception:
CompilationError Traceback (most recent call last) Cell In[15], line 1 ----> 1 teacher_train(T_model, cfg, train_loader, test_loader)
Cell In[14], line 39, in teacher_train(model, config, train_loader, test_loader) 37 mask = mask.to(config.device) 38 labels = labels.to(config.device) ---> 39 outputs = model(ids, mask) 40 model.zero_grad() 41 loss = F.cross_entropy(outputs, labels)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, *kwargs) 1496 # If we don't have any hooks, we want to skip the rest of the logic in 1497 # this function, and just call forward. 1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1499 or _global_backward_pre_hooks or _global_backward_hooks 1500 or _global_forward_hooks or _global_forward_pre_hooks): -> 1501 return forward_call(args, **kwargs) 1502 # Do not call functions when jit is used 1503 full_backward_hooks, non_full_backward_hooks = [], []
Cell In[12], line 12, in BERT_Model.forward(self, context, mask) 11 def forward(self, context, mask): ---> 12 outputs = self.bert(context, attention_mask=mask) 13 pooled = outputs[1] 14 out = self.fc(pooled)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, *kwargs) 1496 # If we don't have any hooks, we want to skip the rest of the logic in 1497 # this function, and just call forward. 1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1499 or _global_backward_pre_hooks or _global_backward_hooks 1500 or _global_forward_hooks or _global_forward_pre_hooks): -> 1501 return forward_call(args, **kwargs) 1502 # Do not call functions when jit is used 1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py:608, in BertModel.forward(self, input_ids, token_type_ids, attention_mask, position_ids, output_all_encoded_layers, masked_tokens_mask, **kwargs) 605 first_col_mask[:, 0] = True 606 subset_mask = masked_tokens_mask | first_col_mask --> 608 encoder_outputs = self.encoder( 609 embedding_output, 610 attention_mask, 611 output_all_encoded_layers=output_all_encoded_layers, 612 subset_mask=subset_mask) 614 if masked_tokens_mask is None: 615 sequence_output = encoder_outputs[-1]
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, *kwargs) 1496 # If we don't have any hooks, we want to skip the rest of the logic in 1497 # this function, and just call forward. 1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1499 or _global_backward_pre_hooks or _global_backward_hooks 1500 or _global_forward_hooks or _global_forward_pre_hooks): -> 1501 return forward_call(args, **kwargs) 1502 # Do not call functions when jit is used 1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py:446, in BertEncoder.forward(self, hidden_states, attention_mask, output_all_encoded_layers, subset_mask) 444 if subset_mask is None: 445 for layer_module in self.layer: --> 446 hidden_states = layer_module(hidden_states, 447 cu_seqlens, 448 seqlen, 449 None, 450 indices, 451 attn_mask=attention_mask, 452 bias=alibi_attn_mask) 453 if output_all_encoded_layers: 454 all_encoder_layers.append(hidden_states)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, *kwargs) 1496 # If we don't have any hooks, we want to skip the rest of the logic in 1497 # this function, and just call forward. 1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1499 or _global_backward_pre_hooks or _global_backward_hooks 1500 or _global_forward_hooks or _global_forward_pre_hooks): -> 1501 return forward_call(args, **kwargs) 1502 # Do not call functions when jit is used 1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py:327, in BertLayer.forward(self, hidden_states, cu_seqlens, seqlen, subset_idx, indices, attn_mask, bias) 305 def forward( 306 self, 307 hidden_states: torch.Tensor, (...) 313 bias: Optional[torch.Tensor] = None, 314 ) -> torch.Tensor: 315 """Forward pass for a BERT layer, including both attention and MLP. 316 317 Args: (...) 325 bias: None or (batch, heads, max_seqlen_in_batch, max_seqlen_in_batch) 326 """ --> 327 attention_output = self.attention(hidden_states, cu_seqlens, seqlen, 328 subset_idx, indices, attn_mask, bias) 329 layer_output = self.mlp(attention_output) 330 return layer_output
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, *kwargs) 1496 # If we don't have any hooks, we want to skip the rest of the logic in 1497 # this function, and just call forward. 1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1499 or _global_backward_pre_hooks or _global_backward_hooks 1500 or _global_forward_hooks or _global_forward_pre_hooks): -> 1501 return forward_call(args, **kwargs) 1502 # Do not call functions when jit is used 1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py:240, in BertUnpadAttention.forward(self, input_tensor, cu_seqlens, max_s, subset_idx, indices, attn_mask, bias) 218 def forward( 219 self, 220 input_tensor: torch.Tensor, (...) 226 bias: Optional[torch.Tensor] = None, 227 ) -> torch.Tensor: 228 """Forward pass for scaled self-attention without padding. 229 230 Arguments: (...) 238 bias: None or (batch, heads, max_seqlen_in_batch, max_seqlen_in_batch) 239 """ --> 240 self_output = self.self(input_tensor, cu_seqlens, max_s, indices, 241 attn_mask, bias) 242 if subset_idx is not None: 243 return self.output(index_first_axis(self_output, subset_idx), 244 index_first_axis(input_tensor, subset_idx))
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, *kwargs) 1496 # If we don't have any hooks, we want to skip the rest of the logic in 1497 # this function, and just call forward. 1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1499 or _global_backward_pre_hooks or _global_backward_hooks 1500 or _global_forward_hooks or _global_forward_pre_hooks): -> 1501 return forward_call(args, **kwargs) 1502 # Do not call functions when jit is used 1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/bert_layers.py:181, in BertUnpadSelfAttention.forward(self, hidden_states, cu_seqlens, max_seqlen_in_batch, indices, attn_mask, bias) 179 bias_dtype = bias.dtype 180 bias = bias.to(torch.float16) --> 181 attention = flash_attn_qkvpacked_func(qkv, bias) 182 attention = attention.to(orig_dtype) 183 bias = bias.to(bias_dtype)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/torch/autograd/function.py:506, in Function.apply(cls, *args, *kwargs) 503 if not torch._C._are_functorch_transforms_active(): 504 # See NOTE: [functorch vjp and autograd interaction] 505 args = _functorch.utils.unwrap_dead_wrappers(args) --> 506 return super().apply(args, **kwargs) # type: ignore[misc] 508 if cls.setup_context == _SingleLevelFunction.setup_context: 509 raise RuntimeError( 510 'In order to use an autograd.Function with functorch transforms ' 511 '(vmap, grad, jvp, jacrev, ...), it must override the setup_context ' 512 'staticmethod. For more details, please see ' 513 'https://pytorch.org/docs/master/notes/extending.func.html')
File ~/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/flash_attn_triton.py:1021, in _FlashAttnQKVPackedFunc.forward(ctx, qkv, bias, causal, softmax_scale) 1019 if qkv.stride(-1) != 1: 1020 qkv = qkv.contiguous() -> 1021 o, lse, ctx.softmax_scale = _flash_attn_forward( 1022 qkv[:, :, 0], 1023 qkv[:, :, 1], 1024 qkv[:, :, 2], 1025 bias=bias, 1026 causal=causal, 1027 softmax_scale=softmax_scale) 1028 ctx.save_for_backward(qkv, o, lse, bias) 1029 ctx.causal = causal
File ~/.cache/huggingface/modules/transformers_modules/zhihan1996/DNABERT-2-117M/5fd206e1a13cee3ef4a608677312175eb6f8143d/flash_attn_triton.py:826, in _flash_attn_forward(q, k, v, bias, causal, softmax_scale) 823 # BLOCK = 128 824 # num_warps = 4 if d <= 64 else 8 825 grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch nheads) --> 826 _fwd_kernel[grid]( # type: ignore 827 q, 828 k, 829 v, 830 bias, 831 o, 832 lse, 833 tmp, 834 softmax_scale, 835 q.stride(0), 836 q.stride(2), 837 q.stride(1), 838 k.stride(0), 839 k.stride(2), 840 k.stride(1), 841 v.stride(0), 842 v.stride(2), 843 v.stride(1), 844 bias_strides, 845 o.stride(0), 846 o.stride(2), 847 o.stride(1), 848 nheads, 849 seqlen_q, 850 seqlen_k, 851 seqlen_q_rounded, 852 d, 853 seqlen_q // 32, 854 seqlen_k // 32, # key for triton cache (limit number of compilations) 855 # Can't use kwargs here because triton autotune expects key to be args, not kwargs 856 # IS_CAUSAL=causal, BLOCK_HEADDIM=d, 857 bias_type, 858 causal, 859 BLOCK_HEADDIM, 860 # BLOCK_M=BLOCK, BLOCK_N=BLOCK, 861 # num_warps=num_warps, 862 # num_stages=1, 863 ) 864 return o, lse, softmax_scale
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/runtime/autotuner.py:90, in Autotuner.run(self, *args, *kwargs) 88 if config.pre_hook is not None: 89 config.pre_hook(self.nargs) ---> 90 return self.fn.run(args, num_warps=config.num_warps, num_stages=config.num_stages, kwargs, config.kwargs)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/runtime/autotuner.py:199, in Heuristics.run(self, *args, kwargs) 197 for v, heur in self.values.items(): 198 kwargs[v] = heur({dict(zip(self.arg_names, args)), *kwargs}) --> 199 return self.fn.run(args, **kwargs)
File:41, in _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE, IS_CAUSAL, BLOCK_HEADDIM, EVEN_M, EVEN_N, EVEN_HEADDIM, BLOCK_M, BLOCK_N, grid, num_warps, num_stages, extern_libs, stream, warmup)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:1621, in compile(fn, **kwargs) 1619 next_module = parse(path) 1620 else: -> 1621 next_module = compile(module) 1622 fn_cache_manager.put(next_module, f"{name}.{ir}") 1623 if os.path.exists(path):
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:1550, in compile..(src)
1545 extern_libs = kwargs.get("extern_libs", dict())
1546 # build compilation stages
1547 stages = {
1548 "ast": (lambda path: fn, None),
1549 "ttir": (lambda path: parse_mlir_module(path, context),
-> 1550 lambda src: ast_to_ttir(src, signature, configs[0], constants)),
1551 "ttgir": (lambda path: parse_mlir_module(path, context),
1552 lambda src: ttir_to_ttgir(src, num_warps, num_stages, capability)),
1553 "llir": (lambda path: Path(path).read_text(),
1554 lambda src: ttgir_to_llir(src, extern_libs, capability)),
1555 "ptx": (lambda path: Path(path).read_text(),
1556 lambda src: llir_to_ptx(src, capability)),
1557 "cubin": (lambda path: Path(path).read_bytes(),
1558 lambda src: ptx_to_cubin(src, capability))
1559 }
1560 # find out the signature of the function
1561 if isinstance(fn, triton.runtime.JITFunction):
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:962, in ast_to_ttir(fn, signature, specialization, constants) 961 def ast_tottir(fn, signature, specialization, constants): --> 962 mod, = build_triton_ir(fn, signature, specialization, constants) 963 return optimize_triton_ir(mod)
File ~/anaconda3/envs/pytorch_python38/lib/python3.8/site-packages/triton/compiler.py:942, in build_triton_ir(fn, signature, specialization, constants) 940 if node is None or isinstance(e, (NotImplementedError, CompilationError)): 941 raise e --> 942 raise CompilationError(fn.src, node) from e 943 ret = generator.module 944 # module takes ownership of the context