ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

Internal error while accessing SQLite database: locking protocol #2214

Open grid-beep opened 1 year ago

grid-beep commented 1 year ago

When adapting Mask2Former to Pytorch-ROCm, I am facing a MIOpen Error: /.../data/driver/MLOpen/src/sqlite_db.cpp:209: Internal error while accessing SQLite database: locking protocol.

Python: 3.8.16
GPU: GFX90
PyTorch was installed with: pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/rocm5.2

The error logs are as follows:

MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/sqlite_db.cpp:209: Internal error while accessing SQLite database: locking protocol
MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/sqlite_db.cpp:209: Internal error while accessing SQLite database: locking protocol
MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/sqlite_db.cpp:209: Internal error while accessing SQLite database: locking protocol
MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/sqlite_db.cpp:209: Internal error while accessing SQLite database: locking protocol
Traceback (most recent call last):
  File "proj_nococo/train_net_video.py", line 280, in <module>
    launch(
  File "/patha/detectron2/detectron2/engine/launch.py", line 69, in launch
    mp.start_processes(
  File "/pathb/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/pathb/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/pathb/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/patha/detectron2/detectron2/engine/launch.py", line 123, in _distributed_worker
    main_func(*args)
  File "/patha/proj/proj_nococo/train_net_video.py", line 274, in main
    return trainer.train()
  File "/patha/detectron2/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/patha/detectron2/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/patha/detectron2/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/patha/detectron2/detectron2/engine/train_loop.py", line 492, in run_step
    loss_dict = self.model(data)
  File "/pathb/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/pathb/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/pathb/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/pathb/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/patha/proj/proj_nococo/mask2former_video/video_maskformer_model.py", line 458, in forward
    features = self.backbone(images.tensor)
  File "/pathb/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/patha/detectron2/detectron2/modeling/backbone/resnet.py", line 445, in forward
    x = self.stem(x)
  File "/pathb/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/patha/detectron2/detectron2/modeling/backbone/resnet.py", line 356, in forward
    x = self.conv1(x)
  File "/pathb/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/patha/detectron2/detectron2/layers/wrappers.py", line 127, in forward
    x = F.conv2d(
RuntimeError: miopenStatusInternalError

Logs with MIOPEN_LOG_LEVEL=6: proj.log

xb534 commented 1 year ago

I met the same problem.

JehandadKhan commented 1 year ago

@xb534 and @grid-beep Thanks for reaching out. It looks like Python's forking is not playing nicely with our use of SQLite3. Could you please try again after setting the following environment variable in your shell?

export MIOPEN_DEBUG_DISABLE_SQL_WAL=1

This will disable the write-ahead log (WAL) for SQLite. If the issue persists, please share additional logs.
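For training scripts that are launched from Python rather than from a shell, the same variable can be set inside the script. The sketch below is only an illustration: it assumes the variable has to be in the environment before MIOpen is first used (so it is set before any torch import), and the helper name child_sees_flag is hypothetical. It also checks that a child process inherits the setting, which is what worker processes started by a launcher would rely on.

```python
import os
import subprocess
import sys

# Set the variable suggested above before torch (and therefore MIOpen) is imported.
os.environ["MIOPEN_DEBUG_DISABLE_SQL_WAL"] = "1"


def child_sees_flag():
    """Start a child interpreter and report the value it sees for the variable.

    Hypothetical helper: it only demonstrates that child processes inherit
    the parent's environment. It does not touch MIOpen itself.
    """
    result = subprocess.run(
        [
            sys.executable,
            "-c",
            "import os; print(os.environ.get('MIOPEN_DEBUG_DISABLE_SQL_WAL'))",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# import torch  # import only after the environment is prepared
```

Calling child_sees_flag() should report the value set in the parent, which confirms that the setting reaches subprocesses without editing shell profiles.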

xb534 commented 1 year ago

Thank you for your response. Adding export MIOPEN_USER_DB_PATH="xxx" to my shell file resolved the "locking protocol" issue; however, I still don't understand why this approach works.

The next time I encounter this problem and cannot solve it with the method above, I will try the method you suggested and reply here.
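Why relocating the user database helps is not explained in this thread. One possible reading, offered only as an assumption, is that the default location sits on a filesystem where SQLite's file locking does not behave as expected, or that several worker processes contend for the same file. Under that assumption, a sketch like the following gives each worker its own directories on local storage. The helper name use_private_miopen_dirs and the directory layout are hypothetical; MIOPEN_USER_DB_PATH is the variable mentioned above, while MIOPEN_CUSTOM_CACHE_DIR is not mentioned in this thread and is included on the assumption that it relocates MIOpen's kernel cache.

```python
import os
import tempfile


def use_private_miopen_dirs(rank, base=None):
    """Point MIOpen's user database and kernel cache at per-rank directories.

    Hypothetical helper. `rank` is the worker index supplied by the launcher;
    `base` defaults to a folder under the system temp directory, which is
    assumed (not verified) to be node-local storage.
    """
    if base is None:
        base = os.path.join(tempfile.gettempdir(), "miopen")
    user_db = os.path.join(base, f"rank{rank}", "db")
    cache = os.path.join(base, f"rank{rank}", "cache")
    os.makedirs(user_db, exist_ok=True)
    os.makedirs(cache, exist_ok=True)
    # Variable used in the comment above.
    os.environ["MIOPEN_USER_DB_PATH"] = user_db
    # Assumption: this variable controls the kernel cache location.
    os.environ["MIOPEN_CUSTOM_CACHE_DIR"] = cache
    return user_db, cache
```

This would need to run in each worker as early as possible; at which point MIOpen actually reads these variables has not been verified here. The trade-off is that tuning results and compiled kernels are no longer shared between ranks, so each worker may repeat work the others have already done.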

xb534 commented 1 year ago

@JehandadKhan Unfortunately, the same problem happened again. The log files follow, first without and then with MIOPEN_DEBUG_DISABLE_SQL_WAL=1 set.

=========================================================================

Not set export MIOPEN_DEBUG_DISABLE_SQL_WAL=1 MIOpen(HIP): Info [get_device_name] Raw device name: gfx90a:sramecc+:xnack- MIOpen(HIP): Info [Handle] stream: 0, device_id: 0 MIOpen(HIP): Info [get_device_name] Raw device name: gfx90a:sramecc+:xnack- MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0 MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){ MIOpen(HIP): tensorDesc = 0x18d933e0 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int *, int *){ MIOpen(HIP): tensorDesc = MIOpen(HIP): dataType = 1 MIOpen(HIP): nbDims = 4 MIOpen(HIP): dim.values = { 1 3 1024 1024 } MIOpen(HIP): stride.values = { 3145728 1048576 1024 1 } MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){ MIOpen(HIP): tensorDesc = 0x18d933e0 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int *, int *){ MIOpen(HIP): tensorDesc = MIOpen(HIP): dataType = 1 MIOpen(HIP): nbDims = 4 MIOpen(HIP): dim.values = { 1280 3 16 16 } MIOpen(HIP): stride.values = { 768 256 16 1 } MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){ MIOpen(HIP): tensorDesc = 0x150a453b41c4 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int *, int *){ MIOpen(HIP): tensorDesc = MIOpen(HIP): dataType = 1 MIOpen(HIP): nbDims = 4 MIOpen(HIP): dim.values = { 1 1280 64 64 } MIOpen(HIP): stride.values = { 5242880 4096 64 1 } MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenCreateConvolutionDescriptor(miopenConvolutionDescriptor_t *){ MIOpen(HIP): convDesc = 0x1 MIOpen(HIP): } MIOpen(HIP): Info [GetFindModeValueImpl] MIOPEN_FIND_MODE = DYNAMIC_HYBRID(5) MIOpen(HIP): miopenStatus_t miopenInitConvolutionNdDescriptor(miopenConvolutionDescriptor_t, int, int *, int *, int *, miopenConvolutionMode_t){ 
MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1}, MIOpen(HIP): spatialDim = 2 MIOpen(HIP): pads = { 0 0 } MIOpen(HIP): strides = { 16 16 } MIOpen(HIP): dilations = { 1 1 } MIOpen(HIP): c_mode = 0 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenSetConvolutionGroupCount(miopenConvolutionDescriptor_t, int){ MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): groupCount = 1 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenConvolutionForwardGetWorkSpaceSize(miopenHandle_t, const miopenTensorDescriptor_t, const miopenTensorDescriptor_t, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, size_t *){ MIOpen(HIP): handle = stream: 0, device_id: 0 MIOpen(HIP): wDesc = 1280, 3, 16, 16 MIOpen(HIP): yDesc = 1, 1280, 64, 64 MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): workSpaceSize = 0 MIOpen(HIP): } MIOpen(HIP): Info [ForwardGetWorkSpaceSize] MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv3, HIP version 5.2.22304, MIOpen version 2.17.0.a82233c22 MIOpen(HIP): Info [GetForwardSolutions] MIOpen(HIP): Info2 [GetLibPath] Lib Path: /opt/rocm-5.2.3/lib/libMIOpen.so.1.0.50203 MIOpen(HIP): Info2 [GetInstalledPathFile] Found exact find database file: /opt/rocm-5.2.3/share/miopen/db/gfx90a6e.HIP.fdb.txt MIOpen(HIP): Info [Measure] ReadonlyRamDb::Prefetch time: 135.998 ms MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 20.0067 ms MIOpen(HIP): Info2 [ValidateUnsafe] DB file is newer than cache: 5141756518495091, 911071500580951 MIOpen(HIP): Info2 [FindRecord] RamDb file is newer than cache, prefetching MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.249534 ms MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 3-1024-1024-16x16-1280-64-64-1-0x0-16x16-1x1-0-NCHW-FP32-F in cache for file /users/anwerrao/.config/miopen//gfx90a6e.HIP.2_17_0_a82233c22.ufdb.txt 
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 39.2806 ms MIOpen(HIP): Info2 [ForwardGetWorkSpaceSize] 12582912 MIOpen(HIP): miopenStatus_t miopenFindConvolutionForwardAlgorithm(miopenHandle_t, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, void *, const int, int *, miopenConvAlgoPerf_t *, void *, size_t, bool){ MIOpen(HIP): handle = stream: 0, device_id: 0 MIOpen(HIP): xDesc = 1, 3, 1024, 1024 MIOpen(HIP): x = 0x15048b200000 MIOpen(HIP): wDesc = 1280, 3, 16, 16 MIOpen(HIP): w = 0x150541b00000 MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): yDesc = 1, 1280, 64, 64 MIOpen(HIP): y = 0x150489c00000 MIOpen(HIP): requestAlgoCount = 1 MIOpen(HIP): returnedAlgoCount = 792077816 MIOpen(HIP): perfResults = MIOpen(HIP): workSpace = 0x1505b8640000 MIOpen(HIP): workSpaceSize = 12582912 MIOpen(HIP): exhaustiveSearch = 0 MIOpen(HIP): } MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 12582912 MIOpen(HIP): Info [GetForwardSolutions] MIOpen(HIP): Info2 [ValidateUnsafe] DB file is newer than cache: 5141756518495091, 911071539999861 MIOpen(HIP): Info2 [FindRecord] RamDb file is newer than cache, prefetching MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.048574 ms MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 3-1024-1024-16x16-1280-64-64-1-0x0-16x16-1x1-0-NCHW-FP32-F in cache for file /users/anwerrao/.config/miopen//gfx90a6e.HIP.2_17_0_a82233c22.ufdb.txt MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.241298 ms MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 3x1024x1024x16x16x1280x64x64x1xNCHWxFP32x0x0x16x16x1x1x1xF and solver GemmFwdRest MIOpen(HIP): Info [GetPerfDbPathFile] Found exact perf database file MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
/opt/rocm-5.2.3/share/miopen/db/gfx90a6e.db MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /users/anwerrao/.config/miopen/gfx90a6e_1.1.0.udb MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable) MIOpen(HIP): Info2 [Register] Invoker registered for algorithm 3x1024x1024x16x16x1280x64x64x1xNCHWxFP32x0x0x16x16x1x1x1xF and solver GemmFwdRest MIOpen(HIP): Info2 [SetAsFound1_0] Solver GemmFwdRest registered as find 1.0 best for miopenConvolutionFwdAlgoGEMM in 3x1024x1024x16x16x1280x64x64x1xNCHWxFP32x0x0x16x16x1x1x1xF MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM 0.392162 12582912 MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 12582912, 0.392162 MIOpen(HIP): miopenStatus_t miopenConvolutionForward(miopenHandle_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, miopenConvFwdAlgorithm_t, const void *, const miopenTensorDescriptor_t, void *, void *, size_t){ MIOpen(HIP): handle = stream: 0, device_id: 0 MIOpen(HIP): alpha = 0x7ffe2f3627e0 MIOpen(HIP): xDesc = 1, 3, 1024, 1024 MIOpen(HIP): x = 0x15048b200000 MIOpen(HIP): wDesc = 1280, 3, 16, 16 MIOpen(HIP): w = 0x150541b00000 MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): algo = 0 MIOpen(HIP): beta = 0x7ffe2f3627e8 MIOpen(HIP): yDesc = 1, 1280, 64, 64 MIOpen(HIP): y = 0x150489c00000 MIOpen(HIP): workSpace = 0x1505b8640000 MIOpen(HIP): workSpaceSize = 12582912 MIOpen(HIP): } MIOpen(HIP): Command [LogCmdConvolution] ./bin/MIOpenDriver conv -n 1 -c 3 -H 1024 -W 1024 -k 1280 -y 16 -x 16 -p 0 -q 0 -u 16 -v 16 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 12582912 MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 3x1024x1024x16x16x1280x64x64x1xNCHWxFP32x0x0x16x16x1x1x1xF and algorithm miopenConvolutionFwdAlgoGEMM MIOpen(HIP): auto 
miopen::solver::GemmFwdRest::GetSolution(const miopen::ExecutionContext &, const conv::ProblemDescription &)::(anonymous class)::operator()(const std::vector &)::(anonymous class)::operator()(const miopen::Handle &, const miopen::AnyInvokeParams &) const{ MIOpen(HIP): name + ", non 1x1" = convolution, non 1x1 MIOpen(HIP): } MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: miopenIm2d2Col "c3i1024_1024w16_16p0_0s16_16d1_1t1" MIOpen(HIP): Info2 [AddKernel] Key: miopenIm2Col "c3i1024_1024w16_16p0_0s16_16d1_1t1" MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file MIOpen(HIP): Info [KernDb] database not present MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /users/anwerrao/.cache/miopen/2.17.0.a82233c22/gfx90a6e.ukdb MIOpen(HIP): Info2 [Exec] PRAGMA journal_mode=WAL; MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/sqlite_db.cpp:209: Internal error while accessing SQLite database: locking protocol MIOpen(HIP): miopenStatus_t miopenDestroyConvolutionDescriptor(miopenConvolutionDescriptor_t){ MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){ MIOpen(HIP): tensorDesc = 1280, 3, 16, 16 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){ MIOpen(HIP): tensorDesc = 1, 1280, 64, 64 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){ MIOpen(HIP): tensorDesc = 1, 3, 1024, 1024 MIOpen(HIP): } /scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3191.) 
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined] Traceback (most recent call last): File "train_net.py", line 305, in launch( File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/detectron2/detectron2/engine/launch.py", line 82, in launch main_func(*args) File "train_net.py", line 292, in main res = Trainer.test(cfg, model) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/detectron2/detectron2/engine/defaults.py", line 608, in test results_i = inference_on_dataset(model, data_loader, evaluator) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/detectron2/detectron2/evaluation/evaluator.py", line 158, in inference_on_dataset outputs = model(inputs) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/ov-seg/open_vocab_seg/ovseg_model.py", line 209, in forward images_annotations = [self.sammaskgenerator.generate(image) for image in sam_images] File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/ov-seg/open_vocab_seg/ovseg_model.py", line 209, in images_annotations = [self.sammaskgenerator.generate(image) for image in sam_images] File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/automatic_mask_generator.py", line 163, in generate mask_data = self._generate_masks(image) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/automatic_mask_generator.py", line 206, in _generate_masks crop_data = self._process_crop(image, crop_box, layer_idx, orig_size) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/automatic_mask_generator.py", line 236, in _process_crop self.predictor.set_image(cropped_im) File 
"/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/predictor.py", line 60, in set_image self.set_torch_image(input_image_torch, image.shape[:2]) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/predictor.py", line 89, in set_torch_image self.features = self.model.image_encoder(input_image) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/modeling/image_encoder.py", line 107, in forward x = self.patch_embed(x) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/modeling/image_encoder.py", line 392, in forward x = self.proj(x) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward return self._conv_forward(input, self.weight, self.bias) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward return F.conv2d(input, weight, bias, self.stride, RuntimeError: miopenStatusInternalError
Set MIOPEN_DEBUG_DISABLE_SQL_WAL=1 MIOpen(HIP): Info [get_device_name] Raw device name: gfx90a:sramecc+:xnack- MIOpen(HIP): Info [Handle] stream: 0, device_id: 0 MIOpen(HIP): Info [get_device_name] Raw device name: gfx90a:sramecc+:xnack- MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0 MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){ MIOpen(HIP): tensorDesc = 0x32a34b30 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int *, int *){ MIOpen(HIP): tensorDesc = MIOpen(HIP): dataType = 1 MIOpen(HIP): nbDims = 4 MIOpen(HIP): dim.values = { 1 3 1024 1024 } MIOpen(HIP): stride.values = { 3145728 1048576 1024 1 } MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){ MIOpen(HIP): tensorDesc = 0x32a34b30 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int *, int *){ MIOpen(HIP): tensorDesc = MIOpen(HIP): dataType = 1 MIOpen(HIP): nbDims = 4 MIOpen(HIP): dim.values = { 1280 3 16 16 } MIOpen(HIP): stride.values = { 768 256 16 1 } MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){ MIOpen(HIP): tensorDesc = 0x148a3bca31c4 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, int *, int *){ MIOpen(HIP): tensorDesc = MIOpen(HIP): dataType = 1 MIOpen(HIP): nbDims = 4 MIOpen(HIP): dim.values = { 1 1280 64 64 } MIOpen(HIP): stride.values = { 5242880 4096 64 1 } MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenCreateConvolutionDescriptor(miopenConvolutionDescriptor_t *){ MIOpen(HIP): convDesc = 0x1 MIOpen(HIP): } MIOpen(HIP): Info [GetFindModeValueImpl] MIOPEN_FIND_MODE = DYNAMIC_HYBRID(5) MIOpen(HIP): miopenStatus_t miopenInitConvolutionNdDescriptor(miopenConvolutionDescriptor_t, int, int *, int *, int *, miopenConvolutionMode_t){ MIOpen(HIP): 
convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1}, MIOpen(HIP): spatialDim = 2 MIOpen(HIP): pads = { 0 0 } MIOpen(HIP): strides = { 16 16 } MIOpen(HIP): dilations = { 1 1 } MIOpen(HIP): c_mode = 0 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenSetConvolutionGroupCount(miopenConvolutionDescriptor_t, int){ MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): groupCount = 1 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenConvolutionForwardGetWorkSpaceSize(miopenHandle_t, const miopenTensorDescriptor_t, const miopenTensorDescriptor_t, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, size_t *){ MIOpen(HIP): handle = stream: 0, device_id: 0 MIOpen(HIP): wDesc = 1280, 3, 16, 16 MIOpen(HIP): yDesc = 1, 1280, 64, 64 MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): workSpaceSize = 0 MIOpen(HIP): } MIOpen(HIP): Info [ForwardGetWorkSpaceSize] MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv3, HIP version 5.2.22304, MIOpen version 2.17.0.a82233c22 MIOpen(HIP): Info [GetForwardSolutions] MIOpen(HIP): Info2 [GetLibPath] Lib Path: /opt/rocm-5.2.3/lib/libMIOpen.so.1.0.50203 MIOpen(HIP): Info2 [GetInstalledPathFile] Found exact find database file: /opt/rocm-5.2.3/share/miopen/db/gfx90a6e.HIP.fdb.txt MIOpen(HIP): Info [Measure] ReadonlyRamDb::Prefetch time: 163.594 ms MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 16.442 ms MIOpen(HIP): Info2 [ValidateUnsafe] DB file is newer than cache: 5141756518495091, 911527976280975 MIOpen(HIP): Info2 [FindRecord] RamDb file is newer than cache, prefetching MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.309169 ms MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 3-1024-1024-16x16-1280-64-64-1-0x0-16x16-1x1-0-NCHW-FP32-F in cache for file /users/anwerrao/.config/miopen//gfx90a6e.HIP.2_17_0_a82233c22.ufdb.txt MIOpen(HIP): Info2 
[Measure] Db::FindRecord time: 10.391 ms MIOpen(HIP): Info2 [ForwardGetWorkSpaceSize] 12582912 MIOpen(HIP): miopenStatus_t miopenFindConvolutionForwardAlgorithm(miopenHandle_t, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, void *, const int, int *, miopenConvAlgoPerf_t *, void *, size_t, bool){ MIOpen(HIP): handle = stream: 0, device_id: 0 MIOpen(HIP): xDesc = 1, 3, 1024, 1024 MIOpen(HIP): x = 0x1483ff000000 MIOpen(HIP): wDesc = 1280, 3, 16, 16 MIOpen(HIP): w = 0x148532500000 MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): yDesc = 1, 1280, 64, 64 MIOpen(HIP): y = 0x1483fda00000 MIOpen(HIP): requestAlgoCount = 1 MIOpen(HIP): returnedAlgoCount = 1928470808 MIOpen(HIP): perfResults = MIOpen(HIP): workSpace = 0x1485a4040000 MIOpen(HIP): workSpaceSize = 12582912 MIOpen(HIP): exhaustiveSearch = 0 MIOpen(HIP): } MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 12582912 MIOpen(HIP): Info [GetForwardSolutions] MIOpen(HIP): Info2 [ValidateUnsafe] DB file is newer than cache: 5141756518495091, 911527986835156 MIOpen(HIP): Info2 [FindRecord] RamDb file is newer than cache, prefetching MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.040769 ms MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 3-1024-1024-16x16-1280-64-64-1-0x0-16x16-1x1-0-NCHW-FP32-F in cache for file /users/anwerrao/.config/miopen//gfx90a6e.HIP.2_17_0_a82233c22.ufdb.txt MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.258271 ms MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 3x1024x1024x16x16x1280x64x64x1xNCHWxFP32x0x0x16x16x1x1x1xF and solver GemmFwdRest MIOpen(HIP): Info [GetPerfDbPathFile] Found exact perf database file MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
/opt/rocm-5.2.3/share/miopen/db/gfx90a6e.db MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /users/anwerrao/.config/miopen/gfx90a6e_1.1.0.udb MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable) MIOpen(HIP): Info2 [Register] Invoker registered for algorithm 3x1024x1024x16x16x1280x64x64x1xNCHWxFP32x0x0x16x16x1x1x1xF and solver GemmFwdRest MIOpen(HIP): Info2 [SetAsFound1_0] Solver GemmFwdRest registered as find 1.0 best for miopenConvolutionFwdAlgoGEMM in 3x1024x1024x16x16x1280x64x64x1xNCHWxFP32x0x0x16x16x1x1x1xF MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM 0.392162 12582912 MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 12582912, 0.392162 MIOpen(HIP): miopenStatus_t miopenConvolutionForward(miopenHandle_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, miopenConvFwdAlgorithm_t, const void *, const miopenTensorDescriptor_t, void *, void *, size_t){ MIOpen(HIP): handle = stream: 0, device_id: 0 MIOpen(HIP): alpha = 0x7ffe72f22300 MIOpen(HIP): xDesc = 1, 3, 1024, 1024 MIOpen(HIP): x = 0x1483ff000000 MIOpen(HIP): wDesc = 1280, 3, 16, 16 MIOpen(HIP): w = 0x148532500000 MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): algo = 0 MIOpen(HIP): beta = 0x7ffe72f22308 MIOpen(HIP): yDesc = 1, 1280, 64, 64 MIOpen(HIP): y = 0x1483fda00000 MIOpen(HIP): workSpace = 0x1485a4040000 MIOpen(HIP): workSpaceSize = 12582912 MIOpen(HIP): } MIOpen(HIP): Command [LogCmdConvolution] ./bin/MIOpenDriver conv -n 1 -c 3 -H 1024 -W 1024 -k 1280 -y 16 -x 16 -p 0 -q 0 -u 16 -v 16 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 12582912 MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 3x1024x1024x16x16x1280x64x64x1xNCHWxFP32x0x0x16x16x1x1x1xF and algorithm miopenConvolutionFwdAlgoGEMM MIOpen(HIP): auto 
miopen::solver::GemmFwdRest::GetSolution(const miopen::ExecutionContext &, const conv::ProblemDescription &)::(anonymous class)::operator()(const std::vector &)::(anonymous class)::operator()(const miopen::Handle &, const miopen::AnyInvokeParams &) const{ MIOpen(HIP): name + ", non 1x1" = convolution, non 1x1 MIOpen(HIP): } MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: miopenIm2d2Col "c3i1024_1024w16_16p0_0s16_16d1_1t1" MIOpen(HIP): Info2 [AddKernel] Key: miopenIm2Col "c3i1024_1024w16_16p0_0s16_16d1_1t1" MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file MIOpen(HIP): Info [KernDb] database not present MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /users/anwerrao/.cache/miopen/2.17.0.a82233c22/gfx90a6e.ukdb MIOpen(HIP): Info2 [Exec] CREATE TABLE IF NOT EXISTS `kern_db` (`id` INTEGER PRIMARY KEY ASC,`kernel_name` TEXT NOT NULL,`kernel_args` TEXT NOT NULL,`kernel_blob` BLOB NOT NULL,`kernel_hash` TEXT NOT NULL,`uncompressed_size` INT NOT NULL);CREATE UNIQUE INDEX IF NOT EXISTS `idx_kern_db` ON kern_db(kernel_name, kernel_args); MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/sqlite_db.cpp:209: Internal error while accessing SQLite database: locking protocol MIOpen(HIP): miopenStatus_t miopenDestroyConvolutionDescriptor(miopenConvolutionDescriptor_t){ MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {16, 16}, {1, 1}, MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){ MIOpen(HIP): tensorDesc = 1280, 3, 16, 16 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){ MIOpen(HIP): tensorDesc = 1, 1280, 64, 64 MIOpen(HIP): } MIOpen(HIP): miopenStatus_t miopenDestroyTensorDescriptor(miopenTensorDescriptor_t){ MIOpen(HIP): tensorDesc = 1, 3, 1024, 1024 MIOpen(HIP): } /scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: 
torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3191.) return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined] Traceback (most recent call last): File "train_net.py", line 305, in launch( File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/detectron2/detectron2/engine/launch.py", line 82, in launch main_func(*args) File "train_net.py", line 292, in main res = Trainer.test(cfg, model) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/detectron2/detectron2/engine/defaults.py", line 608, in test results_i = inference_on_dataset(model, data_loader, evaluator) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/detectron2/detectron2/evaluation/evaluator.py", line 158, in inference_on_dataset outputs = model(inputs) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/ov-seg/open_vocab_seg/ovseg_model.py", line 209, in forward images_annotations = [self.sammaskgenerator.generate(image) for image in sam_images] File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/ov-seg/open_vocab_seg/ovseg_model.py", line 209, in images_annotations = [self.sammaskgenerator.generate(image) for image in sam_images] File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/automatic_mask_generator.py", line 163, in generate mask_data = self._generate_masks(image) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/automatic_mask_generator.py", line 206, in _generate_masks crop_data = self._process_crop(image, crop_box, layer_idx, orig_size) File 
"/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/automatic_mask_generator.py", line 236, in _process_crop self.predictor.set_image(cropped_im) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/predictor.py", line 60, in set_image self.set_torch_image(input_image_torch, image.shape[:2]) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/predictor.py", line 89, in set_torch_image self.features = self.model.image_encoder(input_image) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/modeling/image_encoder.py", line 107, in forward x = self.patch_embed(x) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/pfs/lustrep2/scratch/xxx/cjl/xb/project2/segment-anything/segment_anything/modeling/image_encoder.py", line 392, in forward x = self.proj(x) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl return forward_call(*input, **kwargs) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward return self._conv_forward(input, self.weight, self.bias) File "/scratch/xxx/anaconda3/envs/ovseg/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward return F.conv2d(input, weight, bias, self.stride, RuntimeError: miopenStatusInternalError

I've done everything I can to figure this out, but it's a black box to me and I don't know what works. I hope this can be fixed once and for all. Sincere thanks.
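In both logs above, the last database that MIOpen reports initializing before the error is the user kernel cache file /users/anwerrao/.cache/miopen/2.17.0.a82233c22/gfx90a6e.ukdb, not the file under .config/miopen. A small script can pull that information out of a long log automatically. This is a sketch that assumes the message wording seen in these logs ("Initializing user database file ..." and "Internal error while accessing SQLite database"); the function name db_before_each_error is hypothetical.

```python
import re

# Patterns based on the message wording in the logs above (an assumption
# about the log format, not a documented interface).
DB_INIT = re.compile(r"Initializing (system|user) database file\s*(/\S+)?")
SQLITE_ERROR = re.compile(r"Internal error while accessing SQLite database")


def db_before_each_error(log_text):
    """Return, for every SQLite error in the log, the last database file
    MIOpen reported initializing before it as (scope, path), or None."""
    events = []
    for match in DB_INIT.finditer(log_text):
        events.append((match.start(), "db", match.group(1), match.group(2)))
    for match in SQLITE_ERROR.finditer(log_text):
        events.append((match.start(), "err", None, None))
    events.sort(key=lambda event: event[0])

    last_db = None
    culprits = []
    for _, kind, scope, path in events:
        if kind == "db" and path:
            # Entries with an empty path (database not present) are skipped.
            last_db = (scope, path)
        elif kind == "err":
            culprits.append(last_db)
    return culprits
```

It works on both line-per-message logs and logs flattened onto one line, because it orders matches by position rather than by line.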

donglixp commented 7 months ago

The same issue

ppanchad-amd commented 5 months ago

@JehandadKhan Created internal ticket to resolve this issue. Thanks!

jamesxu2 commented 2 months ago

Hi @donglixp @grid-beep @xb534, could you provide some more information on what you were doing when you hit the "Internal error while accessing SQLite database: locking protocol" error? Does this happen during installation, while running one of the Mask2Former demos, during training, or somewhere else?

grid-beep wrote: "When adapting Mask2Former to Pytorch-ROCm"

I assume you're following the Mask2Former install steps, but I'm not sure if you're using the Deformable-DETR kernel that is used throughout the Mask2Former demos (like in /Mask2Former/demo$ python3 demo.py). In particular, the last two install steps include:

cd mask2former/modeling/pixel_decoder/ops
sh make.sh

to build the Deformable-DETR kernel from this repository. Are you converting this CUDA kernel to HIP and compiling it for ROCm hardware like the person in this issue is attempting to do? If so, please provide more information on how you're doing this.

If you can provide a specific set of steps to reproduce this issue, that would help significantly with the investigation.
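One check that might help narrow down a reproduction, independent of Mask2Former and MIOpen, is whether SQLite's WAL mode works at all in the directory where the failing database lives. The sketch below uses only Python's standard sqlite3 module; the function name probe_sqlite_wal and the probe file name are hypothetical, and a passing probe does not rule out problems that only appear when several processes open the file at once.

```python
import os
import sqlite3


def probe_sqlite_wal(directory):
    """Try to create a WAL-mode SQLite database in `directory` and write to it.

    Returns (ok, detail). `ok` is True when WAL mode was enabled and a write
    succeeded; `detail` is the resulting journal mode or the error text.
    """
    path = os.path.join(directory, "wal_probe.sqlite")
    conn = None
    try:
        conn = sqlite3.connect(path, timeout=5.0)
        # SQLite returns the journal mode actually in effect after the request.
        mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
        conn.execute(
            "CREATE TABLE IF NOT EXISTS probe (id INTEGER PRIMARY KEY, note TEXT)"
        )
        conn.execute("INSERT INTO probe (note) VALUES (?)", ("ok",))
        conn.commit()
        return mode.lower() == "wal", mode
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        if conn is not None:
            conn.close()
        # Remove the probe database and any WAL side files it left behind.
        for suffix in ("", "-wal", "-shm"):
            try:
                os.remove(path + suffix)
            except OSError:
                pass
```

Running it once against the directory from the logs (for example the .cache/miopen folder in the home directory) and once against a node-local directory would show whether the two locations behave differently.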


As a final aside: ROCm maintains a repository of pretrained transformers from HuggingFace, which includes Mask2Former. I'm not sure whether this fits your use case, but you may consider trying that instead.