Open philip-kuhn opened 10 months ago
@philip-kuhn Is this still an issue? My team and I also added Qdrant as an add-on to ACA so you can just use that as well: https://learn.microsoft.com/en-us/azure/container-apps/add-ons-qdrant
If this is still happening on a current ACA deployment, please file a support case.
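For anyone trying the add-on, a minimal connectivity check with the official Python client might look like the sketch below. This is only a sketch: the endpoint URL and API key are placeholders, not values from this thread.

# Minimal sketch: verify a Qdrant endpoint (e.g. the ACA add-on) is reachable.
# The URL and API key below are placeholders, not values from this issue.
from qdrant_client import QdrantClient

client = QdrantClient(
    url="https://<your-qdrant-endpoint>:6333",  # placeholder endpoint
    api_key="<api-key-if-configured>",          # omit if auth is not enabled
)

# Lists existing collections; raises if the service is unreachable or unhealthy.
print(client.get_collections())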
Hi Tara
When last I looked at it, it was. We needed to get our chat service up and running again, and I wasn't prepared to recreate the resource and restore the data a second time, only for this to happen again for a third time within a month. We moved to Azure AI Search.
Thanks for your efforts
Hi guys. We've been running a stock-standard Azure Container Apps deployment successfully since December 4th 2023. It had been running fine, with successful data querying, as of COB last Friday. Since Monday morning the container has been crashing and can't start up. As far as I'm aware, nothing ran or was done on our side, by either an automated or a human process, that did anything to the resource. This is the second deployment this has happened to (it ran well for a few weeks, then suddenly the container started crashing) and I'm struggling to understand why. The log stream shows:
2024-01-17T06:32:52.25782 Connecting to the container 'qdrantapicontainerapp'...
2024-01-17T06:32:52.27576 Successfully Connected to container: 'qdrantapicontainerapp' [Revision: 'sygniasynapseqdranthttp--0tfisge-567f7bd697-5hr52', Replica: 'sygniasynapseqdranthttp--0tfisge']
2024-01-17T06:32:37.835011814Z 2: std::panicking::rust_panic_with_hook
2024-01-17T06:32:37.835016242Z        at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:735:13
2024-01-17T06:32:37.835020700Z 3: std::panicking::begin_panic_handler::{{closure}}
2024-01-17T06:32:37.835024728Z        at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:609:13
2024-01-17T06:32:37.835028695Z 4: std::sys_common::backtrace::rust_end_short_backtrace
2024-01-17T06:32:37.835032312Z        at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:170:18
2024-01-17T06:32:37.835037161Z 5: rust_begin_unwind
2024-01-17T06:32:37.835041559Z        at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
2024-01-17T06:32:37.835046238Z 6: core::panicking::panic_fmt
2024-01-17T06:32:37.835049925Z        at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
2024-01-17T06:32:37.835055635Z 7: collection::shards::shard_holder::ShardHolder::load_shards::{{closure}}.110038
2024-01-17T06:32:37.835059663Z 8: storage::content_manager::toc::TableOfContent::new
2024-01-17T06:32:37.835063911Z 9: qdrant::main
2024-01-17T06:32:37.835067928Z 10: std::sys_common::backtrace::__rust_begin_short_backtrace
2024-01-17T06:32:37.835072066Z 11: main
2024-01-17T06:32:37.835075943Z 12:
2024-01-17T06:32:37.835079880Z 13: __libc_start_main
2024-01-17T06:32:37.835083507Z 14: _start
2024-01-17T06:32:37.835086994Z
2024-01-17T06:32:37.835092183Z 2024-01-17T06:32:37.834886Z ERROR qdrant::startup: Panic occurred in file /qdrant/lib/collection/src/shards/replica_set/mod.rs at line 246: Failed to load local shard "./storage/collections/[redacted]/0": Service internal error: RocksDB open error: IO error: No such file or directory: while unlink() file: ./storage/collections/[redacted]/0/segments/23a17757-59d1-4649-acbb-7b5b183af4bb/LOG.old.1705084754144915: No such file or directory
If I browse to the file it's looking for in the Azure portal, it's reported as being marked for deletion by an SMB client. As far as I know, no human action did this, and all other files are accessible. This is also the only file that has contents; all the other LOG.old files are 0 bytes. We can't delete the file because it's already marked for deletion, and I can't upload any sort of replacement file, so short of redeploying everything I'm not sure where to go from here. I set the soft-delete period to the minimum (1 day) in the hope that once the file was deleted it would sort itself out, but the file hasn't been deleted and is still present but inaccessible. I'm really hoping I don't have to do a complete redeploy to fix this, so any assistance you can give to help understand why this has happened would be highly appreciated.
Thanks so much
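In case it helps anyone investigating the same state, here is a minimal sketch of inspecting the stuck file directly on the underlying Azure Files share with the azure-storage-file-share SDK. The connection string, share name, and directory path are placeholders (the collection ID is redacted above, and the path inside the share depends on where it is mounted), and the delete call may still fail while the SMB delete-pending flag is set.

# Sketch: inspect the segment directory backing the Qdrant storage mount on
# Azure Files. Connection string, share name, and paths are placeholders.
from azure.storage.fileshare import ShareDirectoryClient, ShareFileClient

CONN_STR = "<storage-account-connection-string>"
SHARE = "<file-share-mounted-into-the-container-app>"
SEGMENT_DIR = "collections/<redacted>/0/segments/23a17757-59d1-4649-acbb-7b5b183af4bb"

# List everything in the segment directory, including the zero-byte LOG.old files.
dir_client = ShareDirectoryClient.from_connection_string(
    CONN_STR, share_name=SHARE, directory_path=SEGMENT_DIR
)
for item in dir_client.list_directories_and_files():
    print(item["name"], getattr(item, "size", None))

# Check the specific file the panic points at.
file_client = ShareFileClient.from_connection_string(
    CONN_STR, share_name=SHARE,
    file_path=f"{SEGMENT_DIR}/LOG.old.1705084754144915",
)
props = file_client.get_file_properties()
print(props.size, props.last_modified)

# file_client.delete_file()  # may still fail while the delete-pending flag is set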
This issue is for a: (mark with an x)
Minimal steps to reproduce
Any log messages given by the failure
Expected/desired behavior
That the working deployment continues to work
OS and Version?
Versions
Mention any other details that might be useful