Azure-Samples / qdrant-azure

Qdrant Vector Database on Azure Cloud
MIT License

Working deployment stopped working #32

Open philip-kuhn opened 7 months ago

philip-kuhn commented 7 months ago

Hi guys. We've been running a stock-standard Azure Container Apps deployment successfully since December 4th, 2023. It had been running fine, with successful data querying, as of COB last Friday. Since Monday morning the container has been crashing and can't start up. As far as I'm aware, nothing ran or was done on our side, by any automated or human process, that touched the resource. This is the second deployment this has happened to (it ran well for a few weeks, then the container suddenly started crashing), and I'm struggling to understand why. The log stream shows:

```
2024-01-17T06:32:52.25782 Connecting to the container 'qdrantapicontainerapp'...
2024-01-17T06:32:52.27576 Successfully Connected to container: 'qdrantapicontainerapp' [Revision: 'sygniasynapseqdranthttp--0tfisge-567f7bd697-5hr52', Replica: 'sygniasynapseqdranthttp--0tfisge']
2024-01-17T06:32:37.835011814Z    2: std::panicking::rust_panic_with_hook
2024-01-17T06:32:37.835016242Z          at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:735:13
2024-01-17T06:32:37.835020700Z    3: std::panicking::begin_panic_handler::{{closure}}
2024-01-17T06:32:37.835024728Z          at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:609:13
2024-01-17T06:32:37.835028695Z    4: std::sys_common::backtrace::__rust_end_short_backtrace
2024-01-17T06:32:37.835032312Z          at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/sys_common/backtrace.rs:170:18
2024-01-17T06:32:37.835037161Z    5: rust_begin_unwind
2024-01-17T06:32:37.835041559Z          at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
2024-01-17T06:32:37.835046238Z    6: core::panicking::panic_fmt
2024-01-17T06:32:37.835049925Z          at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
2024-01-17T06:32:37.835055635Z    7: collection::shards::shard_holder::ShardHolder::load_shards::{{closure}}.110038
2024-01-17T06:32:37.835059663Z    8: storage::content_manager::toc::TableOfContent::new
2024-01-17T06:32:37.835063911Z    9: qdrant::main
2024-01-17T06:32:37.835067928Z   10: std::sys_common::backtrace::__rust_begin_short_backtrace
2024-01-17T06:32:37.835072066Z   11: main
2024-01-17T06:32:37.835075943Z   12:
2024-01-17T06:32:37.835079880Z   13: __libc_start_main
2024-01-17T06:32:37.835083507Z   14: _start
2024-01-17T06:32:37.835086994Z
2024-01-17T06:32:37.835092183Z 2024-01-17T06:32:37.834886Z ERROR qdrant::startup: Panic occurred in file /qdrant/lib/collection/src/shards/replica_set/mod.rs at line 246: Failed to load local shard "./storage/collections/[redacted]/0": Service internal error: RocksDB open error: IO error: No such file or directory: while unlink() file: ./storage/collections/[redacted]/0/segments/23a17757-59d1-4649-acbb-7b5b183af4bb/LOG.old.1705084754144915: No such file or directory
```
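For anyone hitting the same panic, the path of the file RocksDB failed to unlink can be pulled straight out of the `ERROR qdrant::startup` line. This is just a sketch: the sample `LOG_LINE` below is abbreviated from the trace above with a made-up collection name, and the `sed` pattern assumes the same log format.

```shell
# Sample panic line (abbreviated; collection name "mycol" is a placeholder).
LOG_LINE='ERROR qdrant::startup: Panic occurred ... while unlink() file: ./storage/collections/mycol/0/segments/23a17757-59d1-4649-acbb-7b5b183af4bb/LOG.old.1705084754144915: No such file or directory'

# Capture everything between "unlink() file: " and the next ":" --
# the path itself contains no colons, so this isolates it cleanly.
STALE_FILE=$(printf '%s\n' "$LOG_LINE" | sed -n 's/.*unlink() file: \([^:]*\):.*/\1/p')
echo "$STALE_FILE"
```

With the exact path in hand, you can locate the corresponding file on the Azure Files share in the portal, as described below.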

If I browse to the file it's looking for in the Azure portal, it's reported as being marked for deletion by an SMB client. As far as I know no human action did this, and all other files are accessible. This is also the only such file that has contents; all the other LOG.old files are 0 bytes. We can't delete the file because it's already marked for deletion, so I can't upload any sort of replacement, and short of redeploying everything I'm not sure where to go from here. I set the soft-delete period to the minimum (1 day) in the hope that once the file was deleted it would sort itself out, but the file hasn't been deleted and is still present but inaccessible. I'm really hoping I don't have to do a complete redeploy to fix this, so any assistance in understanding why this has happened would be highly appreciated.
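As a side note, the "only non-empty LOG.old" observation is easy to check from the command line if you have the share's contents mounted or copied locally. A minimal sketch, assuming the Qdrant `./storage` layout shown in the panic message:

```shell
# List RocksDB LOG.old files under the Qdrant storage tree and print only
# the non-empty ones -- in this report, only the stale, undeletable file
# had any contents; every other LOG.old file was 0 bytes.
find ./storage/collections -name 'LOG.old.*' -size +0c -print
```

Any path this prints is a candidate for the file the panic is complaining about.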

Thanks so much

Please provide us with the following information:

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

No idea. It was working fine, and then something marked the LOG.old file for deletion.

Any log messages given by the failure

Expected/desired behavior

That the working deployment continues to work

OS and Version?

Azure Container Apps, so probably Linux

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

tawalke commented 5 months ago

@philip-kuhn Is this still an issue? My team and I also added Qdrant as an add-on to ACA so you can just use that as well: https://learn.microsoft.com/en-us/azure/container-apps/add-ons-qdrant

If it is still not working on ACA as a current deployment, please file a support case.

philip-kuhn commented 5 months ago

Hi Tara

When last I looked at it, it was. We needed to get our chat service up and running again, and I wasn't prepared to recreate the resource and restore the data a second time, only for this to happen again, for a third time, in a month. We moved to Azure AI Search.

Thanks for your efforts