This problem was originally discovered while attempting near-real-time (NRT) streaming from Amazon SQS with a custom .py script that uses PyAirbyte to call read() in a loop.
Amazon SQS has an option to delete messages on read, configurable in the connector settings, and it is explicitly enabled in my source-amazon-sqs configuration.
However, calling read() repeatedly keeps returning the last read data, even though that data should already have been deleted.
Description
After reading the relevant source code, it was decided to use:
result = source.read(cache=None, write_strategy="replace", force_full_refresh=True)
This call uses the default cache (DuckDB), writes to the cache with the "replace" strategy, and forces a full refresh to drain any previously read records.
Contrary to the expected behavior, this DID NOT rewrite the cached dataset; it instead returned the previously stored cache contents.
ADDITIONAL CONTEXT: Zero records were processed at the start of the script; the issue was only identified after running the script manually again and again, especially from a fresh start.
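For reference, a minimal loop of the kind described above might look like the sketch below. The config keys, queue URL, and polling interval are placeholders, not taken from the actual script:

```python
# Hypothetical reproduction sketch -- connector config values are placeholders.
import time

import airbyte as ab

source = ab.get_source(
    "source-amazon-sqs",
    config={
        "queue_url": "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # placeholder
        "region": "us-east-1",
        "delete_messages": True,  # messages should be gone from the queue after each read
    },
)
source.select_all_streams()

while True:
    # Expectation: "replace" + full refresh should drain previously read records,
    # but in practice the same cached records keep coming back.
    result = source.read(cache=None, write_strategy="replace", force_full_refresh=True)
    print(f"processed {result.processed_records} records")
    time.sleep(10)  # placeholder polling interval
```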
Workarounds
Deleting the cache every time the custom script runs, before read(). (The most viable option for preserving the functionality of read().)
Using get_records() to bypass caching entirely. (This introduces complications when multiple streams need to be selected.)
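The first workaround can be sketched as a small helper that wipes the cached DuckDB files before each read(). The cache directory path is a placeholder, since the default cache location may vary by PyAirbyte version:

```python
# Sketch of workaround 1: clear cached DuckDB artifacts before calling read().
from pathlib import Path


def clear_cache(cache_dir: Path) -> None:
    """Delete any DuckDB cache artifacts so the next read() starts clean."""
    if not cache_dir.exists():
        return
    for artifact in cache_dir.glob("*.duckdb*"):  # matches .duckdb and .duckdb.wal
        artifact.unlink()


# Usage (placeholder path -- adjust to your actual default cache location):
# clear_cache(Path(".cache/default_cache"))
# result = source.read(cache=None, write_strategy="replace", force_full_refresh=True)
```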
I'm willing to contribute further on this issue; feel free to assign me to make any changes.
Context
https://airbytehq.slack.com/archives/C06FZ238P8W/p1710320283398979