Closed: dheeraj-crl closed this issue 6 months ago
@dheeraj-crl thanks for reporting, this seems quite important. It might be that one of the active-record tests we are ignoring covers this (we don't have full active-record test suite coverage yet). If you could find a way to generate an example using the Rails bug report template, this would considerably improve the time to find the bug's origin. Maybe you can run a console in one of your failing QA clusters and isolate a specific part of your schema?
I'll look into it soon as well!
Hey @BuonOmo, this was discovered by me; CockroachDB support created the issue. I'll take a look at the link for the bug report template and see if I can expedite your ability to reproduce.
@BuonOmo do you know if there is a template with a CRDB setup already? Maybe I should just branch off this repo?
Ah, actually, this might be impossible to reproduce in a test environment. I think it might only occur when connected to a real cluster. Like you said, I need to run it in a console in the QA cluster, and can use that template as a guide.
I guess you found a way to adapt the template. Otherwise I'm always using the `bin/console` script as a basis for my tests; it may help you as well.
Those errors that occur only in the prod env are a pain... And without your context I might have a really hard time reproducing! I'll see (hopefully next week) if I can get access to clusters for testing. Your help is still very appreciated, though!
Ok, I was able to reproduce with something similar to the template. Here is what I ran:

```ruby
Rails.logger.level = 0
ActiveRecord::Base.establish_connection(:cockroachdb)

ActiveRecord::Schema.define do
  create_table :posts, force: true do |t|
    t.string :other_id
  end
end

class Post < CockroachDbApplicationRecord
end

Post.create_or_find_by!(other_id: '123')
```
This is what my `CockroachDbApplicationRecord` looks like:

```ruby
class CockroachDbApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  self.implicit_order_column = :created_at

  connects_to database: { writing: :cockroachdb, reading: :cockroachdb }
end
```
And this is what my `database.yml` looks like (notably, `use_follower_reads_for_type_introspection` is `true`):

```yaml
production:
  cockroachdb:
    adapter: cockroachdb
    username: ***
    password: ***
    host: ***
    port: ***
    encoding: utf8
    strict: true
    migrations_paths: db/cockroachdb_migrate
    sslrootcert: ***
    use_follower_reads_for_type_introspection: true
    schema_cache_path: db/cockroachdb_schema_cache.yml
    lifetime: 1800
    hijack:
      namer: AfCrdb::DbNamer
    database: ***
```
Table locality must not matter, because it's clearly the default from the script that creates the posts table.
Based on what you said, I tried this script locally:

```ruby
require "active_record"

ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Base.configurations = {
  "cockroachdb" => {
    adapter: "cockroachdb",
    host: "localhost",
    port: 26257,
    user: "root",
    database: "ar_crdb_console",
    use_follower_reads_for_type_introspection: true
  }
}
ActiveRecord::Base.establish_connection(:cockroachdb)

ActiveRecord::Schema.define do
  create_table :posts, force: true do |t|
    t.string :other_id
  end
end

class CockroachDbApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  self.implicit_order_column = :created_at

  connects_to database: { writing: :cockroachdb, reading: :cockroachdb }
end

class Post < CockroachDbApplicationRecord
end

Post.create_or_find_by!(other_id: '123')
```
It doesn't reproduce any error for me. Could you try it in your environment and, if it works, describe your CRDB setup? Note that to run this script you must have `active_record` and this adapter in your `Gemfile`, and run it with `bundle exec`.
You have to run it against an actual cluster; the error doesn't occur locally. I also don't get an error when running your script locally, but I do get it when running against a cluster.
> You have to run it against an actual cluster, the error doesn't occur locally.

Could you tell me more about your cluster settings?
```
plan_type: Dedicated standard
cloud: aws
capacity: 9 nodes, 4 vCPUs, 35 GiB disk
regions: us-east-1, us-east-2, us-west-2
Version: v23.1.15
```
Hi @grantpaulson6,
We've been working on the issue recently. Unfortunately, even in a cluster there is still no reproduction on my side. The cluster is fairly similar to yours, so the error is more likely to hide in configurations or machine state. FYI, here's our cluster configuration:
```
plan_type: Dedicated standard
cloud: aws
capacity: 9 nodes, 2 vCPUs, 15 GiB disk
regions: us-east-1, us-east-2, us-west-2
Version: v23.1.14
```
With that said, I've updated the script to generate the list of raw queries. Could you run the new script and paste the output of the `sql-trace` file (there should be no need to redact anything, but I'd advise checking just in case)? We could compare that with the output I had and see if there is a difference at that level.
I'll also compare the output of that script with both the current adapter version (7.1) and 7.0 to see if we can spot an important change.
The `cockroach_db_adapter_extensions` in the line in question is a monkey patch that we use to set our target db name. It looks like this:

```ruby
def execute_and_clear(sql, *args, **opts)
  @hijacker&.hijack_connection
  super
end
```

where line 19 is the `super` call.
Which shows that line 592, invoking the method `add_pg_decoders`, is the originator of the SQL that triggers the error.
I was able to reproduce by running those raw SQL commands in the CockroachDB SQL client connected to the cluster.
The main difference I see is that in the new version the transaction is started before we do the `pg_type` queries.
@grantpaulson6 you're absolutely right, and it seems to be the culprit as well. You have a transaction in which there are both AOST and non-AOST queries. One great thing would be to have the full stacktrace for the `begin` call in the latest code version. BTW, what is the precise commit of the adapter you're working with (and the commit of Rails)?
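The mix described here can be sketched as a statement sequence. This is a hypothetical reconstruction in plain Ruby (the query texts are illustrative, not the adapter's actual statements):

```ruby
# Hypothetical reconstruction of the failing statement order.
# CockroachDB rejects mixing AS OF SYSTEM TIME (AOST) reads with
# ordinary reads inside the same explicit transaction, which matches
# the "inconsistent AS OF SYSTEM TIME timestamp" error seen here.
failing_sequence = [
  "BEGIN",
  # A plain read pins the transaction's timestamp...
  "SELECT * FROM posts WHERE other_id = '123'",
  # ...so a follower-read introspection query at a different
  # timestamp is then rejected.
  "SELECT oid, typname FROM pg_type AS OF SYSTEM TIME follower_read_timestamp()",
]
failing_sequence.each { |sql| puts sql }
```

The hint in the error message (`try SET TRANSACTION AS OF SYSTEM TIME`) points at the same constraint: all reads in one transaction must share a single timestamp.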
```
activerecord-cockroachdb-adapter (7.1.0)
rails (7.1.2)
```
Note: some of the stack traces may differ because I've run these on different applications, instead of continually changing deployed code. But both apps have identically reproducible issues when on 7.1.
What is the function of adding `materialize_transactions: false` here? Maybe that's related.
There is still some af_crdb code in the stack trace. I'd be curious what is in the `#configure_connection` method over there (`/home/code/vendor/bundle/ruby/3.2.0/gems/af_crdb-1.11.0/lib/af_crdb/active_record_extensions/cockroach_db_adapter_extensions.rb:49`).
Also, I think the culprit creating the transaction is af_job (`/home/code/vendor/bundle/ruby/3.2.0/gems/af_job-40.1.0/lib/af_job/models/queue_txn_hook.rb:9`, in `transaction`). Which actually makes sense.
For your own database, I would advise doing the setup work beforehand (usually before forking processes or threads). This should actually solve the issue here, as it seems the setup is done within a transaction, where it shouldn't be.
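A minimal sketch of that advice, assuming a standard Rails app (the initializer file name and the use of `verify!` as a warm-up are my own illustration, not something from this thread):

```ruby
# config/initializers/warm_db_connection.rb (hypothetical)
Rails.application.config.after_initialize do
  # Check out and verify a connection once at boot, outside of any
  # transaction, so the adapter's one-time type-introspection work
  # (the AOST pg_type queries) runs here rather than inside the
  # first application transaction.
  ActiveRecord::Base.connection_pool.with_connection(&:verify!)
end
```

In a forking server, the same warm-up would go in the worker-boot hook so each worker introspects types before serving requests.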
So your early guess was right:

> Weird thing is, not all rails models trigger it when calling create_or_find_by!. My guess is that a unique index or foreign key might have something to do with which models trigger it? Hard to discern a pattern.
With all that said, even though I think refactoring this bit of your codebase would be for the best (better response time, more memory efficient), I'd still agree that it is a pain that we have an error here.
Trying to work out a reproduction, but I still cannot get the same error as you. Even if I try to run everything within a transaction, there is still no failure... I'm more and more thinking that one of your af_ extensions is somehow generating a transaction, running a query that doesn't contain AOST, and then calling the type loading (which does contain AOST and in turn causes the issue). Is there a way I could access those files? (Maybe by email: buonomo dot ulysse at gmail dot com.)
The `materialize_transactions: false` is what's in the original Rails method; I try to stick to Rails behaviour as much as possible. As to why it was changed in Rails, it comes from this commit, and from my understanding it is used to make the queries retryable (without having to retry a full transaction) in case of failure. Could you run a branch without these clauses to see if this is where the error comes from, maybe?
There are still some unanswered questions:
In the meantime, I think a hotfix for you would be to overwrite the two methods that use an AOST query:

```ruby
module AOSTHotfix
  %i(load_additional_types add_pg_decoders).each do |method_name|
    define_method(method_name) do |*args, **kwargs|
      if @config[:use_follower_reads_for_type_introspection] && transaction_open?
        execute("SET TRANSACTION AS OF SYSTEM TIME '-10s'")
        method(method_name).super_method.super_method.call(*args, **kwargs)
      else
        # Bare `super` is not allowed inside `define_method`;
        # arguments must be passed explicitly.
        super(*args, **kwargs)
      end
    end
  end
end

ActiveRecord::ConnectionAdapters::CockroachDBAdapter.prepend AOSTHotfix
```

Unfortunately, as I still do not have a reproduction, I couldn't try this hotfix. But it should work!
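For anyone unfamiliar with the mechanism this hotfix relies on: `Module#prepend` puts the module in front of the class in the ancestor chain, so the module's method runs first and `super` falls through to the original. A self-contained sketch with made-up names (no database involved):

```ruby
# Plain-Ruby illustration of Module#prepend.
class Adapter
  def load_types
    "types loaded"
  end
end

module Hotfix
  # Runs instead of Adapter#load_types; `super` calls the original.
  def load_types
    "SET TRANSACTION first, then " + super
  end
end

Adapter.prepend Hotfix

puts Adapter.new.load_types
# => "SET TRANSACTION first, then types loaded"
```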
Sorry for the duplicate, but I think the stacktraces you sent are not full, could it be? (The root is always `create_or_find_by!` and I don't see the parent script.) How did you generate those?
I'll get back to you soon, have had other stuff going on
I'm also happy to zoom/pair on this if you're up for it
> I think the stacktraces you sent are not full, could it be ?

You're correct, I dropped a bit of the stack trace; it had stuff like `kernel` that didn't seem useful.
> I'd be curious of what is in the #configure_connection method

`configure_connection` in af_crdb is below. The line in the stack trace is the `super` call, so I don't think it's responsible.

```ruby
def configure_connection
  @hijacker = AfRuntime::Db::HijackerBase.create(connection: self, config: @config)
  super
end
```
FYI, the HijackerBase is ultimately responsible for calling `USE <database name>` with the correct database name.
> Also I think the culprit of creating the transaction is af_job

I don't think af_job is responsible either; that class in the stacktrace is just setting a Thread variable. The line in the trace is the `super` call. And there is no difference in this code between 7.0 and 7.1.
> I would advise doing the setup work beforehand

We already have a schema cache set up, which has all table/column info loaded into memory, so it's unclear to me why this `pg_type` query is occurring in the first place.
> Is there a way I could access to those files?

I think pairing on this would be the easiest and fastest way to share more detailed information.
I see you have a branch; I'll test it on that.
Well, I now think I was wrong about needing to be connected to an actual cluster. The reason it wasn't triggering the error locally is that the transaction wasn't beginning before the `as of system time` query, like it is in the QA env. So, for whatever reason, our QA environment config on Rails 7.1 is causing that transaction to begin early.
@grantpaulson6
> I'm also happy to zoom/pair on this if you're up for it

Email me to see when; this may be way more efficient indeed...
Bit of a breakthrough: I just discovered that the transaction starts early when using a standalone MySQL database as well (using the same steps as the reproduction script). So it's not that CockroachDB is causing the transaction to start early; the Rails 7.1 upgrade changed that behavior. It's just that CockroachDB's timestamp consistency feature errors out, while MySQL doesn't have such a feature and doesn't care.
@grantpaulson6 nice! There are still two important points for me here:
@BuonOmo it's not anything to do with our codebase. Just create a vanilla Rails app. I reproduced with the following steps:

```shell
rails new repoducebugpg --database=postgresql
rails db:create
```
Then, from the Rails console:

```ruby
Rails.logger.level = 0
ActiveRecord::Base.establish_connection(:primary)

ActiveRecord::Schema.define do
  create_table :posts, force: true do |t|
    t.string :other_id
  end
end

class Post < ApplicationRecord
end

class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter
  def log(sql, name = "SQL", binds = [], type_casted_binds = [], statement_name = nil, async: false, &block)
    puts "------ #{name&.chomp || "anonymous query"} ------\n--\n#{sql.chomp};\n\n"
    lines = caller_locations
    lines.each { |line| puts line }
    super
  end
end

Post.create_or_find_by!(other_id: '123')
```
And you can observe that the transaction starts before the type introspection query.
@grantpaulson6 I still do not reproduce... For me, all of the queries that have an AOST are run before the transaction.
```shell
rails new repoducebugpg --database=postgresql
# change pg for activerecord-cockroachdb-adapter in the Gemfile
# change database.yml's default section (see below)
bundle
bin/rails db:create
bin/rails console
# paste and run
```
And the `database.yml`:

```yaml
default: &default
  adapter: cockroachdb
  port: 26257
  host: localhost
  user: root
  encoding: unicode
  use_follower_reads_for_type_introspection: true
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
```
Even with a cluster configuration:

```yaml
default: &default
  adapter: cockroachdb
  port: 26257
  host: <%= ENV.fetch("COCKROACH_HOST") { "localhost" } %>
  user: <%= ENV.fetch("COCKROACH_USER") { "root" } %>
  password: <%= ENV.fetch("COCKROACH_PASSWORD") { "" } %>
  encoding: unicode
  use_follower_reads_for_type_introspection: true
  # For details on connection pooling, see the Rails configuration guide
  # https://guides.rubyonrails.org/configuring.html#database-pooling
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  sslrootcert: <%= ENV.fetch("COCKROACH_SSL") { "" } %>

development:
  <<: *default
  database: <%= ENV.fetch("COCKROACH_DATABASE") { "repoducebugpg_development" } %>
```
I still do not reproduce the issue... Also, your logs show Ruby 2.7 and Rails 7.0; is that normal?
I'm running with Rails 7.1 now and actually connecting to a CRDB cluster from a vanilla Rails app. Question, though: how are you loading the CockroachDBAdapter? It appears that it isn't loaded automatically by the gem. Our in-house gem af_crdb explicitly requires it (`require 'active_record/connection_adapters/cockroachdb_adapter'`), but as I'm trying to boot up a vanilla Rails app, just adding the crdb gem doesn't actually load the adapter class. And then, when I add in that require statement, the adapter used is still `PostgreSQLAdapter`, not `CockroachDBAdapter`.
Also, if you could share the logs for your tests too, that would be helpful.
Scratch that; it was a config issue on my end. I had used the key `url` in the db config to set most of the important connection bits for our QA cluster, and it was overriding my adapter name.
Anyway, I was able to reproduce the issue with a vanilla Rails app on Rails 7.1 connected to a cluster. I'm going to create a repo and share it with you so you can use it to reproduce.
@BuonOmo this is the repo to reproduce, check out the README. I've added you as a collaborator on the repo.
@grantpaulson6 thank you, I can finally reproduce! I don't know yet what differs from the repo I made, but I'll find it. Now that I have a local reproduction, this should not be so hard to fix!
🔝
@grantpaulson6 this should be it! It passes your reproduction :) If you're able to try it on your real case, that'd be nice.
the fix worked! 🎉
Your Ruby gem activerecord-cockroachdb-adapter had a recent release to support Rails 7.1. When using it with the setting `use_follower_reads_for_type_introspection = true`, the error:

> PG::FeatureNotSupported: ERROR: inconsistent AS OF SYSTEM TIME timestamp; expected: 1708547400.494068015,0, got: 1708547390.502956628,0 HINT: try SET TRANSACTION AS OF SYSTEM TIME

is raised from this line: https://github.com/cockroachdb/activerecord-cockroachdb-adapter/blob/9506aa79785e7e945ae86bc6579ce7330e38635e/lib/active_record/connection_adapters/cockroachdb_adapter.rb#L592

No application queries can execute; the error is triggered before they can run.

When setting `use_follower_reads_for_type_introspection = false`, there isn't any error.

More info: the error doesn't occur locally; it occurs using our QA cluster. And it appears the method that triggers it is `create_or_find_by!`. I think the error doesn't occur locally because the local db is not multi-region and has different table localities. I confirmed the error behavior in another application of ours too.

Weird thing is, not all Rails models trigger it when calling `create_or_find_by!`. My guess is that a unique index or foreign key might have something to do with which models trigger it? Hard to discern a pattern.

One other note: if the error doesn't occur first (isn't the first query run), it doesn't occur at all. Meaning, if I run some other query, then the offending `create_or_find_by!`, it doesn't throw an error. Conversely, when the error does occur, all subsequent queries will trigger an error. This error is different: