Closed nikz closed 3 years ago
Hi Nik. This is an incredibly helpful synopsis.
Although we autogenerate the majority of the code, the PR process is still straightforward and can be ported to the origin repo. Would be happy to facilitate that for you, as well as collaborate on the fix.
The rubygems prompt to upgrade also a good idea. I've not heard or seen this reported yet, so I will dig in this week, and keep my eyes out for any PR suggestions you might already have :)
Hi @SerKnight, no problems.
Let me know the next steps in terms of a PR - suggestion would be to split the concept of Configuration (e.g urls, loggers, timeouts and the like) from Credentials (which should be per-instance). For instance, taking these lines and either popping them into their own object or just moving them to the ApiClient
class (IMHO the latter seems more straightforward, but I also assume there's a reason they are sitting globally now that I might be missing?)
Happy to send that as a PR but since it's a fairly major change (although not breaking afaict) I'd like your go-ahead first.
Definitely one that would be hard to spot if it was happening - I would say anyone who is running periodic syncs of many customers with Sidekiq might see it, or anyone doing things at high scale.
I think moving to ApiClient is most maintainable and follows some existing patterns. When you have a working concept PR I will get it integrated to the OpenAPI codegen repo, and tag you for a review as well. Will try to add any edge case tests around it as well.
Just thinking about this more though, the access_token is a JWT that is tied to the permissions in relation to a specific tenant. Worst case scenario I could see is a an incorrect user token gets set on the client, and an un-authorized 401
is triggered from the lack of combo permissions with the provided 'tenant-id'.
Nonetheless - 100% agree after seeing your sample code we should address.
Just thinking about this more though, the access_token is a JWT that is tied to the permissions in relation to a specific tenant. Worst case scenario I could see is a an incorrect user token gets set on the client, and an un-authorized 401 is triggered from the lack of combo permissions with the provided 'tenant-id'.
Ah yeah, that is a good point - protection given tenant IDs are UUIDs too. Reduces the bug surface area further because only non-tenant specific API calls would be affected (i.e getting tenant connections, key revocation, etc.). Like, in theory you could end up with the key swapped for the /connections
call and then end up with the wrong tenant list AND then the wrong data but it starts to be a vanishingly small chance of going wrong π
Any updates on this? I would class it as a fairly serious bug, and we have already experienced data pollution in our customer's integrations when our multi-threaded sync routine experienced token collision but still pulled the data from Xero. This looks like a major oversight by the development Q/A team!
Hey @CyberFerret - my sentiment is that we landed that because tenant id to required for every single API call that this was less of a vulnerability and more of an improvement that needs to be addressed for folks leveraging batch/bg processing.
Can you elaborate on the situation where you pulled/polluted data? Are you using the same instance of the XeroRuby::ApiClient.new(credentials: creds)
for many unrelated API calls to orgs and setting xero_client.set_token_set(user.token_set)
each job/process?
Def not saying this isn't a serious issue, its just the first its come up so I really want to understand the problem from the production system usage of you two.
In short of passing the access_token
in addition to the tenant_id
to every api call, not sure how this can be fixed without a major breaking change.
Lets discuss how this is cropping up more specifically and think our way around a breaking change π€
For Rubyists this really is a serious flaw and I personally wouldn't recommend using this library in its current state in a multi-threaded environment (which today is practically all production ruby environments).
After a brief further look through the code, this boils down to naive use of a global configuration object for instance-specific configuration.
I've outlined the flaws below and recommended quick-fixes that would solve the major issues. Should be relatively simple and quick fixes for the Xero API client team.
XeroClient
instances regardless of thread.set_access_token
will update the access token across all instances of ApiClient
regardless of thread.api_client.payroll
) modifies an effectively-global variable that ApiClient
uses to construct URLs.Points 1 & 2 mean that there is serious risk of READING & WRITING someone else's data from the incorrect Xero file in a multi-threaded environment
Global configuration object: https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/configuration.rb
Client instance configuration is applied here: https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L33-L54
This modifies the global Configuration
instance, thus affecting all instances of ApiClient
(regardless of thread).
Access Tokens are applied here via ApiClient, updating the global Configuration object:
https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L106-L115 https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby.rb#L463-L473
Retrieving a new token from the API also directly updates these global variables via above methods: https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L141-L155
Calling API-specific objects updates the global base URL used to construct request URLs.
# Example of race condition:
accounting = api_client.accounting
payroll = api_client.payroll
# Will fail, as contracted URL will use `Payroll` specific URL:
# Race conditions will impact this across multiple threads.
accounting.get_account(...)
ApiClient
instance should have its own unique Configuration
instance which it can modify safely.Change to and implement:
config = Configuration.new.copy_from(Configuration.default)
XeroRuby.configure
in ApiClient
instances to instead refer to @config
instance directly.# Change calls to instance specific config:
@config.access_token
base_url
to an option on call_api
, rather than modifying @config.base_url
in preparation for requests.
https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L176# eg...
def call_api(...)
# ...
connection = Faraday.new(:url => (opts[:base_url] || config.base_url), :ssl => ssl_options) do |conn|
# ...
# Need to modify build_request too to use opts[:base_url]
def build_request_url(path, base_url = @config.base_url)
To limit modifications to ApiClient
only, consider using a decorator for configuring API grouping objects: eg.
# API collection objects...
def payroll_nz_api
XeroRuby::PayrollNzApi.new(WrappedClient.new(self, @config.payroll_nz_url))
end
# Automatically infer the base_url for any call_api call made by API Collection objects.
WrappedClient < SimpleDelegator
def initialize(client, base_url)
@base_url = base_url
super(client)
end
def call_api(*args, **opts)
__getobj__.call_api(*args, { base_url: @base_url }.merge(opts))
end
end
Just to reiterate the seriousness regardless of some API calls requiring tenant_id
:
ApiClient#connections
. Thus, it is possible that another user's connections could be synced and stored. For customers with only a single Xero File connected, the application may smartly imply the tenant to use. This flaw could certainly result in the exposing of another customers Xero data under certain scenarios and race conditions. ApiClient#access_token
. Thus, another user's access token could be persisted in a race-condition.refresh_token
. As above, a race-condition could result in storing another users access token.Hey @CyberFerret - my sentiment is that we landed that because tenant id to required for every single API call that this was less of a vulnerability and more of an improvement that needs to be addressed for folks leveraging batch/bg processing.
Can you elaborate on the situation where you pulled/polluted data? Are you using the same instance of the
XeroRuby::ApiClient.new(credentials: creds)
for many unrelated API calls to orgs and settingxero_client.set_token_set(user.token_set)
each job/process?Def not saying this isn't a serious issue, its just the first its come up so I really want to understand the problem from the production system usage of you two.
In short of passing the
access_token
in addition to thetenant_id
to every api call, not sure how this can be fixed without a major breaking change.Lets discuss how this is cropping up more specifically and think our way around a breaking change π€
The problem in our system is twofold - When we were approved for partner status for our app, we were told that one non negotiable aspect of our partner agreement was that we would have to check for a broken API connection should the Xero user terminate the OAuth connection from the Xero end and note that in our app and inform our users. Because Xero (for some inexplicable reason) does not have the capability to send a webhook POST to our app when this disconnection event happens, we had to resort to polling the OAuth connections every hour to check that the connection was still valid.
This polling is run on a background worker server that spawns a separate thread for each Xero connected company to run the poll. Each polling process does a XeroRuby::ApiClient.new(credentials: creds)
to instantiate the xero_client
object, then does a xero_client.refresh_token_set(company.token_set)
using the old token sets that are stored in the Company model.
To validate that we have a connection to Xero, we then do a loop through the xero_client.connections
to pull the tenantID
for all the connections (since we now support one of our companies linked to multiple Xero companies via OAuth2).
For each connection, we do a xero_client.accounting_api.get_organisations(tenantId).organisations.first
to pull the Organisation name. If successful, we deem the connection to be still active, and we re-save the company name and tenantID in our Company model as the list of companies that are synced with our system.
As part of the polling routine, we re-save the current refresh_token into the Company model as well, ready for the next call <-- THIS IS THE CRITICAL BIT I believe.
Then, we have a separate routine that runs periodically which looks at the list of synced Xero companies for each of our customers, and runs an import routine (also multi-threaded).
From our audit, it looks like when the connection polling was run, the collision of access token across multiple threads meant that when the xero_client.connections
and the xero_client.accounting_api.get_organisations(tenantId).organisations.first
was pulling the list of companies from another, completely separate Xero user (Company A), and then saving the other users list of Xero tenantIDs and company names in the Company on a different thread (Company B).
Hence, the refresh_token from the other company is also saved into the Company B company model in a different thread.
This meant that when the second poll was done to sync the payroll data itself, when Company B goes to sync, the refresh_token and list of tenantIDs from Company A was used, resulting in Company A's employees being imported into Company B.
As you can appreciate, this is a serious data breach, and we are in discussion with our customers now about this. Because we have customers in Europe and NZ etc., we have to report this data breach as part of their privacy laws, and this can result in serious ramifications for us..
It should be noted that we used this exact method with the old OAuth 1 protocol, and plain JSON data with absolutely no issues for over 2 years. The new Xero gem, which was purported to be better, has been the cause of this issue.
@CyberFerret Your scenario is essentially the one I proposed above where the connections
of tenants are requested dynamically on an ongoing basis.
In principle, I see nothing nothing wrong with this approach - but certainly impacted by race-conditions with the flaws identified.
Just a quick comment, without seeing your code, it sounds like you may be able to mitigate the fundamental problem temporarily by instead of deferring to api_client.connections.map { |t| t['tenantId'] }
for the tenantIds, rather, in your background worker, use the existing ids persisted in your database for that connection and iterate over those.
@armstrjare I just saw your post soon after I posted mine above, and you hit the nail on the head. Thanks for your suggestion on mitigating - we may give that a shot. Right now, we have had to change our sync routine to not use threads, but to run each company refresh synchronously.
That fact, combined with the absurdly low 60 requests per minute API limiting that Xero stipulates, means that refreshing thousands of employees for our customers now takes nearly an hour to run, because we can no longer trust the concurrently threaded processes to be reliable as before.
We've never had issues like these with other payroll vendors we integrate with. Really gives me the impression that Xero are a kindergarten level system that is not in the same ballpark as most other serious payroll vendors!
@CyberFerret Another option might be to monkey patch XeroRuby::Configuration
to have a per-thread default object. This doesn't fix underlying design but would solve the immediate security risk with things like Sidekiq jobs etc.
# Monkey-patch for naive thread "safety"
def XeroRuby::Configuration.default
Thread.current[:xero_ruby__configuration_default] ||= new
end
We've never had issues like these with other payroll vendors we integrate with. Really gives me the impression that Xero are a kindergarten level system that is not in the same ballpark as most other serious payroll vendors!
I'm sure they're doing their best, but does seem like something slipped through the cracks in their effort to push first-party API clients. I think the issue is possibly related to the code-generation they're doing. I haven't looked into it. I agree it is surprising this slipped through because it should have stuck out pretty obviously during code review as something that's not right with the way the client is architected and as something likely to cause potential issues.
Hey all. Lots to digest here. Really helpful explanation. Iβm going to chew through all your suggestions again and put a pull request together as soon as I can.
Hey everyone. Doing some final testing of the PR but @armstrjare 's suggestions were fantastic. Worked through all areas this could pose an issue.
Releasing this asap as its not breaking and crucial issue, but please have a look at the PR and we can iterate on any other improvements or tests that would benefit the work.
@nikz in addition to adding thorough specs I extended your live example to show the fix.
https://github.com/XeroAPI/xero-ruby-oauth2-app/pull/40/files
Cleaned up logs:
Fixed in 2.9.1
release
π LGTM, nice work @SerKnight - only suggestion I'd make is to look at flagging the gem version so that folks get a Dependabot or similar PR to update and don't get surprised :)
Thank you for the quick work to fix this issue @SerKnight. Special thanks also to @armstrjare for the detailed breakdown of the issue and for his super useful tips for us that we used to mitigate the problem while waiting for the patch. Apologies also to everyone for my rather testy observations in my comments above, but my stress levels were pretty sky high when I saw that we were facing data breach repercussions caused by this bug. Hopefully it will all be smooth sailing from here on.
π LGTM, nice work @SerKnight - only suggestion I'd make is to look at flagging the gem version so that folks get a Dependabot or similar PR to update and don't get surprised :)
Yeah you mentioned that prior. Iβm not familiar with flagging a gem vsn but will do some reading and figure that out tomorrow.
@CyberFerret not a worry! Apologies you experienced the bug and glad we could collaborate to get it fixed. Issues like this donβt get solved without the sharing of hard learnings. Dm me @serknight_ Twitter Iβd like to send a token of appreciation.
Hi all! I just participated laterally and with a minor impact on this issue (2.9.0 not working due to a missing timeout config value). Thank you all guys and thank you @SerKnight for your kind answer to a minor impact within the big issue you where solving. @CyberFerret I really hope for no serious impact in your situation. Thanks
Hi folks,
@armstrjare pointed me toward a threadsafety issue with this gem, I've since PoC'd and confirmed it.
Problem
Essentially, lines like this set the current Access Token as a class (global) variable, and therefore across all threads.
This is a problem for threaded app servers (such as Puma) or background job processors (like Sidekiq) as their current access token could be swapped mid-processing by another thread, resulting in a potential issue where data from one Xero Tenant is exposed to another.
Proof of Concept
Here's the issue in action:
And some hacky sample code
Mitigation
Let me know if there's any way we can help!
Thanks,
Nik