XeroAPI / xero-ruby

Xero Ruby SDK for OAuth 2.0 generated from XeroAPI/Xero-OpenAPI
http://developer.xero.com/
MIT License
57 stars 91 forks source link

Access token config is not thread safe, this could lead to a race condition where the wrong access token is used #145

Closed nikz closed 3 years ago

nikz commented 3 years ago

Hi folks,

@armstrjare pointed me toward a threadsafety issue with this gem, I've since PoC'd and confirmed it.

Problem

Essentially, lines like this set the current Access Token as a class (global) variable, and therefore across all threads.

This is a problem for threaded app servers (such as Puma) or background job processors (like Sidekiq) as their current access token could be swapped mid-processing by another thread, resulting in a potential issue where data from one Xero Tenant is exposed to another.

Proof of Concept

Here's the issue in action:

output

And some hacky sample code

Mitigation

  1. Configuration that could change per-thread should probably be on an instance of some kind rather than the XeroRuby module (or otherwise limited to the current thread). I'm happy to submit a PR to that effect, however given it would appear some of this code is autogenerated I'm not sure of the best way to do that.
  2. It would also be great to report a vulnerability to Rubygems so that folks are prompted to upgrade their gem versions (a number of people are potentially affected and this could lead to some unexpected information disclosure)

Let me know if there's any way we can help!

Thanks,

Nik

SerKnight commented 3 years ago

Hi Nik. This is an incredibly helpful synopsis.

Although we autogenerate the majority of the code, the PR process is still straightforward and can be ported to the origin repo. Would be happy to facilitate that for you, as well as collaborate on the fix.

The rubygems prompt to upgrade also a good idea. I've not heard or seen this reported yet, so I will dig in this week, and keep my eyes out for any PR suggestions you might already have :)

nikz commented 3 years ago

Hi @SerKnight, no problems.

Let me know the next steps in terms of a PR - suggestion would be to split the concept of Configuration (e.g urls, loggers, timeouts and the like) from Credentials (which should be per-instance). For instance, taking these lines and either popping them into their own object or just moving them to the ApiClient class (IMHO the latter seems more straightforward, but I also assume there's a reason they are sitting globally now that I might be missing?)

Happy to send that as a PR but since it's a fairly major change (although not breaking afaict) I'd like your go-ahead first.

Definitely one that would be hard to spot if it was happening - I would say anyone who is running periodic syncs of many customers with Sidekiq might see it, or anyone doing things at high scale.

SerKnight commented 3 years ago

I think moving to ApiClient is most maintainable and follows some existing patterns. When you have a working concept PR I will get it integrated to the OpenAPI codegen repo, and tag you for a review as well. Will try to add any edge case tests around it as well.


Just thinking about this more though, the access_token is a JWT that is tied to the permissions in relation to a specific tenant. Worst case scenario I could see is a an incorrect user token gets set on the client, and an un-authorized 401 is triggered from the lack of combo permissions with the provided 'tenant-id'.

Nonetheless - 100% agree after seeing your sample code we should address.

nikz commented 3 years ago

Just thinking about this more though, the access_token is a JWT that is tied to the permissions in relation to a specific tenant. Worst case scenario I could see is a an incorrect user token gets set on the client, and an un-authorized 401 is triggered from the lack of combo permissions with the provided 'tenant-id'.

Ah yeah, that is a good point - protection given tenant IDs are UUIDs too. Reduces the bug surface area further because only non-tenant specific API calls would be affected (i.e getting tenant connections, key revocation, etc.). Like, in theory you could end up with the key swapped for the /connections call and then end up with the wrong tenant list AND then the wrong data but it starts to be a vanishingly small chance of going wrong πŸ˜‚

CyberFerret commented 3 years ago

Any updates on this? I would class it as a fairly serious bug, and we have already experienced data pollution in our customer's integrations when our multi-threaded sync routine experienced token collision but still pulled the data from Xero. This looks like a major oversight by the development Q/A team!

SerKnight commented 3 years ago

Hey @CyberFerret - my sentiment is that we landed that because tenant id to required for every single API call that this was less of a vulnerability and more of an improvement that needs to be addressed for folks leveraging batch/bg processing.

Can you elaborate on the situation where you pulled/polluted data? Are you using the same instance of the XeroRuby::ApiClient.new(credentials: creds) for many unrelated API calls to orgs and setting xero_client.set_token_set(user.token_set) each job/process?

Def not saying this isn't a serious issue, its just the first its come up so I really want to understand the problem from the production system usage of you two.


In short of passing the access_token in addition to the tenant_id to every api call, not sure how this can be fixed without a major breaking change.

Lets discuss how this is cropping up more specifically and think our way around a breaking change πŸ€”

armstrjare commented 3 years ago

For Rubyists this really is a serious flaw and I personally wouldn't recommend using this library in its current state in a multi-threaded environment (which today is practically all production ruby environments).

After a brief further look through the code, this boils down to naive use of a global configuration object for instance-specific configuration.

I've outlined the flaws below and recommended quick-fixes that would solve the major issues. Should be relatively simple and quick fixes for the Xero API client team.

Key Flaws

  1. Any token request call to retrieve a new token via the api will update the globally configured active access token across all active XeroClient instances regardless of thread.
  2. Likewise, the application of a stored access token via set_access_token will update the access token across all instances of ApiClient regardless of thread.
  3. In a multi-threaded environment, the incorrect API URL may be used for different APIs due to race-conditions, as calling the API-specific sub-object (eg. api_client.payroll) modifies an effectively-global variable that ApiClient uses to construct URLs.

Points 1 & 2 mean that there is serious risk of READING & WRITING someone else's data from the incorrect Xero file in a multi-threaded environment

Problem Code/Design

Global configuration object: https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/configuration.rb

Client instance configuration is applied here: https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L33-L54

This modifies the global Configuration instance, thus affecting all instances of ApiClient (regardless of thread).

Access Tokens are applied here via ApiClient, updating the global Configuration object:

https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L106-L115 https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby.rb#L463-L473

Retrieving a new token from the API also directly updates these global variables via above methods: https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L141-L155

Calling API-specific objects updates the global base URL used to construct request URLs.

https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L82-L86

# Example of race condition:
accounting = api_client.accounting
payroll = api_client.payroll

# Will fail, as contracted URL will use `Payroll` specific URL:
# Race conditions will impact this across multiple threads.
accounting.get_account(...)

Recommendations

  1. Clone default Configuration instance upon ApiClient creation. Each ApiClient instance should have its own unique Configuration instance which it can modify safely.

https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L51

Change to and implement:

config = Configuration.new.copy_from(Configuration.default)
  1. Replace calls to XeroRuby.configure in ApiClient instances to instead refer to @config instance directly.

https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L103

# Change calls to instance specific config:
@config.access_token
  1. Extract use of base_url to an option on call_api, rather than modifying @config.base_url in preparation for requests. https://github.com/XeroAPI/xero-ruby/blob/e41d75c48bf42b94f601eb360d9cfafd0eb12d95/lib/xero-ruby/api_client.rb#L176
# eg...
def call_api(...)
  # ...
  connection = Faraday.new(:url => (opts[:base_url] || config.base_url), :ssl => ssl_options) do |conn|
  # ...

# Need to modify build_request too to use opts[:base_url]
def build_request_url(path, base_url = @config.base_url)

To limit modifications to ApiClient only, consider using a decorator for configuring API grouping objects: eg.

# API collection objects...
def payroll_nz_api
  XeroRuby::PayrollNzApi.new(WrappedClient.new(self, @config.payroll_nz_url))
end
# Automatically infer the base_url for any call_api call made by API Collection objects.
WrappedClient < SimpleDelegator
  def initialize(client, base_url)
    @base_url = base_url
    super(client)
  end

  def call_api(*args, **opts) 
     __getobj__.call_api(*args, { base_url: @base_url }.merge(opts))
  end
end
armstrjare commented 3 years ago

Just to reiterate the seriousness regardless of some API calls requiring tenant_id:

CyberFerret commented 3 years ago

Hey @CyberFerret - my sentiment is that we landed that because tenant id to required for every single API call that this was less of a vulnerability and more of an improvement that needs to be addressed for folks leveraging batch/bg processing.

Can you elaborate on the situation where you pulled/polluted data? Are you using the same instance of the XeroRuby::ApiClient.new(credentials: creds) for many unrelated API calls to orgs and setting xero_client.set_token_set(user.token_set) each job/process?

Def not saying this isn't a serious issue, its just the first its come up so I really want to understand the problem from the production system usage of you two.

In short of passing the access_token in addition to the tenant_id to every api call, not sure how this can be fixed without a major breaking change.

Lets discuss how this is cropping up more specifically and think our way around a breaking change πŸ€”

The problem in our system is twofold - When we were approved for partner status for our app, we were told that one non negotiable aspect of our partner agreement was that we would have to check for a broken API connection should the Xero user terminate the OAuth connection from the Xero end and note that in our app and inform our users. Because Xero (for some inexplicable reason) does not have the capability to send a webhook POST to our app when this disconnection event happens, we had to resort to polling the OAuth connections every hour to check that the connection was still valid.

This polling is run on a background worker server that spawns a separate thread for each Xero connected company to run the poll. Each polling process does a XeroRuby::ApiClient.new(credentials: creds) to instantiate the xero_client object, then does a xero_client.refresh_token_set(company.token_set) using the old token sets that are stored in the Company model.

To validate that we have a connection to Xero, we then do a loop through the xero_client.connections to pull the tenantID for all the connections (since we now support one of our companies linked to multiple Xero companies via OAuth2).

For each connection, we do a xero_client.accounting_api.get_organisations(tenantId).organisations.first to pull the Organisation name. If successful, we deem the connection to be still active, and we re-save the company name and tenantID in our Company model as the list of companies that are synced with our system.

As part of the polling routine, we re-save the current refresh_token into the Company model as well, ready for the next call <-- THIS IS THE CRITICAL BIT I believe.

Then, we have a separate routine that runs periodically which looks at the list of synced Xero companies for each of our customers, and runs an import routine (also multi-threaded).

From our audit, it looks like when the connection polling was run, the collision of access token across multiple threads meant that when the xero_client.connections and the xero_client.accounting_api.get_organisations(tenantId).organisations.first was pulling the list of companies from another, completely separate Xero user (Company A), and then saving the other users list of Xero tenantIDs and company names in the Company on a different thread (Company B).

Hence, the refresh_token from the other company is also saved into the Company B company model in a different thread.

This meant that when the second poll was done to sync the payroll data itself, when Company B goes to sync, the refresh_token and list of tenantIDs from Company A was used, resulting in Company A's employees being imported into Company B.

As you can appreciate, this is a serious data breach, and we are in discussion with our customers now about this. Because we have customers in Europe and NZ etc., we have to report this data breach as part of their privacy laws, and this can result in serious ramifications for us..

It should be noted that we used this exact method with the old OAuth 1 protocol, and plain JSON data with absolutely no issues for over 2 years. The new Xero gem, which was purported to be better, has been the cause of this issue.

armstrjare commented 3 years ago

@CyberFerret Your scenario is essentially the one I proposed above where the connections of tenants are requested dynamically on an ongoing basis. In principle, I see nothing nothing wrong with this approach - but certainly impacted by race-conditions with the flaws identified.

Just a quick comment, without seeing your code, it sounds like you may be able to mitigate the fundamental problem temporarily by instead of deferring to api_client.connections.map { |t| t['tenantId'] } for the tenantIds, rather, in your background worker, use the existing ids persisted in your database for that connection and iterate over those.

CyberFerret commented 3 years ago

@armstrjare I just saw your post soon after I posted mine above, and you hit the nail on the head. Thanks for your suggestion on mitigating - we may give that a shot. Right now, we have had to change our sync routine to not use threads, but to run each company refresh synchronously.

That fact, combined with the absurdly low 60 requests per minute API limiting that Xero stipulates, means that refreshing thousands of employees for our customers now takes nearly an hour to run, because we can no longer trust the concurrently threaded processes to be reliable as before.

We've never had issues like these with other payroll vendors we integrate with. Really gives me the impression that Xero are a kindergarten level system that is not in the same ballpark as most other serious payroll vendors!

armstrjare commented 3 years ago

@CyberFerret Another option might be to monkey patch XeroRuby::Configuration to have a per-thread default object. This doesn't fix underlying design but would solve the immediate security risk with things like Sidekiq jobs etc.

# Monkey-patch for naive thread "safety"
def XeroRuby::Configuration.default
  Thread.current[:xero_ruby__configuration_default] ||= new
end

We've never had issues like these with other payroll vendors we integrate with. Really gives me the impression that Xero are a kindergarten level system that is not in the same ballpark as most other serious payroll vendors!

I'm sure they're doing their best, but does seem like something slipped through the cracks in their effort to push first-party API clients. I think the issue is possibly related to the code-generation they're doing. I haven't looked into it. I agree it is surprising this slipped through because it should have stuck out pretty obviously during code review as something that's not right with the way the client is architected and as something likely to cause potential issues.

SerKnight commented 3 years ago

Hey all. Lots to digest here. Really helpful explanation. I’m going to chew through all your suggestions again and put a pull request together as soon as I can.

SerKnight commented 3 years ago

Hey everyone. Doing some final testing of the PR but @armstrjare 's suggestions were fantastic. Worked through all areas this could pose an issue.

Releasing this asap as its not breaking and crucial issue, but please have a look at the PR and we can iterate on any other improvements or tests that would benefit the work.

https://github.com/XeroAPI/xero-ruby/pull/152

@nikz in addition to adding thorough specs I extended your live example to show the fix.

https://github.com/XeroAPI/xero-ruby-oauth2-app/pull/40/files

Cleaned up logs: Screen Shot 2021-04-19 at 5 49 27 PM

SerKnight commented 3 years ago

Fixed in 2.9.1 release

https://github.com/XeroAPI/xero-ruby/releases/tag/2.9.1

nikz commented 3 years ago

πŸ‘ LGTM, nice work @SerKnight - only suggestion I'd make is to look at flagging the gem version so that folks get a Dependabot or similar PR to update and don't get surprised :)

CyberFerret commented 3 years ago

Thank you for the quick work to fix this issue @SerKnight. Special thanks also to @armstrjare for the detailed breakdown of the issue and for his super useful tips for us that we used to mitigate the problem while waiting for the patch. Apologies also to everyone for my rather testy observations in my comments above, but my stress levels were pretty sky high when I saw that we were facing data breach repercussions caused by this bug. Hopefully it will all be smooth sailing from here on.

SerKnight commented 3 years ago

πŸ‘ LGTM, nice work @SerKnight - only suggestion I'd make is to look at flagging the gem version so that folks get a Dependabot or similar PR to update and don't get surprised :)

Yeah you mentioned that prior. I’m not familiar with flagging a gem vsn but will do some reading and figure that out tomorrow.

@CyberFerret not a worry! Apologies you experienced the bug and glad we could collaborate to get it fixed. Issues like this don’t get solved without the sharing of hard learnings. Dm me @serknight_ Twitter I’d like to send a token of appreciation.

carlospeix commented 3 years ago

Hi all! I just participated laterally and with a minor impact on this issue (2.9.0 not working due to a missing timeout config value). Thank you all guys and thank you @SerKnight for your kind answer to a minor impact within the big issue you where solving. @CyberFerret I really hope for no serious impact in your situation. Thanks