getsentry / team-ospo

Open Source Program Office (OSPO)
https://open.sentry.io/

Improve self-hosted ⇒ SaaS conversion alongside EU rollout #153

Open azaslavsky opened 1 year ago

azaslavsky commented 1 year ago

Problem

The current conversion rate for users migrating (henceforth referred to as “relocating”, to differentiate it from normal database migrations) from self-hosted is far from ideal: less than 1% of users who enter the funnel successfully become Sentry SaaS customers.

The relocation also takes a heavy toll on ops support, as each relocation must be carried out manually. The volume of these support efforts is expected to increase greatly with the debut of EU-region support in the second half of this year. For users who are already on SaaS Sentry, a similar relocation may need to occur as they leverage the hybrid cloud effort to move regions.

Finally, the current method of relocation, using an ad-hoc, manually executed script as a second “backdoor” method of importing, is untested and difficult to maintain. This has resulted in subtle schema skew bugs that have taken significant effort to fix in the past, and could have been much more damaging had they not been caught quickly.

Goals

There are several goal targets:

Non-Goals

There are a number of potential future improvements we are explicitly not optimizing for in this first pass. This is not to say that we won’t be interested in circling back and implementing them after the relocation pipeline is healthy and running (see Potential Future Work below), just that they are not strictly in scope for the first milestone.

Assumptions

The main assumption is that the current conversion rate in the funnel is primarily blocked on the slowness and difficulty of the relocation. It is possible, though intuitively unlikely, that we make the relocation much easier and conversion rates do not meaningfully increase.

Another assumption is that the organization-merging functionality of the load-it-up.py script is vestigial and not needed by ops, and that we should therefore prefer to just keep the original organizations (modulo changing slugs) when they appear in a backup.
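
As a rough illustration of what "modulo changing slugs" could look like, here is a minimal sketch that keeps an imported organization's slug when it is free and otherwise appends a numeric suffix. The function name `unique_slug` and the suffix scheme are assumptions made for this example; the proposal does not specify how colliding slugs would actually be renamed.

```python
def unique_slug(slug, taken):
    """Return `slug` unchanged if it is free, else append a numeric suffix.

    `taken` is the set of slugs already present on the relocation target.
    The "-2", "-3", ... suffix scheme is an assumption for this sketch.
    """
    if slug not in taken:
        return slug
    n = 2
    while f"{slug}-{n}" in taken:
        n += 1
    return f"{slug}-{n}"


existing = {"acme", "acme-2"}
print(unique_slug("initech", existing))  # slug is free, kept as-is
print(unique_slug("acme", existing))     # collides, gets the next free suffix
```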

Proposal

We propose to do the following:

  1. Write a thorough set of test cases for both backup.py and load-it-up.py. In theory, the import_ method on backup.py should have sufficient flexibility to replicate everything that load-it-up.py does (modulo the merging of orgs, see above), so a good end state is to have both scripts pass the same set of tests.

  2. Modify import_ to use .create() instead of .save(), and to call the serializer’s .validate() method before .create() (we may opt to keep the old functionality behind a self-hosted-only flag). This will make import_ an INSERT-only script, will ensure that data is validated before being ingested, and will prevent any existing data from being modified on the relocation target.

  3. Once we are confident that load-it-up.py can be retired in favor of the import_ flow on backup.py, and that the backup.py import/export functionality can be used on both SaaS and self-hosted, we will add an API endpoint to perform imports for new accounts. This would probably involve importing to some siloed or otherwise protected “import database” and validating the data, before relocating all of that database’s data to the main SaaS database.

  4. Add a screen during on-boarding (post email-verification) that allows users to upload their exported self-hosted JSON backup (note: these could be quite large, so even with user verification in place, we’ll still need to think a bit about resource limits here). This would hit the endpoint described above, and send the user an email when their relocation succeeds, or otherwise notify them that it failed and open a ticket.

  5. In the case of failure (that is, the user uploaded a JSON backup that could not be validated), we will inform the user and automatically open a support ticket on their behalf.
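
The validate-then-create behavior from step 2 can be sketched without any framework. The example below is not the real import_ code: `InMemoryTable`, `validate`, and `import_records` are hypothetical stand-ins, and the `model` / `pk` / `fields` record shape is assumed to resemble the exported JSON. It only shows the two properties the step asks for: every record is validated before anything is written, and an existing row is never overwritten.

```python
from dataclasses import dataclass, field


class ValidationError(Exception):
    """Raised when a record fails validation."""


class AlreadyExistsError(Exception):
    """Raised when an import would overwrite an existing row."""


@dataclass
class InMemoryTable:
    """Hypothetical stand-in for a database table keyed by primary key."""
    rows: dict = field(default_factory=dict)

    def create(self, pk, fields):
        # INSERT-only: refuse to touch a row that is already there.
        if pk in self.rows:
            raise AlreadyExistsError(f"row {pk!r} already exists")
        self.rows[pk] = dict(fields)


def validate(record):
    """Hypothetical validator: require a model name, a pk, and a fields dict."""
    for key in ("model", "pk", "fields"):
        if key not in record:
            raise ValidationError(f"missing key: {key}")
    if not isinstance(record["fields"], dict):
        raise ValidationError("fields must be a mapping")
    return record


def import_records(records, tables):
    """Validate every record first, then insert each one.

    Nothing is written if any record fails validation. A collision with an
    existing row raises partway through; a real implementation would
    presumably wrap the inserts in a transaction.
    """
    validated = [validate(r) for r in records]
    for rec in validated:
        table = tables.setdefault(rec["model"], InMemoryTable())
        table.create(rec["pk"], rec["fields"])
    return len(validated)
```

Calling `import_records` a second time with a record whose primary key already exists raises `AlreadyExistsError` and leaves the stored row untouched, which is the "no existing data is modified" guarantee described in step 2.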

Risks

The major risk is that by changing the process, which works in its own brittle way at the moment, we introduce production breakages or data corruption. To mitigate this, we will need an expansive test suite that checks that the process does not damage data on either the exporting or importing side.

In terms of resources and API design, we are going to be importing and then processing very large JSON blobs, then merging them into production databases. Care will need to be taken to ensure that these operations are all properly secured and throttled, so as not to introduce user-input vulnerabilities, via either malicious intent or simply very large inputs.
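
One simple guard against oversized inputs is to cap how many bytes are read before any parsing happens. The sketch below assumes the upload arrives as a binary file-like object; `read_limited_json`, `UploadTooLargeError`, and the limit values are hypothetical names and numbers chosen for illustration, not part of any existing endpoint.

```python
import io
import json


class UploadTooLargeError(Exception):
    """Raised when an upload exceeds the configured byte limit."""


def read_limited_json(stream, max_bytes, chunk_size=64 * 1024):
    """Read a binary stream in chunks, refusing anything over `max_bytes`.

    The JSON is only parsed once the whole payload is known to fit
    under the limit.
    """
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise UploadTooLargeError(f"upload exceeds {max_bytes} bytes")
    return json.loads(buf.decode("utf-8"))


# Usage: a small payload parses, an oversized one is rejected early.
small = io.BytesIO(b'[{"model": "org", "pk": 1, "fields": {}}]')
print(read_limited_json(small, max_bytes=1024))
```

This only bounds memory use for a single request; per-user throttling and protection against malicious payloads would still need to be handled separately.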

Because we are uniting two implementations into one, there is always some risk that some property of one of the implementations will be lost. It is a bit difficult to ascertain how likely this is because of the almost complete lack of tests for both implementations, so we will need to rely on some combination of a new but thorough test suite and user reports to guard against this.
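
One way to catch a lost property is to run a single table of cases against both implementations and report where they disagree with the expected result. The sketch below uses two toy importers as stand-ins; `importer_a`, `importer_b`, `SHARED_CASES`, and `run_shared_cases` are all hypothetical, and in a real codebase this would more likely be a parametrized test.

```python
import copy

# Each case: (description, backup records, expected resulting state).
SHARED_CASES = [
    ("empty backup", [], {}),
    (
        "single org",
        [{"model": "org", "pk": 1, "fields": {"slug": "acme"}}],
        {("org", 1): {"slug": "acme"}},
    ),
]


def importer_a(records):
    """Toy stand-in for one implementation."""
    return {(r["model"], r["pk"]): r["fields"] for r in records}


def importer_b(records):
    """Toy stand-in for the other implementation, written differently on purpose."""
    state = {}
    for r in records:
        state[(r["model"], r["pk"])] = dict(r["fields"])
    return state


def run_shared_cases(importers, cases=SHARED_CASES):
    """Return the (importer name, case description) pairs that failed."""
    failures = []
    for name, fn in importers.items():
        for description, records, expected in cases:
            # Deep-copy the input so one importer cannot affect the next.
            if fn(copy.deepcopy(records)) != expected:
                failures.append((name, description))
    return failures


print(run_shared_cases({"a": importer_a, "b": importer_b}))
```

An empty list means both implementations agree with every expected state; any entry names the implementation and the case where behavior diverged.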

Open Questions

There are some important open questions that will need to be resolved during implementation:

Potential Future Work

All of the non-goals mentioned above (increasing the scope of relocatable artifacts, one-click server-to-server integration, and more customizable and precise relocation operations) are on the table as we move forward. In particular, it would be very nice to get to an end state where users start a relocation (either self-hosted -> SaaS, or SaaS region-to-region via hybrid cloud), and we seamlessly move 100% of their region-siloed data over in a way that is almost entirely opaque to them. This could include temporarily forwarding events that occur while the relocation is taking place, and carefully handing over control between the source and target of the relocation, so that from the user perspective, the whole operation is “one click and wait for a confirmation email” easy.


# Q3 Milestones
- [ ] https://github.com/getsentry/team-ospo/issues/154
- [ ] https://github.com/getsentry/team-ospo/issues/155
- [ ] https://github.com/getsentry/team-ospo/issues/156
- [ ] https://github.com/getsentry/team-ospo/issues/158
- [ ] https://github.com/getsentry/team-ospo/issues/170
- [ ] https://github.com/getsentry/team-ospo/issues/171
- [ ] https://github.com/getsentry/team-ospo/issues/166
- [ ] https://github.com/getsentry/team-ospo/issues/182
- [ ] https://github.com/getsentry/team-ospo/issues/167
- [ ] https://github.com/getsentry/team-ospo/issues/172
- [x] ***MILESTONE 1 DONE:** All import/export functionality works locally*
- [ ] https://github.com/getsentry/team-ospo/issues/181
- [ ] https://github.com/getsentry/team-ospo/issues/183
- [ ] https://github.com/getsentry/team-ospo/issues/184
- [ ] https://github.com/getsentry/team-ospo/issues/193
- [ ] https://github.com/getsentry/team-ospo/issues/192
- [ ] https://github.com/getsentry/team-ospo/issues/168
- [ ] https://github.com/getsentry/team-ospo/issues/199
- [x] ***MILESTONE 2 DONE:** Imports are properly validated using production services*
- [ ] https://github.com/getsentry/team-ospo/issues/186
- [ ] https://github.com/getsentry/team-ospo/issues/201
- [ ] https://github.com/getsentry/team-ospo/issues/185
- [ ] https://github.com/getsentry/team-ospo/issues/196
- [ ] https://github.com/getsentry/team-ospo/issues/202
- [ ] https://github.com/getsentry/team-ospo/issues/204
- [ ] https://github.com/getsentry/team-ospo/issues/203
- [ ] https://github.com/getsentry/team-ospo/issues/207
- [ ] https://github.com/getsentry/team-ospo/issues/197
- [ ] https://github.com/getsentry/team-ospo/issues/169
- [x] ***MILESTONE 3 DONE:** Feature deployed behind API endpoint in limited availability*
# Q4 Workstreams
- [ ] https://github.com/getsentry/team-ospo/issues/215
- [ ] https://github.com/getsentry/team-ospo/issues/210
- [ ] https://github.com/getsentry/team-ospo/issues/217
- [ ] https://github.com/getsentry/team-ospo/issues/178
- [x] Enable SE-assisted portion of feature and monitor uptake
- [x] ***MILESTONE 4 DONE:** Feature polished for early-availability release*
- [ ] https://github.com/getsentry/team-ospo/issues/195
- [ ] https://github.com/getsentry/team-ospo/issues/222
- [ ] https://github.com/getsentry/team-ospo/issues/191
- [ ] https://github.com/getsentry/team-ospo/issues/213
- [x] ***MILESTONE 5 DONE**: Planned Q3 work completed, in general-availability*
- [ ] https://github.com/getsentry/team-ospo/issues/214
- [x] Create dashboard with visibility into conversion rate
- [ ] ***MILESTONE 6:** Fully rolled out and stable*
- [ ] ***WORKSTREAM 1**: Support backup from older versions*
- [ ] https://github.com/getsentry/team-ospo/issues/179
- [ ] ***WORKSTREAM 2**: Enable all viable models to be relocatable, including high-volume ClickHouse-stored ones*
- [ ] https://github.com/getsentry/team-ospo/issues/187
- [ ] ***WORKSTREAM 3**: Other desirable feature improvements*
- [ ] Support abruptly stopped relocations, with some auditing included
- [ ] Get self-hosted broadcast running so we can notify people about this
- [ ] Support import chunking
### Not Yet
- [ ] https://github.com/getsentry/team-ospo/issues/190
- [ ] https://github.com/getsentry/team-ospo/issues/188
- [ ] https://github.com/getsentry/team-ospo/issues/216
chadwhitacre commented 1 year ago
chadwhitacre commented 12 months ago

After talking with PMM, I've added a new stretch goal for this project: to get the self-hosted broadcast system running. It would be great in general to be able to send "What's New" messages to self-hosted users (we could announce new versions, for example, especially out-of-band releases, as well as the Self-hosted Sesh). This could also entail understanding why the beacon data is so off, since the beacon and broadcasts are related.

chadwhitacre commented 9 months ago

Design: Relocation-Specific Models

chadwhitacre commented 8 months ago

Chatted with @azaslavsky. I'm going to help recruit self-hosted users to help us develop and test this process. I'll start my prospecting on https://github.com/getsentry/sentry/discussions/49564. We'll likely end up working with SE on this as well.

chadwhitacre commented 6 months ago

Talked in the OSPO team meeting ... EU is a big rollout; if it makes sense, let's ship to US first as a soft launch and take it to EU when that's fully ready.

chadwhitacre commented 3 months ago

Relocation is live in US and EU. 👍