When dealing with huge volumes of data, users may prefer to store a sample of the data rather than every event. Sampling reduces the storage required and, with it, the cost of hosting.
Server-Side Sampling - Clients send every event to the server, and the server decides whether each event is stored.
Client-Side Sampling - Clients send only sampled data, so the server does not need to sample it.
Query-Side Sampling - The server stores 100% of the data but runs each query on only a percentage of it.
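As a rough illustration of the server-side variant, the store-or-drop decision can be a simple random draw per event. This is only a sketch: the function name `should_store` and the use of a seeded `random.Random` are assumptions made for the example, not part of any particular system.

```python
import random


def should_store(sampling_percent: float, rng: random.Random) -> bool:
    """Decide whether one incoming event is kept (hypothetical helper).

    Each event is kept independently with probability sampling_percent / 100.
    """
    return rng.random() < sampling_percent / 100


# Seeded generator so the example is reproducible.
rng = random.Random(42)
kept = sum(should_store(5, rng) for _ in range(100_000))
print(f"kept {kept} of 100000 events")  # close to 5000, not exactly 5000
```

Because the draw is random, the number of stored events is only approximately 5% of the total; the same check could run on the client for client-side sampling.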
Plotting Sampled Data to 100%
Suppose we are dealing with a page view event that is sampled at 5%.
So only about 5 of every 100 events received are stored. When querying, however, users still need to see figures that represent 100% of their data. For example, users want to count how many page view events occurred. Although only 5 events are stored, the 5% sampling rate means we report to users that 100 events occurred.
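The scaling in this example is a single multiplication: the stored count times 100 divided by the sampling percentage. A minimal sketch, where `estimate_total` is a hypothetical name chosen for illustration:

```python
def estimate_total(stored_count: int, sampling_percent: float) -> float:
    """Scale a sampled count back up to an estimate of the full count."""
    if not 0 < sampling_percent <= 100:
        raise ValueError("sampling_percent must be in (0, 100]")
    return stored_count * 100 / sampling_percent


# 5 stored page view events at 5% sampling are reported as 100 events.
print(estimate_total(5, 5))
```

The result is an estimate, not an exact count, since the sample itself only approximates 5% of the real traffic.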
Change of sampling percentage
Users may change the sampling percentage over time. For example, a user might sample at 5% in the first month, 10% in the second month, 8% in the third month, and so on.
How do we scale the data correctly when a query spans all 3 months, each stored at a different sampling %?
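To see why this is a problem, compare scaling the whole range by the current sampling % with scaling each month by the rate it was actually stored at. The stored counts below are made-up numbers for illustration, and the per-month calculation assumes the rate in force for each month is known at query time, which the setup above does not guarantee.

```python
# Hypothetical stored counts and the sampling % in force for each month.
months = [
    {"month": 1, "stored": 50, "percent": 5},
    {"month": 2, "stored": 120, "percent": 10},
    {"month": 3, "stored": 80, "percent": 8},
]

current_percent = months[-1]["percent"]
total_stored = sum(m["stored"] for m in months)

# Naive: treat all stored events as if they were sampled at the current rate.
naive_estimate = total_stored * 100 / current_percent

# Reference: scale each month by its own rate, then add the results.
per_month_estimate = sum(m["stored"] * 100 / m["percent"] for m in months)

print(naive_estimate)      # 3125.0
print(per_month_estimate)  # 3200.0
```

The two figures disagree, so a single scaling factor applied across a period with mixed rates misreports the total.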
Possible solutions to the above problem
One option is to not allow users to change the sampling % once it is defined. A user who wants a different rate has to create a new event and move all data from the existing event to the new event at the specified sampling %.
Another option: whenever a user changes the sampling %, update all existing data to match the new sampling %. Do not allow the change to take effect in the middle of a day; apply it after the current date. This lets us update all earlier data up to the current date, since we do not store a unique event ID for every event (the ID burst issue).
Use an event-driven architecture to update the existing data.
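The second option could be sketched as follows. This assumes the stored data is a per-day count for each event (consistent with not keeping per-event IDs, but still an assumption), and that a rate change is published as a message and handled by a worker. `SamplingRateChanged`, `rescale_daily_counts`, and `drain` are hypothetical names, and an in-process `queue.Queue` stands in for whatever message broker would actually be used.

```python
import queue
from dataclasses import dataclass
from datetime import date


@dataclass
class SamplingRateChanged:
    """Hypothetical message emitted when a user changes the sampling %."""
    event_name: str
    old_percent: float
    new_percent: float
    effective_after: date  # the change applies to days after this date


def rescale_daily_counts(daily_counts, change):
    """Rescale stored per-day counts so they match the new sampling %."""
    factor = change.new_percent / change.old_percent
    return {
        # Only days up to the cut-off were stored at the old rate.
        day: round(count * factor) if day <= change.effective_after else count
        for day, count in daily_counts.items()
    }


def drain(changes, store):
    """Worker loop: apply every pending rate change to the stored counts."""
    while not changes.empty():
        change = changes.get()
        store[change.event_name] = rescale_daily_counts(
            store[change.event_name], change
        )


store = {"page_view": {date(2024, 1, 1): 50, date(2024, 1, 2): 100}}
changes = queue.Queue()
changes.put(SamplingRateChanged("page_view", 5, 10, date(2024, 1, 2)))
drain(changes, store)
print(store["page_view"])
```

Rounding the rescaled counts introduces a small error, so repeated rate changes would gradually reduce accuracy. Applying the change only at a day boundary keeps every stored day at a single rate.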