Closed bochecha closed 3 years ago
I'm not sure we're ready to merge this at the moment, but at least the change is here, we can bite the bullet when the time comes.
It turns out just removing the column doesn't actually reclaim any space. This requires a VACUUM FULL
, which locks the table for a while. (it took 40 minutes on my laptop)
Before the vacuum, my local database took ~50GB. During the vacuum it grew to more than 80GB. After the vacuum it had shrunk to ~40GB.
Do you think that it’s interesting to merge this PR? I can fix conflicts if needed.
I think this is worth doing. It may be worth checking whether all fields of the request are used. (If not we would be discarding data.) I believe the serialized payload for unrecognised events is separately stored in 3 other tables, right?
It may be worth checking whether all fields of the request are used.
It’s hard for me to tell. Current fields are:
sha512, received_at and send_number are probably useful for technical and debug reasons. absolute_timestamp and relative_timestamp are useful according to this explanation (fascinating), and machine_id is probably needed, at least until we find a way to store more anonymous data.
I believe the serialized payload for unrecognised events is separately stored in 3 other tables, right?
Payload is stored in unknown_aggregate_event, unknown_sequence, unknown_singular_event. It’s also stored in tables for invalid events.
It may be worth checking whether all fields of the request are used.
It’s hard for me to tell. Current fields are:
- sha512
- received_at
- absolute_timestamp
- relative_timestamp
- machine_id
- send_number
sha512, received_at and send_number are probably useful for technical and debug reasons. absolute_timestamp and relative_timestamp are useful according to this explanation (fascinating), and machine_id is probably needed, at least until we find a way to store more anonymous data.
My question is: are these fields all extracted and used for some purpose? e.g. machine_id
clearly is because it ends up in metrics_request_v2
. So there is no particular need to keep a separate copy of it. I guess the two timestamps are combined to form another timestamp which is stored in a different field?
My question is: are these fields all extracted and used for some purpose? e.g.
machine_id
clearly is because it ends up inmetrics_request_v2
I’m really sorry, but there’s something I don’t get. 😒
As far as I can understand, this PR removes the serialized
column, as it assumes that all the fields included in this serialized data are stored in a table. They are: all the fields listed in my comment are stored in metrics_request_v2
.
So, I’m sorry… I’m not sure that I understand your question.
I guess the two timestamps are combined to form another timestamp which is stored in a different field?
Yes, they are used to create occured_at
for singular and aggregate events, and started_at
/ stopped_at
for sequence events.
My question is: are these fields all extracted and used for some purpose? e.g.
machine_id
clearly is because it ends up inmetrics_request_v2
[…] They are: all the fields listed in my comment are stored in
metrics_request_v2
.
Great – so everything in the serialized request is stored separately in the database, and (assuming we trust that it's being unpacked correctly, which we do) then there is no reason to keep the serialized request around.
I just found an old ticket with an interesting question that we actually can only answer with the raw request data, namely estimating the cost to users of metrics instrumentation. To do that we would need to know the size of the HTTPS requests they submit. At present we could get this from the size of the serialized request plus an approximately fixed overhead. If we drop the raw request it will become much harder.
But… in 5 years we have never actually bothered to answer this question so I am tempted to say YAGNI. We could always readd something if needed.
If we drop the raw request it will become much harder.
As we have the payload for each table, it would be a bit long but not that hard to get this information, at least for fields with a fixed size.
Rebased onto master after #142.
There was a rather confusing merge 238d142 on this branch with the same patch on both sides of the merge, f0aa0de and 4981bf9.
The only difference between the two patches was in the alembic migration, to base the migration on 63365dc8695a
(added in 92914f69146c4b56da922a0719a82438b822e45c) rather than 98078d059259
(an older revision). So I amended the patch by hand (having got confused by git) and force-pushed without the confusing merge.
Oh, actually I was confused because it was the merge commit which set the correct down_revision
.
I had decided to keep that originally so that in case of a dramatic failure we would be able to reparse everything from the raw bytes to fix whatever issue had come up.
It seems like things are working just fine though, so the time has come to remove it and reclaim some disk space.
https://phabricator.endlessm.com/T31411