metrics: Drop the serialized request column

bochecha commented 4 years ago

I had decided to keep that originally so that in case of a dramatic failure we would be able to reparse everything from the raw bytes to fix whatever issue had come up.

It seems like things are working just fine though, so the time has come to remove it and reclaim some disk space.

https://phabricator.endlessm.com/T31411

bochecha commented 4 years ago

I'm not sure we're ready to merge this at the moment, but at least the change is here, we can bite the bullet when the time comes.

bochecha commented 4 years ago

It turns out just removing the column doesn't actually reclaim any space. This requires a VACUUM FULL, which locks the table for a while. (it took 40 minutes on my laptop)

Before the vacuum, my local database took ~50GB. During the vacuum it grew to more than 80GB. After the vacuum it had shrunk to ~40GB.

liZe commented 3 years ago

Do you think that it’s interesting to merge this PR? I can fix conflicts if needed.

wjt commented 3 years ago

I think this is worth doing. It may be worth checking whether all fields of the request are used. (If not we would be discarding data.) I believe the serialized payload for unrecognised events is separately stored in 3 other tables, right?

liZe commented 3 years ago

It may be worth checking whether all fields of the request are used.

It’s hard for me to tell. Current fields are:

sha512
received_at
absolute_timestamp
relative_timestamp
machine_id
send_number

sha512, received_at and send_number are probably useful for technical and debug reasons. absolute_timestamp and relative_timestamp are useful according to this explanation (fascinating), and machine_id is probably needed, at least until we find a way to store more anonymous data.

I believe the serialized payload for unrecognised events is separately stored in 3 other tables, right?

Payload is stored in unknown_aggregate_event, unknown_sequence, unknown_singular_event. It’s also stored in tables for invalid events.

wjt commented 3 years ago

It may be worth checking whether all fields of the request are used.

It’s hard for me to tell. Current fields are:

sha512

received_at

absolute_timestamp

relative_timestamp

machine_id

send_number

sha512, received_at and send_number are probably useful for technical and debug reasons. absolute_timestamp and relative_timestamp are useful according to this explanation (fascinating), and machine_id is probably needed, at least until we find a way to store more anonymous data.

My question is: are these fields all extracted and used for some purpose? e.g. machine_id clearly is because it ends up in metrics_request_v2. So there is no particular need to keep a separate copy of it. I guess the two timestamps are combined to form another timestamp which is stored in a different field?

liZe commented 3 years ago

My question is: are these fields all extracted and used for some purpose? e.g. machine_id clearly is because it ends up in metrics_request_v2

I’m really sorry, but there’s something I don’t get. 😒

As far as I can understand, this PR removes the serialized column, as it assumes that all the fields included in this serialized data are stored in a table. They are: all the fields listed in my comment are stored in metrics_request_v2.

So, I’m sorry… I’m not sure that I understand your question.

I guess the two timestamps are combined to form another timestamp which is stored in a different field?

Yes, they are used to create occured_at for singular and aggregate events, and started_at / stopped_at for sequence events.

wjt commented 3 years ago

My question is: are these fields all extracted and used for some purpose? e.g. machine_id clearly is because it ends up in metrics_request_v2

[…] They are: all the fields listed in my comment are stored in metrics_request_v2.

Great – so everything in the serialized request is stored separately in the database, and (assuming we trust that it's being unpacked correctly, which we do) then there is no reason to keep the serialized request around.

wjt commented 3 years ago

I just found an old ticket with an interesting question that we actually can only answer with the raw request data, namely estimating the cost to users of metrics instrumentation. To do that we would need to know the size of the HTTPS requests they submit. At present we could get this from the size of the serialized request plus an approximately fixed overhead. If we drop the raw request it will become much harder.

But… in 5 years we have never actually bothered to answer this question so I am tempted to say YAGNI. We could always readd something if needed.

liZe commented 3 years ago

If we drop the raw request it will become much harder.

As we have the payload for each table, it would be a bit long but not that hard to get this information, at least for fields with a fixed size.

liZe commented 3 years ago

Rebased onto master after #142.

wjt commented 3 years ago

There was a rather confusing merge 238d142 on this branch with the same patch on both sides of the merge, f0aa0de and 4981bf9.

Capture d’écran de 2021-02-17 10-42-21

The only difference between the two patches was in the alembic migration, to base the migration on 63365dc8695a (added in 92914f69146c4b56da922a0719a82438b822e45c) rather than 98078d059259 (an older revision). So I amended the patch by hand (having got confused by git) and force-pushed without the confusing merge.

wjt commented 3 years ago

Oh, actually I was confused because it was the merge commit which set the correct down_revision.

endlessm / azafea

metrics: Drop the serialized request column #79