endlessm / azafea

Service to track device activations and usage metrics
Mozilla Public License 2.0

Add new event tables #121

Closed: liZe closed this 3 years ago

liZe commented 4 years ago

(WIP: tests missing.)

Fix #64.

liZe commented 4 years ago

:heavy_check_mark: replaying unknown data works!

I still need to figure out why the size of the read data is such a huge value. It looks more like a token (it’s advertised as a token in eos-event-recorder-daemon) than a data size…

liZe commented 4 years ago

I still need to figure out why the size of the read data is such a huge value. It looks more like a token (it’s advertised as a token in eos-event-recorder-daemon) than a data size…

@adarnimrod @dbnicholson I think that I’ve finally found where the problem comes from.

https://github.com/endlessm/eos-event-recorder-daemon/blob/bf6b292b2088d41bd3445c95d4f34f9a38d444b2/daemon/emer-daemon.c#L785

Shouldn’t we use *token instead of token?

liZe commented 4 years ago

This PR is ready for review.

Tests have been added; they also check the clamping function.
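
As a minimal sketch of what such a clamping function looks like (the function name here is an assumption, not necessarily the one in the PR): values destined for a PostgreSQL BIGINT column must fit in the signed 64-bit range, so anything outside it is clamped to the nearest bound.

```python
# Hypothetical clamping helper: PostgreSQL BIGINT columns can only hold
# signed 64-bit integers, so out-of-range metric values are clamped
# to the nearest representable bound before insertion.

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def clamp_to_int64(value: int) -> int:
    """Clamp an integer into the signed 64-bit range."""
    return max(INT64_MIN, min(INT64_MAX, value))
```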

Replaying old data works and shouldn’t take too long on production with enough workers. The code automatically replays only known events, there’s no need to manually filter them.

From the sample of old data I’ve tested, the content of the columns is OK. The only problem we have is the size of data read from cache, but that’s a bug in eos-event-recorder-daemon as far as I can tell. It means that all the data we have so far is meaningless for this value, and there’s no way to get it back. I can fix the bug in eos-event-recorder-daemon, if you think that it’s a bug too.

You can reach me today; I’ll also read my messages on Slack next week.

wjt commented 4 years ago

@liZe great catch. I opened a PR to fix the *token typo and also another type error by inspection (but I haven't even compiled it!). https://github.com/endlessm/eos-event-recorder-daemon/pull/239

I think we should just accept that many old events have garbage in that field. It would be interesting to know whether any events have garbage in the first field, given the other bug I think I found :)

wjt commented 4 years ago

@adarnimrod could you take a look at the details here?

liZe commented 3 years ago

It would be interesting to know whether any events have garbage in the first field, given the other bug I think I found :)

After replaying 35,000 events (100 lines added to cache_has_invalid_elements), all the values are good. I’ve let the script run; I’ll give more reliable information when I have more lines.

liZe commented 3 years ago

I’ve let the script run; I’ll give more reliable information when I have more lines.

Replaying old events is really, really slow.

After 100,000 events (200 lines in cache_has_invalid_elements), there’s no obviously wrong value. I’ll modify the script to only replay cache events and run it one more night.

liZe commented 3 years ago

I’ll modify the script to only replay cache events and run it one more night.

After one night, I’ve inserted almost 30,000 lines into cache_has_invalid_elements. 170 lines have a bad value for number_of_valid_elements. You were right, @wjt!

I don’t know what to do with these wrong values. We can probably fix them later.

The replay also crashed:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lize/Informatique/azafea/azafea/__main__.py", line 16, in <module>
    cli.run_command(*sys.argv[1:])
  File "/home/lize/Informatique/azafea/azafea/cli/__init__.py", line 63, in run_command
    args.subcommand(config, args)
  File "/home/lize/Informatique/azafea/azafea/event_processors/endless/metrics/v2/cli.py", line 450, in do_replay_unknown
    dbsession.commit()
  File "/home/lize/.local/share/virtualenvs/azafea-_3hjtXBi/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1036, in commit
    self.transaction.commit()
  File "/home/lize/.local/share/virtualenvs/azafea-_3hjtXBi/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 503, in commit
    self._prepare_impl()
  File "/home/lize/.local/share/virtualenvs/azafea-_3hjtXBi/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 471, in _prepare_impl
    self.session.dispatch.before_commit(self.session)
  File "/home/lize/.local/share/virtualenvs/azafea-_3hjtXBi/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/home/lize/Informatique/azafea/azafea/event_processors/endless/metrics/events/__init__.py", line 893, in receive_before_commit
    upsert_machine_demo(dbsession, instance.request.machine_id)
AttributeError: 'NoneType' object has no attribute 'machine_id'

I had to add a dirty workaround to avoid the crash, but there’s probably a much better fix to be found. @adarnimrod Have you seen this traceback before?

liZe commented 3 years ago

    upsert_machine_demo(dbsession, instance.request.machine_id)
AttributeError: 'NoneType' object has no attribute 'machine_id'

I don’t understand how it’s possible for instance.request to be None. Unknown lines all have a request_id that is definitely in the metrics_request_v2 table, since there’s a foreign key constraint.
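
A minimal sketch of the surprising behaviour (with assumed model names, not the real azafea classes): a transient ORM object can carry a perfectly valid foreign key value while its relationship attribute is still None, because nothing has loaded it yet.

```python
# Sketch with assumed names: a not-yet-inserted object has request_id
# set, but its relationship attribute has never been loaded, so
# accessing it simply returns None instead of the related Request.
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Request(Base):
    __tablename__ = 'metrics_request_v2'
    id = Column(Integer, primary_key=True)


class UnknownEvent(Base):
    __tablename__ = 'unknown_event'
    id = Column(Integer, primary_key=True)
    request_id = Column(Integer, ForeignKey('metrics_request_v2.id'))
    request = relationship(Request)


# The foreign key is set and would be valid in the database, yet the
# relationship attribute is None: no session has lazy-loaded it.
event = UnknownEvent(request_id=1)
result = event.request
```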

adarnimrod commented 3 years ago

@adarnimrod Have you seen this traceback before?

No, I haven't seen an error like that before.

liZe commented 3 years ago

I’ve tried to reproduce the error, but I’ve run the script for multiple nights without a crash. There’s nothing strange in the database. It was probably a network or database problem.

I think we can launch the script and keep an eye on it. Even if it crashes, the table won’t be left in a bad state. It would be annoying for sure, but not that bad.

liZe commented 3 years ago

Thanks to the new backup, I’ve been able to reproduce the bug and debug it. It will definitely happen in production if it’s not fixed.

We have a crash because the relationship attribute request is not set on this new object. When ORM objects are created, their relationships are generally lazy-loaded when needed, but in this case request is not loaded and returns None when accessed.

Flushing the session sets the request attribute and fixes the bug. But that’s not the best solution.

SQLAlchemy enables lazy loading for relationships when possible, but it’s obviously not possible in this case. I still have to find out why, and it’s complicated: the request relationship is not a classical SQL relationship, because it’s set on all the tables that have a foreign key on the Request table. To avoid declaring the attribute on each table, it’s set on the MetricEvent abstract class with a declared_attr decorator. The request_id foreign key is a declared_attr too, but it works in our case because it’s set during the ORM object creation…
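
The pattern described above looks roughly like this (a simplified sketch with assumed names, not the actual azafea code): declared_attr defers creation of the column and relationship until each concrete event subclass is mapped, so every event table gets its own request_id column and request relationship without repeating the declarations.

```python
# Simplified sketch of the declared_attr pattern on an abstract base.
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base, declared_attr, relationship

Base = declarative_base()


class Request(Base):
    __tablename__ = 'metrics_request_v2'
    id = Column(Integer, primary_key=True)


class MetricEvent(Base):
    __abstract__ = True  # no table of its own

    id = Column(Integer, primary_key=True)

    # declared_attr defers creation until each concrete subclass is
    # mapped, so each event table gets its own copy of the column
    # and of the relationship.
    @declared_attr
    def request_id(cls):
        return Column(Integer, ForeignKey('metrics_request_v2.id'))

    @declared_attr
    def request(cls):
        return relationship(Request)


class ShutdownEvent(MetricEvent):
    __tablename__ = 'shutdown_event'
```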

I’ll try different changes to understand why the relationship is not available, and to find out why it only happens with this peculiar row.

liZe commented 3 years ago

I think that the problem is now solved.

Relationship attributes are not available until the session is flushed or committed. We have to use enable_relationship_loading to lazily load the relationships even if the row has not been inserted yet. The commit message includes more information about this.
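
A minimal sketch of what enable_relationship_loading does (again with assumed model names): it associates a transient object with the session for loading purposes, so its relationships can lazy-load from the foreign key values already set on it, before the row is ever flushed or inserted.

```python
# Sketch with assumed names: Session.enable_relationship_loading()
# lets a not-yet-inserted object lazy-load its relationships from
# the foreign key values currently set on it.
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Request(Base):
    __tablename__ = 'metrics_request_v2'
    id = Column(Integer, primary_key=True)


class UnknownEvent(Base):
    __tablename__ = 'unknown_event'
    id = Column(Integer, primary_key=True)
    request_id = Column(Integer, ForeignKey('metrics_request_v2.id'))
    request = relationship(Request)


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Request(id=1))
    session.commit()

    # A brand-new, not-yet-inserted event: without the call below,
    # accessing event.request would simply return None.
    event = UnknownEvent(request_id=1)
    session.enable_relationship_loading(event)

    # The relationship now lazy-loads from the request_id value even
    # though the event row has not been flushed or inserted.
    loaded_id = event.request.id
```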

I’m not sure that repeating the enable_relationship_loading call is the best solution, but it’s at least readable and doesn’t enable the feature when we don’t require it.

adarnimrod commented 3 years ago

Do we have a Phabricator task for this?

wjt commented 3 years ago

Created https://phabricator.endlessm.com/T30987