
Query Node "Intervention" Meeting #3708

Closed bedeho closed 2 years ago

bedeho commented 2 years ago

Question

Should we do this meeting sooner? That depends on how soon people can get their deliverables ready; see below.

Meeting

Tue 10th

Participants

Background

The Query Node (QN) is an instrumental piece of infrastructure that not only powers key applications in the Joystream ecosystem, namely Pioneer and Atlas, but also powers key secondary infrastructure software like storage and content delivery nodes, as well as the CLI and other tooling.

Current Problems

Here is an incomplete and unranked list of problems that seem to occur as part of the development and usage of the QN.

Note: This list reflects my point of view, which is not very granular or complete, and so it may need adjustments before the meeting.

1. Unreliable

It very frequently goes down, both at various stages of development and in production. Bugs during development are particularly costly, as teams relying on the QN have to wait around, perhaps mocking APIs, while a QN developer takes a look, fixes the problem and redeploys.

2. Impotent API

It cannot, as a matter of Hydra constraints, offer the APIs that client teams either prefer or absolutely need for basic product features. This often requires complex coordination about which product features are required vs. nice to have, and exactly how to work around the limitations that are unavoidable. We now frequently have to build redundant state into the API, which makes mappings harder to write correctly, further contributing to the previous problem.

3. Sync issues

This one may just be me misunderstanding.

On occasion it seems that we have had problems caused by mistakes in deployments, in particular regarding where to process from for a given purpose.

4. Incident Management

We rarely seem to have errors detected by our own Jsgenesis personnel; it is almost always the community that finds the problem and then has to alert someone on Discord, and whoever catches the report ends up having to track down whoever is able to look into the problem.

5. Debugging

It seems to take quite a bit of time to determine, after the fact, why exactly the QN went down.

Meeting Purpose

Map out and prioritise which changes we should make to:

a. Hydra
b. QN development process
c. QN deployment process
d. QN debugging process
e. QN infrastructure routines

What we are looking for is to identify the really high-value and cheap interventions we can make right now, because there is other urgent work we need to commence imminently, specifically

Deliverables Deadline

Submit as reply to this issue no later than:

Friday 6th, 18:00 CET

Deliverables

For each person in the meeting, please review your past tasks, bugs, bug fixes, deployment problems, and pain points dealing with the QN, and note which of buckets 1-5 each falls into. In particular perhaps

Agenda

kdembler commented 2 years ago

I think you did a good job capturing the most pertinent issues. The biggest pains from my perspective:

1. Unreliability

One issue is the QN going down entirely; another is bugs, more or less visible. Sometimes things will be working fine, but you add one specific field to the query and the request fails. Those are often found mid-development, blocking the developer.

2. Restricted filtering

The fact that we cannot use a filter like where: { video: { channel: { id_eq: "xxx" } } } because it is one level too deep. This restricts us quite often and, for Rhodes, resulted in probably 10+ requests to QN developers to "flatten" fields on different events and entities so that they can be consumed. Also, the lack of basic NOT filtering is sometimes problematic.
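
For illustration, this is roughly the shape of query we would like to write versus the flattened field we end up requesting (entity and field names below are only indicative, not the actual schema):

# Desired, but fails: the filter is nested one level too deep
query {
  auctionBidMadeEvents(where: { video: { channel: { id_eq: "some-channel-id" } } }) {
    id
  }
}

# Workaround we end up requesting: a field "flattened" onto the event itself
query {
  auctionBidMadeEvents(where: { videoChannelId_eq: "some-channel-id" }) {
    id
  }
}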

3. Hydra weirdnesses

We had to work around a couple of Hydra issues when working on Rhodes, mostly related to variants and relations. Current schema for NFT should be a good example:

type OwnedNft {
  # ...
  transactionalStatus: TransactionalStatusIdle | TransactionalStatusInitiatedOfferToMember | TransactionalStatusBuyNow
  transactionalStatusAuction: Auction
}

We keep accumulating those workarounds.
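
For contrast, here is a rough sketch of what we would rather express if Hydra variants could carry relations, i.e. keeping the auction reference inside a single status field instead of the parallel transactionalStatusAuction field (hypothetical input schema, not valid with Hydra today):

type TransactionalStatusAuction @variant {
  # a relation inside a variant member - the part Hydra does not support
  auction: Auction!
}

union TransactionalStatus = TransactionalStatusIdle | TransactionalStatusInitiatedOfferToMember | TransactionalStatusBuyNow | TransactionalStatusAuction

type OwnedNft @entity {
  # ...
  transactionalStatus: TransactionalStatus!
}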

4. Incident Management

I think what you brought up is a very valid point, that we don't have much visibility into when things break. For Atlas this wasn't that big of an issue, since incidents for us happened mostly in development, so usually a developer noticed the QN not syncing right away. However, it has been an issue to get access to the QN to get logs (this has been mostly alleviated at this point). There is also the fact that a sync takes quite a bit of time to finish. It's not an issue as long as we don't do it often, but during Rhodes development there were days when I did 3 re-syncs. If each takes 2 hours to complete, it adds up.

zeeshanakram3 commented 2 years ago

Here is a list of the major QN-related bugs and limitations that I faced, and the PRs that addressed them:


Logical bugs in mappings

These bugs include arithmetic bugs, invalid assumptions about records returned by DB queries, primary key duplication bugs, etc.


Unresolved promises


Unloaded entity relationships

To load a record from the DB, we need to explicitly list, in the mapping, the relations that are required to be loaded. The issue is that devs sometimes forget to pass the relations array, and since there is no compile-time error/warning for this, the bug is only caught during code execution in the form of a QN crash, e.g. for the Video entity:

type Video @entity {
  id: ID!
  channel: Channel!
}

Now, if we need to use the channel relation in a mapping:

// Invalid case
const video = await store.get(Video, { where: { id: SomeId } })
console.log(video.channel) // undefined - the relation was not loaded
// Valid case
const video = await store.get(Video, { where: { id: SomeId }, relations: ['channel'] })
console.log(video.channel) // Channel instance

No support for nullable variant field

This issue was fixed by Hydra PR -> https://github.com/Joystream/hydra/pull/480


Flattening fields

There were a bunch of PRs in this category because filtering queries do not work for deeply nested fields.


No support for querying deeply nested variant fields

Sometimes it would be feasible to define a field as a variant (instead of a separate entity); however, we can't choose that because of Hydra querying limitations, i.e. querying doesn't work for deeply nested variant fields. Point 3 of Klaudiusz's comment points out the same issue.

Suggested Improvements

Increasing test coverage of Integration tests?

I already discussed with Klaudiusz in chat that maybe we need to considerably improve the test coverage of the mappings. There are a lot of use cases, and we are not testing them all. With considerably better test coverage, most logical bugs would be caught in tests, instead of someone trying some edge-case scenario in production and breaking the QN.

Unit test suites (if possible)?

If possible, we should target writing unit tests for individual mappings (mocking the database manager, event objects, etc.).
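
A minimal sketch of what such a unit test could look like; the videoCreated mapping, the event payload and the store mock shape below are purely illustrative assumptions, not our actual API:

import { describe, expect, it, jest } from '@jest/globals'
// hypothetical mapping under test
import { videoCreated } from '../mappings/content'

describe('videoCreated mapping', () => {
  it('saves a Video linked to its Channel', async () => {
    const channel = { id: '1' }
    // mocked subset of the Hydra store used by the mapping
    const store = {
      get: jest.fn(async () => channel),
      save: jest.fn(async () => undefined),
    }
    // hypothetical decoded event payload
    const event = { channelId: '1', videoId: '42' }

    await videoCreated({ store, event } as any)

    // the mapping should persist a Video pointing at the loaded Channel
    expect(store.save).toHaveBeenCalledWith(expect.objectContaining({ id: '42', channel }))
  })
})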

ondratra commented 2 years ago

1. Unreliability

It is very unfortunate that a developer or end user consuming QN content who encounters a bug in the mappings needs to wait for a fix from the QN developers, and is usually blocked until the fix is implemented and deployed.

1.1 Test coverage

As @zeeshanakram3 has pointed out, we could increase the test coverage for the mappings. While that would definitely be beneficial, I'm afraid it would be really costly if we wanted to cover all mappings. That being said, introducing unit tests would decrease development time as well as CI test running time, compared to the integration tests that we solely rely on to test the QN now.

1.2 Dev-net rollback

Many times we have had a situation where an Atlas or Pioneer developer made a blockchain transaction that caused a crash in the processor (due to a bug). They then wished the transaction had never happened, because it blocked all of the other work they could have been doing if the processor were still running.

When I was debugging one of the recent council mapping issues that happened on the current Olympia-net at quite a high block number, I periodically backed up my local processor database using pg_dump. E.g. I ended up having checkpoint dbs with processed data up to block 10,000, 50,000, etc. I could then swap the db content at any time to inspect the state, or start from a checkpoint with fixed mappings and save a lot of the time it would take to reprocess everything.

That brought me to the idea that we could maybe automate this db dump and recovery for our development servers. That would enable us to create checkpoints that we can revert the QN state to. We would also need to find a way to revert blocks in the dev Joystream node to make this work. From what I found, it could be possible using the Substrate revert command; or we could save the chain state to files in a similar way as the QN db.
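
A minimal sketch of the checkpoint side of this idea, assuming the processor db is reachable locally and that we have some hook into the processing loop; the db name, interval, and file naming are made up:

import { execSync } from 'child_process'

const DB = process.env.PROCESSOR_DB ?? 'processor'
const CHECKPOINT_EVERY = 50_000 // blocks

// Call this from the processing loop, or from a cron job polling the last processed block.
export function maybeCheckpoint(lastProcessedBlock: number): void {
  if (lastProcessedBlock === 0 || lastProcessedBlock % CHECKPOINT_EVERY !== 0) return
  const file = `checkpoint-${lastProcessedBlock}.dump`
  // custom-format dump so it can be restored cleanly with pg_restore
  execSync(`pg_dump -Fc -d ${DB} -f ${file}`)
  console.log(`Saved processor checkpoint ${file}`)
}

// Reverting later would be roughly:
//   pg_restore --clean -d processor checkpoint-50000.dump
// plus reverting the dev joystream-node to the matching block.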

2. Impotent API

I guess the biggest issue is the limited filtering possibilities @kdembler mentioned. This issue creates a lot of overhead in the form of what we have lately tended to call flattening data - basically introducing data duplicates in various places, thus breaking normal forms. If we expand the options for filtering, I think the API is not so bad.

4. Incident Management

There are two main types of issues we are constantly encountering.

4.1

The first one is that the processor crashes on some bad implementation of an unexpected situation, and we are oblivious to that until some frontend (Atlas, Pioneer, etc.) user reports it.

To remedy this issue, we could expose the data saved in the processor's db table processed_events_log by introducing a new ProcessedEventsLog entity to the schema, unmasking this currently private table. We could then set up a monitoring service that queries the Joystream node, indexer, and processor; compares their highest block numbers; and alerts us if the indexer or processor lags behind the node.
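
A rough sketch of such a monitor, assuming Node 18+ (global fetch), the standard Substrate JSON-RPC on the node, and hypothetical GraphQL queries for the indexer head and the proposed ProcessedEventsLog entity; the URLs, query names, field names and threshold are all assumptions:

const NODE_RPC = 'http://localhost:9933'
const INDEXER_GQL = 'http://localhost:4000/graphql'
const PROCESSOR_GQL = 'http://localhost:8081/graphql'
const MAX_LAG = 20 // blocks

async function gql(url: string, query: string): Promise<any> {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query }),
  })
  const body: any = await res.json()
  return body.data
}

async function checkLag(): Promise<void> {
  // chain head via the standard JSON-RPC method chain_getHeader
  const rpcRes = await fetch(NODE_RPC, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'chain_getHeader', params: [] }),
  })
  const header: any = await rpcRes.json()
  const chainHead = parseInt(header.result.number, 16)

  // hypothetical queries exposing the indexer head and the processor's last scanned block
  const indexerHead = (await gql(INDEXER_GQL, '{ indexerStatus { head } }')).indexerStatus.head
  const processorHead = (
    await gql(PROCESSOR_GQL, '{ processedEventsLogs(orderBy: lastScannedBlock_DESC, limit: 1) { lastScannedBlock } }')
  ).processedEventsLogs[0].lastScannedBlock

  if (chainHead - indexerHead > MAX_LAG || chainHead - processorHead > MAX_LAG) {
    // here we would alert (Discord webhook, pager, ...) instead of just logging
    console.error(`ALERT: chain=${chainHead} indexer=${indexerHead} processor=${processorHead}`)
  }
}

checkLag().catch(console.error)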

4.2

The second issue is that, due to its linear nature, one bugged mapping crashes the whole processor and blocks any further processing, even of completely unrelated mappings. E.g. a broken council mapping will prevent the processing of storage events.

In some cases, the problem could be solved by introducing extra metadata to the event mapping definitions describing a dependency graph. Each event could have an associated category and a list of categories it depends on. Once a mapping throws an error, the processor could gracefully handle it, block the processing of further events in the dependent categories, and report the issue.

Not only could production setups benefit from this feature, but development instances could use it to isolate currently unstable yet unrelated features.
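
A rough sketch of what such metadata could look like; the names and shapes are purely hypothetical, this is not an existing Hydra feature, just a way to make the idea concrete:

// Hypothetical shape for mapping metadata with categories and dependencies.
type MappingCategory = 'membership' | 'council' | 'storage' | 'content'

interface MappingDefinition {
  event: string                 // e.g. a runtime event name
  category: MappingCategory
  dependsOn: MappingCategory[]  // categories whose failure should also block this one
  handler: (ctx: unknown) => Promise<void>
}

// On a handler error the processor would mark the mapping's category as broken,
// skip further events in that category (and in categories depending on it),
// and keep processing unrelated categories instead of halting entirely.
const brokenCategories = new Set<MappingCategory>()

async function process(def: MappingDefinition, ctx: unknown): Promise<void> {
  if (brokenCategories.has(def.category) || def.dependsOn.some((c) => brokenCategories.has(c))) {
    return // blocked: this category, or one it depends on, already failed
  }
  try {
    await def.handler(ctx)
  } catch (e) {
    brokenCategories.add(def.category)
    console.error(`Mapping for ${def.event} failed, blocking category '${def.category}'`, e)
  }
}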

My recent fixes

Looking at the list of my recent PRs, they contained a combination of:

mnaamani commented 2 years ago

Deployment

To provide a highly available and performant service, we are running two clusters.

  1. Linode cluster (2 nodes). Problems:
    1. The Linode load balancer requires manual TLS cert configuration. Only mokhtar and martin have access, and only mokhtar knows how to update the cert.
    2. Certbot-managed cert on the devops.joystream.org machine.
    3. The two QNs each run their own processor; this can lead to each node possibly generating non-deterministic ids? Or has this been resolved for all cases? It can still cause issues if one node is behind in syncing (from the client query point of view). This does happen occasionally when joystream-node syncing stalls. I suggest we have only a single active node and a hot standby. It is only used as a backup, so that should be okay during the transition. Unless there are queries that are "slow" and block other queries?
  2. Kubernetes (k8s) cluster (3 cluster nodes, 1 processor, 1 indexer, 6 load-balanced GraphQL instances). Solves most issues of the Linode cluster and is used as the primary instance. The Linode cluster is used as a backup when updating the k8s cluster. Managing and monitoring this cluster is an order of magnitude more complicated than the Linode cluster.

Switching between clusters requires a manual DNS record update of the query.joystream.org CNAME record. This could be mitigated by having two QNs running in separate namespaces in the same cluster -> a more complex setup.

If some records do not have deterministic ids, does switching between clusters cause any odd behaviour on the client if the client is passing an id in a query?

Re-indexing vs Re-Processing on updates

It is not clear when it is safe to only re-process vs. re-index and re-process. On more than one occasion there were no types changes or schema changes (only mapping code changes), and it seemed reasonable to only reset the processor db and re-process, but it caused the processor to fail at some point in the mappings (always related to council/election).

Updating without downtime

Updating the production cluster without downtime is always a lengthy manual process:

  1. Switch to the backup cluster
  2. Build a new joystream/apps docker image
  3. Deploy to k8s, update AWS load balancer DNS records
  4. Wait for sync
  5. Switch back to production
  6. Update the backup cluster

General

Given that the query-node is in the monorepo, it can sometimes be problematic to update if other services from the same repo (argus, colossus) are running, but an easy workaround is to have separate directories on the production machine.

The processor was much, much faster pre-Olympia.. what happened? Is it due to processing every single block, or just the fact that there are more events to process and the mappings are slower (performing slow queries before updates)?

Indexing has always been quite slow. How can we speed it up? Is too much logging to the console slowing things down?

Last deployment complexities and debugging issues faced

In general, my ability to debug actual queries on my own is very limited, as I have very little knowledge of the details of the schema or, in some cases, the business logic. So I am limited to ensuring queries do not fail, but I cannot assess whether the actual results are "correct" (unless, of course, I go back and review the mappings implementation).

Debugging the operation of the clusters, i.e. checking logs and checking the status of the clusters, is reasonably easy, but we have no downtime monitoring or alerting tools in place. Currently experimenting with New Relic for the k8s cluster and Prometheus (for joystream-node), with mixed results.

I always seem to forget to update the Linode load balancer certificate in time, so I want to move away from that and possibly use an nginx/caddy server on a standalone host instead. This isn't an issue for the AWS k8s cluster.

We should send/aggregate the logs from the active query nodes to a central location that can be accessed by the team.

Dev suggestions

There needs to be a way to write unit tests for mappings. The only way we can currently test mappings is with integration tests, and we do not have full coverage there.