Closed — bedeho closed this issue 2 years ago
I think you did a good job capturing the most pertinent issues. Biggest pains from my perspective:
One issue is the QN going down entirely; another is bugs of varying visibility. Sometimes things will be working fine, but you add one specific field to the query and the request fails. Those are often found mid-development, blocking the developer.
The fact that we cannot filter like `where: { video: { channel: { id_eq: "xxx" } } }` because it's one level too deep. This restricts us quite often, and for Rhodes it resulted in probably 10+ requests to QN developers to "flatten" fields on different events and entities so that they can be consumed. The lack of basic NOT filtering is also problematic at times.
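To make the "flattening" concrete, here is a hypothetical sketch of the problem. The entity and field names (`nftIssuedEvents`, `videoChannelId_eq`) are illustrative assumptions, not our actual schema: the desired nested filter fails, so a duplicated, flattened field has to be added to the entity being queried.

```typescript
// What we would like to write - fails in Hydra because the
// filter is nested two relation levels deep (video -> channel):
const desiredQuery = `
  query {
    nftIssuedEvents(where: { video: { channel: { id_eq: "xxx" } } }) { id }
  }
`

// The workaround: the channel id is duplicated ("flattened") onto the
// event entity itself, so the filter only needs one level:
const flattenedQuery = `
  query {
    nftIssuedEvents(where: { videoChannelId_eq: "xxx" }) { id }
  }
`
```

Every such flattened field is redundant state that the mappings must keep in sync, which is exactly the overhead described above.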
We had to work around a couple of Hydra issues when working on Rhodes, mostly related to variants and relations. Current schema for NFT should be a good example:
```graphql
type OwnedNft {
  # ...
  transactionalStatus: TransactionalStatusIdle | TransactionalStatusInitiatedOfferToMember | TransactionalStatusBuyNow
  transactionalStatusAuction: Auction
}
```
We keep accumulating those workarounds.
I think what you brought up is a very valid point, that we don't have much visibility into when things break. For Atlas, this wasn't that big of an issue since incidents for us happened mostly in development, so usually a developer noticed QN not syncing right away. However, it's been an issue to get access to the QN to get logs (this has been mostly alleviated at this point). There is also the fact of sync taking quite a bit of time to finish. It's not an issue assuming we don't do it often, but during Rhodes development there were days when I did 3 re-syncs. If each takes 2 hours to complete, it adds up.
Here is the list of major QN-related bugs and limitations that I faced, and the PRs that addressed them:
These bugs include arithmetic bugs, invalid assumptions about records returned by DB query, primary key duplication bugs, etc.
When loading a record from the DB, we explicitly need to list the relations that should be loaded in mappings. The issue is that devs sometimes forget to pass the relations array, and since there is no compile-time error/warning for this, the bug is only caught during code execution in the form of a QN crash, e.g. for the Video entity:
```graphql
type Video @entity {
  id: ID!
  channel: Channel!
}
```
Now if we need to use the channel relation in a mapping:
```typescript
// Invalid case: the relation was not requested, so it is never loaded
const video = await store.get(Video, { where: { id: someId } })
console.log(video.channel) // undefined

// Valid case
const video = await store.get(Video, { where: { id: someId }, relations: ['channel'] })
console.log(video.channel)
```
This issue was fixed by Hydra PR -> https://github.com/Joystream/hydra/pull/480
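One way to think about the fix is to push the `relations` array into the type system. The sketch below is a hypothetical illustration (not the actual Hydra implementation): only the relations the caller lists become non-optional on the returned entity, so forgetting to request `channel` surfaces as a compile-time error rather than a runtime `undefined`. It is synchronous and in-memory for brevity.

```typescript
interface Channel { id: string }
interface Video { id: string; channel?: Channel }

// Stand-in for the database: one video row with its channel already "joined".
const videoTable: Record<string, Video> = {
  v1: { id: 'v1', channel: { id: 'c1' } },
}

// Only relations listed in `relations` are non-optional on the result type.
function getVideo<K extends keyof Video>(
  id: string,
  relations: K[]
): (Video & Required<Pick<Video, K>>) | undefined {
  const row = videoTable[id]
  if (!row) return undefined
  // A real store would hydrate only the listed relations; here we just
  // verify they are present on the row.
  for (const r of relations) {
    if (row[r] === undefined) throw new Error(`Relation not loaded: ${String(r)}`)
  }
  return row as Video & Required<Pick<Video, K>>
}

// `video.channel` is typed as Channel (not Channel | undefined),
// because 'channel' was explicitly requested:
const video = getVideo('v1', ['channel'])
```

With this shape, `video.channel.id` type-checks only when `'channel'` was passed, which is exactly the class of bug the PR above addressed.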
There were a bunch of PRs in this category, because filtering queries do not work for deeply nested fields.
Sometimes it would be feasible to define a field as a variant (instead of a separate entity); however, we can't choose that because of Hydra querying limitations, i.e. querying doesn't work for deeply nested variant fields. Point 3 of Klaudiusz's comment points out the same issue.
I already discussed with Klaudiusz in chat that we may need to considerably improve the test coverage of mappings. There are a lot of use cases, and we are not testing all of them. With reasonably good test coverage, most logical bugs would be caught in tests, instead of someone trying some edge-case scenario in production and breaking the QN.
If possible, we should target writing unit tests for individual mappings (mocking the database manager, event objects, etc.).
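A minimal sketch of what such a unit test could look like, with a mocked database manager. The mapping, event shape, and entity below are hypothetical stand-ins chosen for illustration, not the actual Hydra interfaces:

```typescript
interface Channel { id: string; videosCount: number }

// Minimal stand-in for the Hydra database manager interface.
interface Store {
  get(id: string): Channel | undefined
  save(entity: Channel): void
}

// Hypothetical mapping under test: bump the channel's video counter
// when a video-created event arrives.
function videoCreatedMapping(store: Store, event: { channelId: string }): void {
  const channel = store.get(event.channelId)
  if (!channel) throw new Error(`Channel not found: ${event.channelId}`)
  channel.videosCount += 1
  store.save(channel)
}

// In-memory mock store, as a unit test would set up - no DB required.
function makeMockStore(seed: Channel[]): Store {
  const table = new Map<string, Channel>()
  for (const c of seed) table.set(c.id, { ...c })
  return {
    get: (id) => table.get(id),
    save: (c) => { table.set(c.id, c) },
  }
}

// The "test": seed state, run the mapping, inspect the resulting state.
const store = makeMockStore([{ id: 'ch-1', videosCount: 0 }])
videoCreatedMapping(store, { channelId: 'ch-1' })
```

Tests of this shape run in milliseconds, compared to integration tests that need a full chain, indexer, and processor stack.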
It is very unfortunate that a developer or end-user who consumes QN content and encounters a bug in the mappings needs to wait for a fix from QN developers, and is usually blocked until the fix is implemented and deployed.
As @zeeshanakram3 has pointed out, we could increase the test coverage for the mappings. While that would definitely be beneficial, I'm afraid it would be really costly to cover all mappings. That being said, introducing unit tests would decrease development time as well as CI running time, compared to the integration tests we currently rely on exclusively to test the QN.
Many times we had a situation where an Atlas or Pioneer developer made a blockchain transaction that caused a crash in the processor (due to a bug). They then wished the transaction had never happened, because it blocked all the other work they could have been doing if the processor were still running.
When I was debugging one of the recent council mapping issues that happened on the current Olympia-net at quite a high block number, I periodically backed up my local processor database using pg_dump. E.g. I ended up having checkpoint DBs with processed data up to block 10,000, 50,000, etc. I could then swap the DB contents at any time to inspect the state, or start from a checkpoint with fixed mappings, saving a lot of the time it would take to reprocess everything.
That brought me to the idea that we could maybe automate this DB dump and recovery for our development servers. That would let us create checkpoints we can revert the QN state to. We would also need to find a way to revert blocks in the dev Joystream node to make this work. From what I found, it could be possible using Substrate's --revert command; or we could save the chain state to files in a similar way as the QN DB.
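A minimal sketch of what the automated part could look like: helpers that build the pg_dump/pg_restore invocations for a checkpoint at a given block height. The database name and `checkpoints/` file layout are assumptions for illustration, and a real setup would run these from cron or CI once the processor reports crossing the target height.

```typescript
// Build the command to dump the processor DB as a checkpoint file.
// --format=custom produces a compressed archive pg_restore can consume.
function checkpointDumpCmd(db: string, block: number): string {
  return `pg_dump --format=custom --file=checkpoints/${db}-block-${block}.dump ${db}`
}

// Build the command to restore the DB to a previously taken checkpoint.
// --clean drops existing objects before recreating them.
function checkpointRestoreCmd(db: string, block: number): string {
  return `pg_restore --clean --dbname=${db} checkpoints/${db}-block-${block}.dump`
}

const dump = checkpointDumpCmd('processor', 10000)
const restore = checkpointRestoreCmd('processor', 10000)
```

Pairing each dump with the corresponding chain checkpoint (via `--revert` or saved chain state) is what would make the rollback consistent end to end.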
I guess the biggest issue is the limited filtering possibilities @kdembler mentioned. This issue creates a lot of overhead in the form of what we lately tend to call flattening data - basically introducing duplicates of data in various places, thus breaking normal forms. If we expand the filtering options, I think the API is not so bad.
There are two main types of issues we are constantly encountering.
The first one is that the processor crashes on some bad implementation or an unexpected situation, and we are oblivious to it until some frontend (Atlas, Pioneer, etc.) user reports it.
To remedy this issue, we could expose the data saved in the processor's `processed_events_log` DB table by introducing a new `ProcessedEventsLog` entity to the schema, unmasking this currently private table. We could then set up a monitoring service that queries the Joystream node, indexer, and processor; compares their highest block numbers; and alerts us if the indexer or processor lags behind the node.
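The core check such a monitoring service would run could be sketched as below. Fetching the heights from the node, indexer, and processor endpoints is omitted, and the threshold value is an assumption:

```typescript
// Latest block heights reported by each component.
interface Heights { node: number; indexer: number; processor: number }

// Return a human-readable alert for every component lagging behind the
// chain head by more than maxLagBlocks; empty array means all healthy.
function detectLag(h: Heights, maxLagBlocks: number): string[] {
  const alerts: string[] = []
  if (h.node - h.indexer > maxLagBlocks) {
    alerts.push(`indexer lags by ${h.node - h.indexer} blocks`)
  }
  if (h.node - h.processor > maxLagBlocks) {
    alerts.push(`processor lags by ${h.node - h.processor} blocks`)
  }
  return alerts
}
```

Run on a short interval and wired to e.g. Discord or PagerDuty, this would let us detect a stalled processor before a frontend user does.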
The second issue is that, due to its linear nature, one bugged mapping crashes the whole processor and blocks any further processing, even of completely unrelated mappings. E.g. a broken council mapping will prevent the processing of storage events.
In some cases, the problem could be solved by adding extra metadata to event mapping definitions describing a dependency graph. Each event could have an associated category and a list of categories it depends on. Once a mapping throws an error, the processor could gracefully recover, block processing of further events in dependent categories, and report the issue.
Not only could production setups benefit from this feature; development instances could also use it to isolate currently unstable yet unrelated features.
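The skip decision for such category-based failure isolation could be sketched as follows. The metadata shape and category names are hypothetical (this is not a current Hydra feature): each mapping declares a category and its dependencies, and once a category has failed, events in it or in any category depending on it are skipped instead of halting the processor.

```typescript
// Hypothetical per-mapping metadata describing the dependency graph.
interface MappingMeta { category: string; dependsOn: string[] }

// An event is skipped if its own category has failed, or if it depends
// (directly) on a failed category; everything else keeps processing.
function shouldSkip(meta: MappingMeta, failed: Set<string>): boolean {
  return failed.has(meta.category) || meta.dependsOn.some((c) => failed.has(c))
}

// E.g. with 'council' failed, council and referendum mappings are skipped,
// while unrelated storage mappings continue to be processed:
const failed = new Set(['council'])
const councilBudget: MappingMeta = { category: 'council', dependsOn: [] }
const storageBag: MappingMeta = { category: 'storage', dependsOn: [] }
const referendum: MappingMeta = { category: 'referendum', dependsOn: ['council'] }
```

A full implementation would also need to handle transitive dependencies and record the skipped events so they can be replayed once the mapping is fixed.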
Looking at the list of my recent PRs, they contained a combination of:
To provide a highly available and performant service, we are running two clusters.
Switching between clusters requires manually updating the query.joystream.org CNAME record. This could be mitigated by running two QNs in separate namespaces in the same cluster, at the cost of a more complex setup.
If some records do not have deterministic IDs, does switching between clusters cause any odd behaviour on the client if the client passes an ID in a query?
It is not clear when it is safe to only re-process vs. re-index and re-process. On more than one occasion there were no type or schema changes (only mappings code changes), and it seemed reasonable to only reset the processor DB and re-process, but the processor then failed at some point in the mappings (always related to council/election).
Updating the production cluster without downtime is always a lengthy manual process:
1. Switch to the backup cluster
2. Build a new joystream/apps docker image
3. Deploy to k8s and update the AWS load balancer DNS records
4. Wait for sync
5. Switch back to production
6. Update the backup cluster
Given that the query-node lives in the monorepo, it can sometimes be problematic to update if other services from the same repo (argus, colossus) are running, but an easy workaround is to keep separate directories on the production machine.
The processor was much, much faster pre-Olympia. What happened? Is it due to processing every single block, or just the fact that there are more events to process and the mappings are slower (performing slow queries before updates)?
Indexing has always been quite slow. How can we speed it up? Is too much console logging slowing things down?
In general, my own debugging of actual queries is very limited, as I have little knowledge of the details of the schema or, in some cases, the business logic. So I am limited to ensuring queries do not fail; I cannot assess whether the actual results are "correct" (unless, of course, I go back and review the mappings implementation).
Debugging the operation of the clusters, i.e. checking logs and cluster status, is reasonably easy, but we have no downtime monitoring or alerting tools in place. I am currently experimenting with New Relic for the k8s cluster and Prometheus (for joystream-node), with mixed results.
I always seem to forget to update the Linode load balancer certificate in time, so I want to move away from it, possibly using an nginx/caddy server on a standalone host instead. This isn't an issue for the AWS k8s cluster.
We should send/aggregate the logs from the active query nodes to a central location that can be accessed by the team.
There needs to be a way to write unit tests for mappings. The only way we can currently test mappings is with integration tests, and we do not have full coverage there.
Question
Should we do this meeting sooner? Depends on how soon people can get their deliverables ready, see below.
Meeting
Tue 10th
Participants
Background
The Query Node (QN) is an instrumental piece of the infrastructure that not only powers key applications in the Joystream ecosystem, namely Pioneer and Atlas, but also powers key secondary infrastructure software like storage and content delivery nodes, as well as the CLI and other t
Current Problems
Here is an incomplete and unranked list of problems that seem to occur as part of the development and usage of the QN.
Note: This list reflects my point of view, which is not very granular or complete, and so it may need adjustments before the meeting.
1. Unreliable
It very frequently goes down at various stages of both development and production. In particular, bugs during development are very costly, as teams relying on it have to wait around - perhaps mocking APIs - while a QN developer takes a look, fixes the problem, and redeploys.
2. Impotent API
It cannot, due to Hydra constraints, offer the APIs that client teams either prefer or absolutely need for basic product features. This often requires complex coordination about which product features are required vs. nice to have, and exactly how to work around unavoidable limitations. We now frequently have to build redundant state into the API, which makes mappings harder to write correctly, further contributing to the prior problem.
3. Sync issues
This one may just be me misunderstanding
On occasion, it seems we have had problems caused by mistakes in deployments, in particular regarding from where to process for a given purpose.
4. Incident Management
It seems we rarely have errors detected by our own Jsgenesis personnel; it's always the community that finds the problem and has to alert someone on Discord, and then whoever catches the report ends up having to track down whoever is able to look into the problem.
5. Debugging
It seems it takes quite a bit of time to determine ex-post why exactly the QN went down.
Meeting Purpose
Map out and prioritise what changes we should do to
a. Hydra
b. QN development process
c. QN deployment process
d. QN debugging process
e. QN infrastructure routines
What we are looking for is to identify the really high-value and cheap interventions we can make right now, because there is other urgent work we need to commence imminently, specifically
Deliverables Deadline
Submit as reply to this issue no later than:
Friday 6th, 18:00 CET
Deliverables
For each person in the meeting: please review your past tasks, bugs, bug fixes, deployment problems, and pains dealing with the QN, and note which of buckets 1-5 each falls into. In particular perhaps
Agenda