Each section below represents at least one issue in this epic. Some tickets have already been created; for others we'll wait a little to see how things pan out.
Push down HCA specifics to a Bundle subclass (#4940, done)
The in-memory representation of metadata in Azul is heavily influenced by the old DCP/1 representation of bundles in the DSS, which distinguished between manifest entries and metadata. This needlessly complicates the AnVIL code and it would further complicate passing along verbatim metadata. We'll need to address this first.
This boils down to removing the .metadata_files and .manifest attributes from azul.indexer.Bundle and leaving the definition of any metadata-related attributes up to subclasses. The requirement is that the repository and metadata plugin for any given catalog agree on the data structure referenced by these attributes. This refactoring is complicated by the fact that the current class hierarchy first differentiates by repository (TDR vs DSS) and then by atlas (AnVIL vs HCA), whereas the introduction of metadata-related attributes should occur at the atlas level. To resolve this, we can swap the levels, use mixins, or accept some degree of duplication.
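As a rough illustration of the target shape, assuming hypothetical HCABundle and AnVILBundle subclasses with simplified attribute types (a sketch, not the actual implementation):

```python
from abc import ABC
from dataclasses import dataclass, field

JSON = dict[str, object]  # stand-in for Azul's JSON type alias


class Bundle(ABC):
    """
    Metadata-agnostic base class: no .manifest or .metadata_files here.
    The repository and metadata plugins of a catalog must agree on the
    shape of whatever metadata attributes a subclass defines.
    """


@dataclass
class HCABundle(Bundle):
    # HCA keeps the DCP/1-style split between manifest and metadata files
    manifest: list[JSON] = field(default_factory=list)
    metadata_files: dict[str, JSON] = field(default_factory=dict)


@dataclass
class AnVILBundle(Bundle):
    # AnVIL is free to use a different representation, e.g. an adjacency
    # list of entities keyed by entity reference (illustrative only)
    entities: dict[str, JSON] = field(default_factory=dict)
```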
~Since the AnVIL metadata plugin doesn't use the HCA Metadata API, the AnVIL repository plugin doesn't actually need to emulate the links.json structure, as it currently does. It can continue to use the adjacency list approach but should remove mentions of protocols and the HCA-specific schema cruft.~
Ensure that each Bundle instance is a complete subgraph (#4941, done)
… so that it even includes entities that are not transformed into contributions, as long as the discovery of those entities isn't prohibitively expensive and their cardinality isn't prohibitively high.
This is already the case for HCA since every HCA Bundle instance contains all entities of the respective subgraph, including the links entity defining the subgraph.
For AnVIL this may require following more FK relations during the subgraph traversal, and adding entities currently not indexed. The goal is for an AnVIL Bundle instance to include all schema entities in a given subgraph, and for these entities to reflect all significant columns of the corresponding TDR table row. It does not need to cover non-schema entities.
Spike to review the AnVIL TDR schema for unindexed entities, properties, or FK relations. Determine insignificant columns, if any, such as datarepo_row_id. Informally, an insignificant column is one that no external user of Terra would want to be present in their workspace after a hand-over.
Represent verbatim metadata as Document subtype Replica (#4942, sub-epic, done)
Each Replica instance will represent the significant content of a metadata entity verbatim.
Unlike Aggregate and Contribution, Replica instances will not be segregated into separate indices by entity type. The transformers of a metadata plugin will be extended to emit Replica instances, in addition to the Contribution instances they currently emit.
For most entity types, the metadata plugin will emit a replica and a contribution; for some entity types (e.g. Diagnoses) it emits only a replica.
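Replica documents have roughly the following shape (an illustrative sketch; only the .entity_type, .content and .hub_ids properties described next are part of the design, the example values are made up):

```python
# A replica document as it might be indexed into ES (values are made up)
replica = {
    'entity_type': 'biosample',  # e.g. the name of an AnVIL table
    'content': {                 # verbatim entity as supplied by the repository plugin
        'biosample_id': '…',
        'anatomical_site': '…',
    },
    'hub_ids': [                 # IDs of file entities the replica is derived
        '…',                     # from or that are derived from the replica
    ],
}
```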
The .entity_type property is a string classifying replicas by type. For AnVIL replicas this is the name of the AnVIL table from which the replica originates. For HCA this is the entity type as specified in the DCP/2 spec, which happens to also be a table name.
The .content property contains the verbatim metadata entity as supplied by the repository plugin. The .hub_ids property contains a list of IDs of
1) file entities the replica is derived from and
2) file entities that are derived from the replica.
For now only files are used as hubs, because all of Azul's current manifest formats are generated from the aggregate index for files. In the future we may very well add more types of hub entities. A replica's hub entities aren't necessarily confined to a single bundle. When a replica document is created, its hubs all come from a single bundle but as more bundles are indexed, the replica document is updated efficiently (see below) by adding the IDs of newly discovered hubs to the .hub_ids property, without having to resend the potentially large value of the .content property.
ES must be prevented from making any assumptions about replica content, from analyzing it, and from reordering array items or dictionary entries contained therein. The easiest way to do that is to map (in ES) the .content property of replica documents with object as its type and enabled set to false. If problems arise with that approach, an alternative would be to make content a string containing serialized JSON, and to map it as text with enabled and doc_values set to false.
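For illustration, the relevant part of the replica index mapping could look like this (the keyword types for the other properties are assumptions; only the disabled object mapping of content is prescribed above):

```python
# Sketch of the replica index mapping. Disabling the content object tells ES
# to keep it in _source verbatim without parsing, analyzing or indexing it.
replica_mappings = {
    'properties': {
        'entity_type': {'type': 'keyword'},
        'hub_ids': {'type': 'keyword'},
        'content': {
            'type': 'object',
            'enabled': False,
        },
    },
}
```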
Replica documents are content-addressed: a document's ID is the hash of the value of its .content property. This enables us to optimistically create replicas with PUT /<index>/_create/<_id> (or to PUT /<index>/_doc/<_id> with "op_type": "create"). If that request fails with 409, we know that 1) the replica already exists and 2) that its .content is up-to-date. We can then follow up with a POST /<index>/_update/<_id> to update just the .hub_ids property (see below) using a scripted update.
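A sketch of that write path, assuming the elasticsearch-py 8.x client (method signatures differ slightly in older client versions; index name, Painless script and hashing details are illustrative):

```python
import hashlib
import json

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch()    # connection configuration omitted
index = 'replica'       # hypothetical index name


def write_replica(replica: dict) -> None:
    # Content-addressed ID: hash of a canonical serialization of .content
    # (see the hashing spike below)
    doc_id = hashlib.sha256(
        json.dumps(replica['content'], sort_keys=True).encode()
    ).hexdigest()
    try:
        # PUT /<index>/_create/<_id>; fails with 409 if the document exists
        es.create(index=index, id=doc_id, document=replica)
    except ConflictError:
        # The replica exists and its .content is up to date; add any new hub
        # IDs with POST /<index>/_update/<_id> and a scripted (Painless) update
        es.update(
            index=index,
            id=doc_id,
            script={
                'source': '''
                    for (h in params.hubs) {
                        if (!ctx._source.hub_ids.contains(h)) {
                            ctx._source.hub_ids.add(h);
                        }
                    }
                ''',
                'params': {'hubs': replica['hub_ids']},
            },
        )
```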
[Edit @nadove-ucsc] TODO: What happens with stale replicas (same entity ID, but different content hash, due to outdated contents)? This scenario should only occur when we do incremental indexing, so we can probably ignore this for now. https://github.com/DataBiosphere/azul/issues/5565
Potential optimization: Check first with HEAD <index>/_doc/<_id> if the document already exists. If so update it, otherwise create it, optimistically as described above. This would save us from sending potentially large replicas only to discover that they already exist. The probability of this occurring, and the potential performance gain, depend on the subgraph overlap, the document size, ES latency and throughput etc. so it needs to be benchmarked.
Spike to define the hash implementation; it should ignore insignificant differences such as the ordering of dictionary keys.
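One possible hash implementation, hashing a canonical JSON serialization so that dictionary key order is insignificant while array order remains significant (a sketch for the spike, not a decision):

```python
import hashlib
import json


def content_hash(content) -> str:
    # sort_keys canonicalizes dictionary key order recursively; compact
    # separators and ensure_ascii pin down the byte representation.
    canonical = json.dumps(content,
                           sort_keys=True,
                           separators=(',', ':'),
                           ensure_ascii=True)
    return hashlib.sha256(canonical.encode('ascii')).hexdigest()


assert content_hash({'a': 1, 'b': 2}) == content_hash({'b': 2, 'a': 1})
```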
Create verbatim handover for AnVIL using replicas (#4943, sub-epic, in progress)
First, add test coverage for AnVIL: #5056 and #4606
Currently, all manifests generated by Azul are file-centric: the service applies the filters to the files index and transforms the matching file documents to entries of a manifest in the desired format. Some manifests efficiently page through matching files and create manifest partitions that are later concatenated to produce a single manifest object in the storage bucket. To implement verbatim manifests, we only need to determine the IDs of matching files and perform a "join" against the replica index. We load one page of matching files, and then execute another request using the terms query against the .hub_ids property of replica documents. The manifest is then composed from those replica documents.
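For illustration, the body of the second request could look like this (index layout, page size and pagination mechanics are assumptions):

```python
# One page of matching file document IDs from the files index ...
file_page = ['f1', 'f2', 'f3']

# ... joined against the replica index via a terms query on .hub_ids.
# The response is paged, too (see the next paragraph).
replica_query = {
    'query': {
        'terms': {
            'hub_ids': file_page,
        },
    },
    'size': 1000,
}
```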
We observe that a single file may be a hub of many replicas, and that a single replica may have many hubs. We'll address the former observation by paging through the response to the terms query. Each page of files therefore produces pages of replicas.
The latter observation, if unaddressed, could cause replicas to be emitted more than once. To prevent that, we need to keep track of replicas already emitted. The easiest way to do that is by maintaining the set of emitted replica IDs in memory. This implies that verbatim manifests can't be partitioned, because each partition is computed in a separate Lambda context, and the amount of state we can persist between partitions is limited to a few hundred KiB. Consequently, we have to generate an entire verbatim manifest in a single 15-minute Lambda invocation, which obviously imposes a potentially impractical upper bound on the manifest size.
The Broad's AvroPFB import imposes yet another constraint that forces the current (non-verbatim) handover implementation to create the manifest PFB in a single Lambda invocation: it requires that referenced entities are emitted before referencing entities. To address this constraint, the current handover implementation keeps the entire manifest in memory and performs a topological sort before writing the manifest. For that constraint to be relevant in the verbatim implementation of the handover manifest, the code would need to be aware of foreign key (FK) relationships between entities in the AnVIL schema, and declare them in the PFB schema that precedes the data in the Avro manifest. We will postpone FK-awareness until later, so that the initial implementation of the verbatim hand-over manifest will not need to satisfy Broad's topological ordering constraint. This means that the verbatim handover generation only needs to maintain a set of IDs in memory, rather than the entire output.
Side note: I don't think the topological ordering constraint represents a fair distribution of effort between Azul and Terra, or at least I'd like to hear concrete reasons why the constraint is warranted, and avoids undue complications on the Terra side. As we can see here, it does create significant complications on our end.
Optimization: We don't have to track the IDs of replicas with only one hub, since we know there are no other hubs that would cause their re-emission. This reduces the memory footprint of the set at the expense of a simple conditional.
Optimization: Instead of tracking replica IDs, we could also track the IDs of hubs that referred to emitted replicas. Consider a replica r1 derived from two files f1 and f1000 that both match the filter. Replica r1's .hub_ids is ["f1", "f1000"]. When paging through the files matching the filter, we'll see f1 first. When enumerating the associated replicas, we'll examine r1. At that point we know that we may encounter f1000 later, and with it r1. We'll put f1 into the set, indicating that we took care of all of f1's replicas. If we later do encounter f1000 and r1, we know we can skip r1. In the case where f1 does not match the filter, f1000 will be the first hub through which we encounter r1; f1 will be absent from the set, so we'll emit r1. We don't even need to add f1000 to the set because we know f1 can't come up anymore: we are enumerating files in order of ID and f1 < f1000. This optimization can be further improved by considering paging: f1 and f1000 could end up on the same page, in which case we may not even want to add f1 to the set. If r1 has three hubs, f1, f1000 and f2000, with f2000 not matching the filter, we would need to add f1000 to the set.
We'll implement the first optimization but postpone the latter, since it needs more thought and possibly a proof.
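A sketch of the de-duplication loop with the first optimization applied (replicas with a single hub are emitted without being tracked); the input is assumed to be pages of raw ES hits:

```python
from typing import Iterable, Iterator


def emit_replicas(replica_pages: Iterable[list[dict]]) -> Iterator[dict]:
    # Emit each replica at most once. A replica with a single hub cannot be
    # reached again through another hub, so its ID doesn't need to be tracked.
    emitted: set[str] = set()
    for page in replica_pages:
        for hit in page:
            replica_id = hit['_id']
            if replica_id in emitted:
                continue
            if len(hit['_source']['hub_ids']) > 1:
                emitted.add(replica_id)
            yield hit['_source']
```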
While reading replicas, the manifest generator analyzes their .entity_type and .content properties and builds a corresponding Avro schema on the fly. Each distinct value in .entity_type should produce an Avro schema type, and each key in .content becomes a property in that type. Since the Avro schema precedes the entities in the resulting AvroPFB file, this approach requires holding all replicas in memory. In the future we will consider an optimization in which the indexer (as opposed to the manifest generator) discovers the schema of the replicas and persists it, maybe as a special entity. We could generate the AvroPFB schema from the AnVIL schema, but the next section rules out that approach.
Once all replicas have been read, the manifest generator writes the static PFB schema, the dynamically built Avro schema, and the replicas.
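A rough sketch of the dynamic schema discovery (the actual PFB envelope has additional required structure; typing every property as an optional string is a simplifying assumption):

```python
from collections import defaultdict


def discover_avro_types(replicas: list[dict]) -> list[dict]:
    # Collect the union of .content keys per .entity_type ...
    keys_by_type: dict[str, set[str]] = defaultdict(set)
    for replica in replicas:
        keys_by_type[replica['entity_type']].update(replica['content'].keys())
    # ... and turn each entity type into an Avro record type
    return [
        {
            'type': 'record',
            'name': entity_type,
            'fields': [
                {'name': key, 'type': ['null', 'string'], 'default': None}
                for key in sorted(keys)
            ],
        }
        for entity_type, keys in keys_by_type.items()
    ]
```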
Include orphans in verbatim manifest for AnVIL (#6529)
This section was inserted later. Please refer to the description of the respective issue (https://github.com/DataBiosphere/azul/issues/6529) for the design specification.
Include non-schema replicas when not filtering by dataset (#4947)
Note that this section needs to be updated to reflect how we handle orphans, which will cover the indexing of the non-schema AnVIL tables, but not their inclusion when the LHS of the join is files (the …)
When indexing a subgraph, the AnVIL TDR repository plugin will parse the value of the source_datarepo_row_ids column of each entity table and fetch the associated table rows. Each value is a string like workspace_attributes:7a22b629-9d81-4e4d-9297-f9e44ed760bc. The first part before the colon is a table name while the second part is a foreign key. The plugin will collect these references, and, once all schema entities have been fetched, group these references by table name and issue separate queries, one per table and batch of N foreign keys for that table. The concrete data structure for holding the references to non-schema entities is left up to the implementer. When the repo plugin processes the results to these queries, it will inject the non-schema entities into a special .azul_source_datarepo_rows property of the schema entities that referred to them. This process efficiently resolves the references in the .source_datarepo_row_ids property of a schema entity into a verbatim representation of the referenced non-schema entity rows.
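A sketch of collecting and batching those references (function names, data structures and the batch size are illustrative; the actual queries are elided):

```python
from collections import defaultdict
from typing import Iterator


def collect_references(schema_rows: list[dict]) -> dict[str, set[str]]:
    # Group 'table_name:row_id' references by non-schema table name
    refs: dict[str, set[str]] = defaultdict(set)
    for row in schema_rows:
        for ref in row.get('source_datarepo_row_ids', []):
            table, _, row_id = ref.partition(':')
            refs[table].add(row_id)
    return refs


def batches(row_ids: set[str], n: int = 1000) -> Iterator[list[str]]:
    # One query per table and per batch of N foreign keys
    ids = sorted(row_ids)
    for i in range(0, len(ids), n):
        yield ids[i:i + n]
```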
The metadata plugin will emit a replica document for each entry in .azul_source_datarepo_rows of any schema entity, using the same .hub_ids as the replica of the schema entity, but with .entity_type set to the name of the non-schema table, i.e., what was found before the colon in the value of the .source_datarepo_row_ids column of the schema table containing the entity. The rest of the pipeline will work as intended. The manifest generator will emit these non-schema replicas and dynamically discover their shape.
Verbatim handover for HCA (#6121, sub-epic, in progress, some children are blockers on the corresponding AnVIL sub-epic #4943)
The prime directive should be for the handover to reconstitute the TDR snapshot's tables in the Terra workspace. The Terra workspace should be populated with all tables and columns, but only with the rows for matching files, the entities derived from them, and the entities they are derived from.
Generic, verbatim manifest for HCA (#6028, merged)
I'd also like to implement a verbatim manifest that's not in Terra's somewhat obscure AvroPFB format. A JSON or JSONL format seems appropriate.
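For illustration, each line of such a JSONL manifest could simply be one replica (the exact field selection is an assumption):

```python
# A single line of a hypothetical JSONL verbatim manifest
line = '{"entity_type": "biosample", "content": {"biosample_id": "…"}}'
```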
Capture HCA subgraph structure in verbatim manifest (#6073, in progress)
The HCA metadata plugin emits a replica for links.json that can be used to reconstitute the links table row.
For AnVIL, there is no links table. Instead the entity tables have FK columns. Azul retains them as part of the replicas. There still are subgraphs in AnVIL, but they are an ephemeral artifact of the AnVIL TDR repository plugin.
TBD: If the verbatim HCA handover is limited to matching files and connected replicas, some of the entities referenced by the .content column in the emitted links table rows will be absent from their respective entity table in the workspace, if they are part of the same subgraph as a matching file but not connected to that file.
Allow for entities of types other than files to serve as hubs (TODO: create ticket)
Really any entity should be able to serve as the parent end of the join against the replicas. This is lower priority.