Include orphans in verbatim manifest for AnVIL

hannes-ucsc commented 2 weeks ago

Definitions

All replicas will refer to the dataset (project) they are part of as one of their hubs.

TBD: Should we make the hub_ids field of each replica a list of entity references? I'm leaning towards not doing so because entity IDs are UUIDs and should be unique without the entity type qualifier. If we needed to look up entities by a replica's hub IDs, the qualifier would help in determining the target index. At the moment, such a lookup is not needed.

An orphan is an entity that does not materialize as an inner or outer entity in the contributions index. By that definition, AnVIL's supplementary files are not orphans. An orphan may have non-null references (foreign keys) to other entities, or other entities may refer to it, but if Azul does not follow those references, the entity is still considered orphaned under this definition.

Note that this definition of orphan is Azul-centric. A more universal definition would define them as entities that don't relate to a file, or, even more restrictive, that don't relate to any other entity. However, it is immaterial whether a relation exists or whether Azul just doesn't index it. We know that there are many snapshots with donor and biosample entities that don't relate to a file and the Azul-centric definition orphan includes these entities, and the resolution to this ticket will cover them as well.

While HCA may have orphans, this issue is only concerned with AnVIL orphans. The implementor is free to choose whether their changes increase the number of entities included in the verbatim manifests for HCA, their changes just mustn't reduce that number, and their choice should be driven by expediency.

AnVIL has study-specific tables that are not governed by the AnVIL table schema. Azul is not aware of the relations between these tables so all rows in these tables are orphans under the above definition.

The replica of an orphan will have no hub other than the dataset (optional: project) it is part of.

Service

The intent behind this issue is to include the replicas of orphans in verbatim manifests for AnVIL, if the user is filtering exclusively by dataset.

Currently, when a verbatim manifest is prepared, the filter is applied to the aggregate files index and the matching documents are retrieved. This is the left side of the join. Then all replicas referring to those files as hubs are retrieved, that's the right side of the join. For each file, we also determine its dataset (project) ID (the implicit hub) and use that to retrieve the dataset (project) replica (the implicit hub, somewhat of a misnomer). Note that this is not the same as retrieving all replicas that have the dataset (project) as a hub. Doing so would extend the set of returned replicas to that of all replicas of every matching file's dataset (project). That's not what we want when the left side of the join is the aggregate files index.

When the filter only specifies the dataset ID field, it will be applied to the datasets index as the left side of the join. Because every replica, orphaned or not, has the containing dataset's replica ID as a hub, this effectively yields all replicas in that dataset. We don't need to use the implict hub lookup in that case since the dataset replica refers to itself as a hub.

The query against the dataset index will need to enforce the source filter. Otherwise, we would return replicas from managed-access snapshots to unauthorized users. This is in contrast to the /index/datasets endpoint which is also backed by that index, but the source filter is not enforced when serving that endpoint. The endpoint only returns project aggregates, which are deemed public, even for managed-access snapshots.

Indexer

In order to populate the replica index with the replicas of orphans, we'll introduce a new type of "replica" bundle. I don't think that type will manifest explicitly as a new subclass of TDRAnvilBundleFQID, but simply as a set of instances of that class, sharing a particular combination of attributes, as defined below.

We generalize a TDRAnvilBundleFQID instance to represent a batch of rows in a particular table. The batch will be defined in terms of particular prefix of the datarepo_row_id of these rows. All rows in a batch will share the same prefix and will come from the same table. The name of the table will be represented by the bundle FQID's entity_type attribute. The batch prefix will be represented by the bundle's UUID and a new attribute called batch_prefix_length such that the first batch_prefix_length characters of the bundle UUID appear at the beginning of the datarepo_row_id column of every row in the batch. This idea was first proposed by Noa (@nadove-ucsc).

The entity_type attribute of TDRAnvilBundleFQID will be renamed to table_name and its type will be changed from BundleEntityType (an Enum subclass) to a plain string. BundleEntityType will be renamed to BundleType.

The table_name attribute of legacy (primary, supplementary, and DUOS) bundle FQIDs will be 'anvil_biosample' (BundleType.primary), 'anvil_file' (BundleType.supplementary) and 'anvil_dataset' (BundleType.duos), respectively.

The UUID of legacy bundles is derived bijectively from the datarepo_row_id by switching the UUID version. The batch_prefix_length attribute for legacy bundles will be fixed at 8. The actual batch prefix manifests as the first eight characters of the bundle's UUID. The batch size of legacy bundles is 1 and an assertion will fire if there are more than one row with that prefix in the respective table.

The batch prefix of a replica bundle is at least the partition prefix, with more characters added as needed to break up larger tables into multiple batches, one batch per bundle. The UUID of these replica bundles begins with the batch prefix, and has the same version as the legacy bundle UUIDs. All other "random" bits are 0. Legacy bundles don't contribute any entities to the index, only replicas, so their UUIDs will not persist.

The list_bundles endpoint will enumerate all tables in the snapshot, using the INFORMATION_SCHEMA or an BQ Python API method. For tables backing replica bundles, the method will then use the following query to determine the optimal batch size for the tables backing replica bundles. As mentioned above, the batch size for legacy bundles is fixed at 1.

DECLARE partition_prefix STRING;
DECLARE partition_prefix_len INT64;
SET partition_prefix = 'a';
SET partition_prefix_len = LENGTH(partition_prefix);
WITH
  counts AS (
    SELECT 
      SUBSTR(datarepo_row_id, partition_prefix_len + 1, 1) as p0, 
      SUBSTR(datarepo_row_id, partition_prefix_len + 2, 1) as p1, 
      SUBSTR(datarepo_row_id, partition_prefix_len + 3, 1) as p2, 
      SUBSTR(datarepo_row_id, partition_prefix_len + 4, 1) as p3, 
      COUNT(*) as num_rows
    FROM
      `datarepo-3594cc06.ANVIL_HPRC_20240401_ANV5_202406261913.file_inventory`
    WHERE STARTS_WITH(datarepo_row_id, partition_prefix)
    GROUP BY ROLLUP (p0, p1, p2, p3)
  )
SELECT 
  partition_prefix_len + LENGTH(CONCAT(
    IFNULL(p0,''),
    IFNULL(p1,''),
    IFNULL(p2,''),
    IFNULL(p3,'')
  )) AS batch_prefix_length, 
  ROUND(AVG(num_rows),1) as average_batch_size,
  COUNT(*) as num_batches
FROM 
  counts
GROUP BY
  batch_prefix_length

The central element of this query is the GROUP BY ROLLUP clause of the counts query. It essentially tries a fixed number of different batch prefix lengths and compiles the number of matching rows for each such prefix. A subsequent query averages the bundle count over all prefixes of a specific length. The example query above hard-codes the maximum prefix length that it tries to be 4. The implementor should write a function that generates the query given a table name, a partition prefix and a maximum batch prefix length.

Given the result of the query, the list_bundles method will pick a batch prefix length that produces batches of around 256 rows. The method then emits another query against the same table to enumerate all batches, and emits the bundle FQIDs for them.

Indexing a supplementary bundle, i.e. a bundle with an FQID whose table_name is equal to BundleType.supplementary.value, will upsert a file replica, and, if the file row's is_supplementary column is set, contribute a file entity and a dataset (project) entity. Both the file entity and the associated dataset (project) entity will be hubs of the file replica.

Indexing a DUOS bundle, i.e. a bundle with an FQID whose table_name is equal to BundleType.duos.value will upsert a dataset replica and contribute the DUOS description to the dataset entity. The dataset replica will list the dataset entity as a hub.

TBD: DUOS description in dataset replica? Special replica for DUOS description?

Indexing a primary bundle, i.e., a bundle with an FQID whose table_name is equal to BundleType.primary.value will contribute the entities discovered during graph traversal starting at that biosample (including the dataset), and upsert a replica of each of the contributed entities, except for the dataset. The dataset (project) entity will be added as a hub of every replica by this bundle.

Any other bundle is a replica bundle and indexing such a bundle will upsert a replica for every row in the batch. Each replica will have exactly one hub: the dataset. This means that the datarepo_row_id of the dataset entity needs to be known which implies that the indexer needs to fetch singleton row from the anvil_dataset table in the snapshot.

Eliminate `RepositoryPlugin.list_partitions` (#6531)

The list_partitions method will be removed. The average duration of the Lambda invocation that handles the reindex notification is under ten seconds in prod and anvilprod so there is a lot of leeway for not worrying about the partition size too much. We will stop tracking the number of bundles in each snapshot in the source tree, eliminate (move to the attic) the update_subgraphs_count.py script and any logic in environment.py files that is used to compute the common and partition prefix lengths. Typically, source specs won't reflect the common prefix or the partition prefix length, while still allowing for manual overrides (see below). Instead, we'll add a repository plugin method called partition_source that determines both values dynamically. This will be done a separate, first PR. The method will be called by the remote reindex functionality in azulclient.py. The initial implementation of that method for HCA will count the rows in the links table. The initial implementation of that method for AnVIL will count the rows in the anvil_biosample table plus the supplementary rows in the anvil_file table.

In main deployments, the method returns a common prefix of '' and heuristically computes a partition prefix length that yields partitions of up to 8,192 bundles. In other deployments (sandbox or personal), the methods heuristically computes a common prefix length containing between 2 and 32 bundles, and a partition prefix length of 1 (yes, some of the partitions could be empty, but we want the IT to cover multiple partitions in those deployments). While the length of the common prefix is computed, the concrete digits should be derived pseudorandomly but deterministically, e.g., by hashing the source spec.

The post_deploy make target must ensure that, if a common prefix is defined, the source contains between 1 and 512 bundles. The operator manual should document that sources failing that requirement can be fixed by manually overriding the common prefix in environment.py to be one digit shorter than the computed prefix if that prefix is empty, and one digit longer if the computed prefix contains more bundles than allowed.

When a subsequent PR modifies the list_bundles implementation for AnVIL as described above, the partition_source method for AnVIL will count all rows in all tables and estimate a common prefix and partition prefix length accordingly.