DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

tdr_anvil's fetch_bundle lacks test coverage #5046

Closed dsotirho-ucsc closed 7 months ago

dsotirho-ucsc commented 1 year ago

…due to TinyQuery not supporting the WITH clause

dsotirho-ucsc commented 1 year ago

@danielsotirhos to fix title & description.

achave11-ucsc commented 11 months ago

Spike to evaluate https://github.com/goccy/bigquery-emulator as a possible alternative to TinyQuery.

hannes-ucsc commented 11 months ago

That emulator (BigQuery Emulator, BQE) uses https://github.com/goccy/go-zetasqlite which states that WITH is supported.

I can't re-enable the skipped test indexer.test_anvil.TestAnvilIndexer.test_fetch_bundle because the can it uses (826dea02-e274-affe-aabc-eb3db63ad068.tables.tdr.json) is gone.

I can make the tests in indexer.test_tdr.TestTDRHCAPlugin pass individually with BQE and the following patch:

Index: test/indexer/test_tdr.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/test_tdr.py b/test/indexer/test_tdr.py
--- a/test/indexer/test_tdr.py  (revision 1d654657258bff023c6f46a9d74f7412b71a4b0a)
+++ b/test/indexer/test_tdr.py  (date 1699949394655)
@@ -2,20 +2,15 @@
     ABCMeta,
     abstractmethod,
 )
-from collections.abc import (
-    Iterable,
-    Mapping,
-)
-from datetime import (
-    timezone,
-)
 import json
 from operator import (
     attrgetter,
 )
 from typing import (
     Callable,
+    ClassVar,
     Generic,
+    Sequence,
     Type,
 )
 import unittest
@@ -32,26 +27,25 @@
 from furl import (
     furl,
 )
+from google.api_core.client_options import (
+    ClientOptions,
+)
+from google.auth.credentials import (
+    AnonymousCredentials,
+)
+from google.cloud import (
+    bigquery,
+)
 from more_itertools import (
-    first,
-    one,
     take,
 )
-from tinyquery import (
-    tinyquery,
-)
-from tinyquery.context import (
-    Column,
-)
 import urllib3
 from urllib3 import (
     HTTPResponse,
 )

 from azul import (
-    RequirementError,
     cache,
-    cached_property,
     config,
 )
 from azul.auth import (
@@ -59,7 +53,6 @@
 )
 from azul.bigquery import (
     BigQueryRow,
-    BigQueryRows,
 )
 from azul.indexer import (
     SourcedBundleFQID,
@@ -68,6 +61,9 @@
     configure_test_logging,
     get_test_logger,
 )
+from azul.oauth2 import (
+    ScopedCredentials,
+)
 from azul.plugins.repository import (
     tdr_anvil,
     tdr_hca,
@@ -86,10 +82,12 @@
     TDRClient,
     TDRSourceSpec,
     TerraClient,
+    TerraCredentialsProvider,
 )
 from azul.types import (
     JSON,
     JSONs,
+    reify,
 )
 from azul_test_case import (
     AnvilTestCase,
@@ -97,6 +95,9 @@
     DCP2TestCase,
     TDRTestCase,
 )
+from docker_container_test_case import (
+    DockerContainerTestCase,
+)
 from indexer import (
     BUNDLE,
     CannedBundleTestCase,
@@ -110,54 +111,52 @@
     configure_test_logging(log)

-@attr.s(kw_only=True, auto_attribs=True, frozen=True)
+class MockTDRClient(TDRClient):
+    netloc: ClassVar[tuple[str, int] | None] = None
+
+    def _bigquery(self, project: str) -> bigquery.Client:
+        # noinspection PyArgumentList
+        host, port = self.netloc
+        options = ClientOptions(api_endpoint=f'http://{host}:{port}')
+        # noinspection PyTypeChecker
+        return bigquery.Client(project=project,
+                               credentials=AnonymousCredentials(),
+                               client_options=options)
+
+
+@attr.s(frozen=True, auto_attribs=True)
+class MockCredentials(AnonymousCredentials):
+    project_id: str
+
+
+@attr.s(frozen=True, auto_attribs=True)
+class MockCredentialsProvider(TerraCredentialsProvider):
+    project_id: str
+
+    def insufficient_access(self, resource: str) -> Exception:
+        pass
+
+    def scoped_credentials(self) -> ScopedCredentials:
+        # noinspection PyTypeChecker
+        return MockCredentials(self.project_id)
+
+    def oauth2_scopes(self) -> Sequence[str]:
+        pass
+
+
 class MockPlugin(TDRPlugin, metaclass=ABCMeta):
-    tinyquery: tinyquery.TinyQuery
-
-    def _run_sql(self, query: str) -> BigQueryRows:
-        log.debug('Query: %r', query)
-        columns = self.tinyquery.evaluate_query(query).columns
-        num_rows = one(set(map(lambda c: len(c.values), columns.values())))
-        # Tinyquery returns naive datetime objects from a TIMESTAMP type column,
-        # so we manually set the tzinfo back to UTC on these values.
-        # https://github.com/Khan/tinyquery/blob/9382b18b/tinyquery/runtime.py#L215
-        for key, column in columns.items():
-            if column.type == 'TIMESTAMP':
-                values = [
-                    None if d is None else d.replace(tzinfo=timezone.utc)
-                    for d in column.values
-                ]
-                columns[key] = Column(type=column.type,
-                                      mode=column.mode,
-                                      values=values)
-        for i in range(num_rows):
-            yield {k[1]: v.values[i] for k, v in columns.items()}
-
-    def _full_table_name(self, source: TDRSourceSpec, table_name: str) -> str:
-        return source.bq_name + '.' + table_name
+    netloc: str
+    project_id: str

     @classmethod
-    def _in(cls,
-            columns: tuple[str, ...],
-            values: Iterable[tuple[str, ...]]
-            ) -> str:
-        return ' OR '.join(
-            '(' + ' AND '.join(
-                f'{column} = {inner_value}'
-                for column, inner_value in zip(columns, value)
-            ) + ')'
-            for value in values
-        )
-
+    def _tdr(cls):
+        credentials_provider = MockCredentialsProvider(cls.project_id)
+        tdr = MockTDRClient(credentials_provider=credentials_provider)
+        MockTDRClient.netloc = cls.netloc
+        return tdr

-class TestMockPlugin(AzulUnitTestCase):

-    def test_in(self):
-        self.assertEqual('(foo = "abc" AND bar = 123) OR (foo = "def" AND bar = 456)',
-                         MockPlugin._in(('foo', 'bar'), [('"abc"', '123'), ('"def"', '456')]))
-
-
-class TDRPluginTestCase(TDRTestCase, CannedBundleTestCase[BUNDLE], Generic[BUNDLE]):
+class TDRPluginTestCase(TDRTestCase, CannedBundleTestCase[BUNDLE], DockerContainerTestCase, Generic[BUNDLE]):

     @classmethod
     @abstractmethod
@@ -168,18 +167,28 @@

     _drs_domain_name = str(mock_service_url.netloc)

-    @cached_property
-    def tinyquery(self) -> tinyquery.TinyQuery:
-        return tinyquery.TinyQuery()
-
     @cache
     def plugin_for_source_spec(self, source_spec) -> TDRPlugin:
         # noinspection PyAbstractClass
         class Plugin(MockPlugin, self._plugin_cls()):
-            pass
+            netloc = self.netloc
+            project_id = self.source.spec.project
+
+        return Plugin(sources={source_spec})

-        return Plugin(sources={source_spec},
-                      tinyquery=self.tinyquery)
+    netloc: tuple[str, int] | None = None
+
+    @classmethod
+    def setUpClass(cls):
+        super().setUpClass()
+        cls.netloc = cls._create_container(image='ghcr.io/goccy/bigquery-emulator:arm64',
+                                           container_port=9050,
+                                           command=[
+                                               '--log-level=debug',
+                                               '--port=9050',
+                                               '--project=' + cls.source.spec.project,
+                                               '--dataset=' + cls.source.spec.bq_name
+                                           ])

     def _make_mock_tdr_tables(self,
                               bundle_fqid: SourcedBundleFQID) -> None:
@@ -196,32 +205,34 @@
                                 table_name: str,
                                 rows: JSONs) -> None:
         schema = self._bq_schema(rows[0])
-        columns = {column['name'] for column in schema}
+        columns = {column.name for column in schema}
+        json_type = reify(JSON)

-        def dump_row(row: JSON) -> str:
+        def dump_row(row: JSON) -> JSON:
             row_columns = row.keys()
             # TinyQuery's errors are typically not helpful in debugging missing/
             # extra columns in the row JSON.
             assert row_columns == columns, row_columns
-            row = {
-                column_name: (json.dumps(column_value)
-                              if isinstance(column_value, Mapping) else
-                              column_value)
+            return {
+                column_name: (
+                    json.dumps(column_value)
+                    if isinstance(column_value, json_type) else
+                    column_value
+                )
                 for column_name, column_value in row.items()
             }
-            return json.dumps(row)

-        self.tinyquery.load_table_from_newline_delimited_json(
-            table_name=f'{source.bq_name}.{table_name}',
-            schema=json.dumps(schema),
-            table_lines=map(dump_row, rows)
-        )
+        plugin = self.plugin_for_source_spec(source)
+        bq = plugin.tdr._bigquery(source.project)
+        table_name = plugin._full_table_name(source, table_name)
+        # noinspection PyTypeChecker
+        table = bigquery.Table(table_name, schema)
+        bq.create_table(table=table)
+        bq.insert_rows(table=table, selected_fields=schema, rows=map(dump_row, rows))

-    def _bq_schema(self, row: BigQueryRow) -> JSONs:
+    def _bq_schema(self, row: BigQueryRow) -> list[bigquery.SchemaField]:
         return [
-            dict(name=k,
-                 type='TIMESTAMP' if k == 'version' else 'STRING',
-                 mode='NULLABLE')
+            bigquery.SchemaField(name=k, field_type='TIMESTAMP' if k == 'version' else 'STRING')
             for k, v in row.items()
         ]

@@ -255,7 +266,7 @@

     def test_list_bundles(self):
         source = self.source
-        current_version = '2001-01-01T00:00:00.000001Z'
+        current_version = '2001-01-01T00:00:00.100001Z'
         links_ids = ['42-abc', '42-def', '42-ghi', '86-xyz']
         self._make_mock_entity_table(source=source.spec,
                                      table_name='links',
@@ -279,35 +290,6 @@
         # Test valid links
         self._test_fetch_bundle(bundle, load_tables=True)

-        # Directly modify the canned tables to test invalid links not present
-        # in the canned bundle.
-        dataset = self.source.spec.bq_name
-        links_table = self.tinyquery.tables_by_name[dataset + '.links']
-        links_content_column = links_table.columns['content'].values
-        links_content = json.loads(one(links_content_column))
-        link = first(link
-                     for link in links_content['links']
-                     if link['link_type'] == 'supplementary_file_link')
-        # Test invalid entity_type in supplementary_file_link
-        assert link['entity']['entity_type'] == 'project'
-        link['entity']['entity_type'] = 'cell_suspension'
-        # Update table
-        links_content_column[0] = json.dumps(links_content)
-        # Invoke code under test
-        with self.assertRaises(RequirementError):
-            self._test_fetch_bundle(bundle,
-                                    load_tables=False)  # Avoid resetting tables to canned state
-
-        # Undo previous change
-        link['entity']['entity_type'] = 'project'
-        # Test invalid entity_id in supplementary_file_link
-        link['entity']['entity_id'] += '_wrong'
-        # Update table
-        links_content_column[0] = json.dumps(links_content)
-        # Invoke code under test
-        with self.assertRaises(RequirementError):
-            self._test_fetch_bundle(bundle, load_tables=False)
-
     @patch('azul.plugins.repository.tdr_hca.Plugin._find_upstream_bundles')
     def test_subgraph_stitching(self, _mock_find_upstream_bundles):
         downstream_uuid = '4426adc5-b3c5-5aab-ab86-51d8ce44dfbe'
@@ -343,6 +325,7 @@
         emulated_bundle = plugin.fetch_bundle(test_bundle.fqid)

         self.assertEqual(test_bundle.fqid, emulated_bundle.fqid)
+        assert isinstance(emulated_bundle, TDRHCABundle)
         # Manifest and metadata should both be sorted by entity UUID
         self.assertEqual(test_bundle.manifest, emulated_bundle.manifest)
         self.assertEqual(test_bundle.metadata_files, emulated_bundle.metadata_files)
Index: test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.dss.hca.json
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.dss.hca.json b/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.dss.hca.json
--- a/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.dss.hca.json   (revision 1d654657258bff023c6f46a9d74f7412b71a4b0a)
+++ b/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.dss.hca.json   (date 1699949111726)
@@ -207,7 +207,7 @@
         {
             "name": "library_preparation_protocol_0.json",
             "uuid": "2945bb1f-90de-42a3-afa1-f57a62c853f0",
-            "version": "2019-09-20T13:43:52.078000Z",
+            "version": "2019-09-20T13:43:52.178000Z",
             "content-type": "application/json; dcp-type=\"metadata/protocol\"",
             "size": 1109,
             "indexed": true,
@@ -231,7 +231,7 @@
         {
             "name": "dissociation_protocol_0.json",
             "uuid": "eaf15851-97e3-4e4b-b81b-0e625098f4d5",
-            "version": "2019-09-20T13:43:52.077000Z",
+            "version": "2019-09-20T13:43:52.177000Z",
             "content-type": "application/json; dcp-type=\"metadata/protocol\"",
             "size": 830,
             "indexed": true,
@@ -375,7 +375,7 @@
         {
             "name": "IDC9_L002_R1.fastq.gz",
             "uuid": "292b2faf-0db3-4ba1-a6a5-cb7bdfa9313d",
-            "version": "2019-09-24T09:35:08.057668Z",
+            "version": "2019-09-24T09:35:08.157668Z",
             "content-type": "application/gzip; dcp-type=data",
             "size": 1297795360,
             "indexed": false,
@@ -435,7 +435,7 @@
         {
             "name": "IDC9_L004_I1.fastq.gz",
             "uuid": "f0011946-dbea-4f87-9858-5e3fd32a9829",
-            "version": "2019-09-24T09:35:09.072703Z",
+            "version": "2019-09-24T09:35:09.172703Z",
             "content-type": "application/gzip; dcp-type=data",
             "size": 439983276,
             "indexed": false,
@@ -1371,4 +1371,4 @@
             ]
         }
     }
-}
\ No newline at end of file
+}
Index: src/azul/types.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/types.py b/src/azul/types.py
--- a/src/azul/types.py (revision 1d654657258bff023c6f46a9d74f7412b71a4b0a)
+++ b/src/azul/types.py (date 1699927355757)
@@ -131,7 +131,7 @@

 def reify(t):
     """
-    Given a parameterized ``Union`` or ``Optional`` construct, return a tuple of
+    Given a parameterized type construct, return a tuple of
     subclasses of ``type`` representing all possible alternatives that can pass
     for that construct at runtime. The return value is meant to be used as the
     second argument to the ``isinstance`` or ``issubclass`` built-ins.
@@ -157,6 +157,15 @@
     >>> isinstance({}, reify(AnyJSON))
     True

+    >>> isinstance({}, reify(JSON))
+    True
+
+    >>> isinstance([], reify(JSON))
+    False
+
+    >>> isinstance([], reify(JSONs))
+    True
+
     >>> from collections import Counter
     >>> issubclass(Counter, reify(AnyJSON))
     True
@@ -188,21 +197,24 @@
     """
     # While `int | str` constructs a `UnionType` instance, `Union[str, int]`
     # constructs an instance of `Union`, so we need to handle both.
-    if get_origin(t) in (UnionType, Union):
+    origin = get_origin(t)
+    if origin in (UnionType, Union):
         def f(t):
             for a in get_args(t):
-                if get_origin(a) in (UnionType, Union):
+                o = get_origin(a)
+                if o in (UnionType, Union):
                     # handle Union of Union
                     yield from f(a)
                 else:
-                    o = get_origin(a)
                     yield a if o is None else o

         return tuple(OrderedSet(f(t)))
-    elif t.__module__ != 'typing':
-        return t
-    else:
+    elif origin is not None:
+        return origin
+    elif t.__module__ == 'typing':
         raise ValueError('Not a reifiable generic type', t)
+    else:
+        return t

 def get_generic_type_params(cls: type[Generic],
Index: test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tables.tdr.json
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tables.tdr.json b/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tables.tdr.json
--- a/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tables.tdr.json    (revision 1d654657258bff023c6f46a9d74f7412b71a4b0a)
+++ b/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tables.tdr.json    (date 1699949219040)
@@ -418,7 +418,7 @@
                         "sha256": "125e5c181744e2dacf0b156d4c0b82ce771701947b26ca738689edddfc3af97e",
                         "s3_etag": "31985ad7f32b053fa97f05216c6e805e-20",
                         "file_name": "IDC9_L002_R1.fastq.gz",
-                        "file_version": "2019-09-24T09:35:08.057668Z",
+                        "file_version": "2019-09-24T09:35:08.157668Z",
                         "file_id": "292b2faf-0db3-4ba1-a6a5-cb7bdfa9313d",
                         "content_type": "application/gzip",
                         "size": 1297795360
@@ -633,7 +633,7 @@
                         "sha256": "8af97d885b995e7239a409f490e915f4754e7d1902f4a620f464995a422fb61f",
                         "s3_etag": "20497e6cbc5f671fa94cf69bf0febdad-7",
                         "file_name": "IDC9_L004_I1.fastq.gz",
-                        "file_version": "2019-09-24T09:35:09.072703Z",
+                        "file_version": "2019-09-24T09:35:09.172703Z",
                         "file_id": "f0011946-dbea-4f87-9858-5e3fd32a9829",
                         "content_type": "application/gzip",
                         "size": 439983276
@@ -861,7 +861,7 @@
             "rows": [
                 {
                     "library_preparation_protocol_id": "2945bb1f-90de-42a3-afa1-f57a62c853f0",
-                    "version": "2019-09-20T13:43:52.078000Z",
+                    "version": "2019-09-20T13:43:52.178000Z",
                     "content": {
                         "describedBy": "https://schema.humancellatlas.org/type/protocol/sequencing/6.2.0/library_preparation_protocol",
                         "schema_type": "protocol",
@@ -940,7 +940,7 @@
             "rows": [
                 {
                     "dissociation_protocol_id": "eaf15851-97e3-4e4b-b81b-0e625098f4d5",
-                    "version": "2019-09-20T13:43:52.077000Z",
+                    "version": "2019-09-20T13:43:52.177000Z",
                     "content": {
                         "describedBy": "https://schema.humancellatlas.org/type/protocol/biomaterial_collection/6.2.0/dissociation_protocol",
                         "schema_type": "protocol",
@@ -1388,4 +1388,4 @@
             ]
         }
     }
-}
\ No newline at end of file
+}
Index: test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tdr.hca.json
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tdr.hca.json b/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tdr.hca.json
--- a/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tdr.hca.json   (revision 1d654657258bff023c6f46a9d74f7412b71a4b0a)
+++ b/test/indexer/data/1b6d8348-d6e9-406a-aa6a-7ee886e52bf9.tdr.hca.json   (date 1699949111722)
@@ -157,7 +157,7 @@
         {
             "name": "IDC9_L002_R1.fastq.gz",
             "uuid": "292b2faf-0db3-4ba1-a6a5-cb7bdfa9313d",
-            "version": "2019-09-24T09:35:08.057668Z",
+            "version": "2019-09-24T09:35:08.157668Z",
             "content-type": "application/gzip; dcp-type=data",
             "size": 1297795360,
             "indexed": false,
@@ -171,7 +171,7 @@
         {
             "name": "library_preparation_protocol_0.json",
             "uuid": "2945bb1f-90de-42a3-afa1-f57a62c853f0",
-            "version": "2019-09-20T13:43:52.078000Z",
+            "version": "2019-09-20T13:43:52.178000Z",
             "content-type": "application/json; dcp-type=\"metadata/protocol\"",
             "size": 931,
             "indexed": true,
@@ -498,7 +498,7 @@
         {
             "name": "dissociation_protocol_0.json",
             "uuid": "eaf15851-97e3-4e4b-b81b-0e625098f4d5",
-            "version": "2019-09-20T13:43:52.077000Z",
+            "version": "2019-09-20T13:43:52.177000Z",
             "content-type": "application/json; dcp-type=\"metadata/protocol\"",
             "size": 702,
             "indexed": true,
@@ -523,7 +523,7 @@
         {
             "name": "IDC9_L004_I1.fastq.gz",
             "uuid": "f0011946-dbea-4f87-9858-5e3fd32a9829",
-            "version": "2019-09-24T09:35:09.072703Z",
+            "version": "2019-09-24T09:35:09.172703Z",
             "content-type": "application/gzip; dcp-type=data",
             "size": 439983276,
             "indexed": false,
@@ -1572,4 +1572,4 @@
             }
         }
     }
-}
\ No newline at end of file
+}

There's a bug in the emulator that removes leading zeros from the microseconds part of timestamps. The patch addresses that by eliminating any such leading zeros from the `version` and `file_version` properties in the cans. The tests need to be run individually because they don't delete the tables during `tearDown`. The patch also removes some direct table manipulation in one of the tests; look for the comment "Directly modify the canned tables to test invalid links not present".
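
A hypothetical helper (not part of the Azul codebase) illustrating the workaround: a canned version like `…:08.057668Z` would round-trip through the emulator as `…:08.57668Z` and no longer match, so the patch ensures no canned version has a microseconds part starting with a zero.

```python
import re

# Detect ISO-8601 versions whose six-digit microseconds part starts with a
# zero; those are the values affected by the emulator's leading-zero bug.
def has_leading_zero_microseconds(version: str) -> bool:
    match = re.fullmatch(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.(\d{6})Z',
                         version)
    assert match is not None, version
    return match.group(1).startswith('0')

assert has_leading_zero_microseconds('2019-09-24T09:35:08.057668Z')      # old, affected
assert not has_leading_zero_microseconds('2019-09-24T09:35:08.157668Z')  # patched, safe
```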

I had to apply the following patch to BQE to build it locally:

Index: Makefile
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/Makefile b/Makefile
--- a/Makefile  (revision 8ccde288d63846122e085433e0a03edfadba361c)
+++ b/Makefile  (date 1699943060496)
@@ -1,8 +1,13 @@
 VERSION ?= latest
 REVISION := $(shell git rev-parse --short HEAD)
 UNAME_OS := $(shell uname -s)
-ifneq ($(UNAME_OS),Darwin)
-   STATIC_LINK_FLAG := -linkmode external -extldflags "-static"
+UNAME_ARCH := $(shell uname -m)
+STATIC_LINK_FLAG := -linkmode external -extldflags "-static"
+ifeq ($(UNAME_OS),Darwin)
+   STATIC_LINK_FLAG :=
+endif
+ifneq (,$(filter $(UNAME_ARCH),arm64 aarch64))
+   STATIC_LINK_FLAG :=
 endif

 emulator/build:

The build command is docker buildx build --tag 122796619775.dkr.ecr.us-east-1.amazonaws.com/ghcr.io/goccy/bigquery-emulator:arm64 and the build takes a while (15 to 20 minutes). On amd64, a custom build will not be needed. I initially thought that I needed one for arm64 on my M1 Mac because I saw BQE panic in early tests and assumed that the panic was due to the architecture mismatch. That turned out to be false, so the only advantage of building for arm64 is a potential performance improvement, which I did not measure since I never tried the stock amd64 image from DockerHub.

Next step is to determine how much work it would be to restore the missing can for TestAnvilIndexer.test_fetch_bundle.

hannes-ucsc commented 11 months ago

… and to come up with a simple reproduction of the BQE bug with leading zeros in timestamp microseconds so we can file an issue against BQE.

nadove-ucsc commented 11 months ago

Assignee to locate missing can.

nadove-ucsc commented 11 months ago

I cannot find the missing can anywhere in the git history. The test has always been disabled, so I suspect the can never existed, or at least was never committed to develop.

achave11-ucsc commented 11 months ago

Assignee to determine next steps.

hannes-ucsc commented 11 months ago

A new can will have to be created. I made this part of the #2693 epic because we have a coverage blind spot for AnVIL in general, and it would be reckless to move forward on verbatim manifests, or on PFB manifests for AnVIL, without any AnVIL test coverage.

hannes-ucsc commented 11 months ago

I'd like to take a stab at this.

hannes-ucsc commented 9 months ago

The merged PR #5833 partially addresses this issue by switching from TinyQuery to bigquery-emulator. The approved, but not yet merged, PR #5862 increases test coverage to include the stitching query. There are now two FIXMEs left referring to this issue:

image

Assignee to resolve both of them, either in a single PR or in separate ones.

hannes-ucsc commented 9 months ago

Until PR https://github.com/DataBiosphere/azul/pull/5862 is merged, the feature branch(es) for the PR(s) resolving the above FIXMEs should probably be based on develop but temporarily include the commits from the PR branch for #5862.

hannes-ucsc commented 9 months ago

Security review: The first PR (#5833) adds a Docker image. The image is built on GitHub Actions and pushed to GitHub's registry at ghcr.io. Like all other images used by Azul, it is mirrored to ECR and scanned by Amazon Inspector. Docker Scout reports 1 M and 27 L vulnerabilities.

image

Inspector reports 1 H and 1 M.

image

I've created https://github.com/DataBiosphere/azul/issues/5905 to ensure that the image is maintained as part of the biweekly upgrades.

hannes-ucsc commented 9 months ago

For the demo, show the increased unit test coverage on codecov.io.

nadove-ucsc commented 9 months ago

Follow-up work: https://github.com/DataBiosphere/azul/issues/5934, https://github.com/DataBiosphere/azul/issues/5935