DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Indexer ignores inconsistent project inner entities #6552

Open dsotirho-ucsc opened 2 months ago

dsotirho-ucsc commented 2 months ago

The hits[].projects values in a /index/{entity_type} response come from one bundle per hit, and are not an aggregate from all the bundles for a given project.

For example, imagine multiple bundles for a project, each adding a new file, and each carrying different project metadata. In the /index/files response for this project, the hits[].projects.… values will then differ from one hit to the next.

Ideally the hits[].projects values would be the same for all hits, and be an aggregate from all bundles for the project.
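
As a minimal sketch of the desired behavior (not Azul's actual aggregation code), the short names from every bundle of a project could be collected into one sorted, de-duplicated list that is then attached to every hit. The helper name `aggregate_project_short_names` and the flat `(project_id, short_name)` input are assumptions made purely for illustration.

```python
from collections import defaultdict


def aggregate_project_short_names(bundles):
    """
    Hypothetical helper: map each project ID to the sorted set of short names
    found across all of its bundles.

    `bundles` is an iterable of (project_id, project_short_name) pairs, a
    simplified stand-in for the real bundle metadata.
    """
    names_by_project = defaultdict(set)
    for project_id, short_name in bundles:
        names_by_project[project_id].add(short_name)
    return {
        project_id: sorted(names)
        for project_id, names in names_by_project.items()
    }


# Several bundles for the same project, with differing short names
bundles = [
    ('90bf705c-d891-5ce2-aa54-094488b445c6', 'Covid19PBMC'),
    ('90bf705c-d891-5ce2-aa54-094488b445c6', 'Covid19PBMC_a'),
    ('90bf705c-d891-5ce2-aa54-094488b445c6', 'Covid19PBMC_b'),
    ('90bf705c-d891-5ce2-aa54-094488b445c6', 'Covid19PBMC_c'),
    ('90bf705c-d891-5ce2-aa54-094488b445c6', 'Covid19PBMC'),
]
aggregated = aggregate_project_short_names(bundles)
```

With this shape, every hit for the project would carry the same list, regardless of which bundle contributed the hit.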

The patch below modifies an existing test (TestSchemaTestDataCannedBundle.test_project_cell_count) and its canned bundles to demonstrate the issue.

Index: test/service/test_response.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/service/test_response.py b/test/service/test_response.py
--- a/test/service/test_response.py (revision f42a38ece0f5a61dac81daa6942492145a9fdf2e)
+++ b/test/service/test_response.py (date 1725918186731)
@@ -2502,6 +2502,13 @@
                 actual_cell_counts = []
                 for hit in response_json['hits']:
                     project = one(hit['projects'])
+                    self.assertEqual(project['projectShortname'],
+                                     [
+                                         'Covid19PBMC',
+                                         'Covid19PBMC_a',
+                                         'Covid19PBMC_b',
+                                         'Covid19PBMC_c'
+                                         ])
                     actual_cell_counts.append(project['estimatedCellCount'])
                 self.assertEqual(expected_cell_counts[entity_type],
                                  actual_cell_counts)
Index: test/indexer/data/1f6afb64-fa14-5c6f-a474-a742540108a3.dss.hca.json
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/data/1f6afb64-fa14-5c6f-a474-a742540108a3.dss.hca.json b/test/indexer/data/1f6afb64-fa14-5c6f-a474-a742540108a3.dss.hca.json
--- a/test/indexer/data/1f6afb64-fa14-5c6f-a474-a742540108a3.dss.hca.json   (revision f42a38ece0f5a61dac81daa6942492145a9fdf2e)
+++ b/test/indexer/data/1f6afb64-fa14-5c6f-a474-a742540108a3.dss.hca.json   (date 1725910164699)
@@ -107,7 +107,7 @@
             ],
             "project_core": {
                 "project_description": "The COVID-19 pandemic, caused by SARS coronavirus 2 (SARS-CoV-2), has resulted in excess morbidity and mortality as well as economic decline. To characterise the systemic host immune response to SARS-CoV-2, we performed single-cell RNA-sequencing coupled with analysis of cell surface proteins, providing molecular profiling of over 800,000 peripheral blood mononuclear cells from a cohort of 130 patients with COVID-19. Our cohort, from three UK centres, spans the spectrum of clinical presentations and disease severities ranging from asymptomatic to critical. Three control groups were included: healthy volunteers, patients suffering from a non-COVID-19 severe respiratory illness and healthy individuals administered with intravenous lipopolysaccharide to model an acute inflammatory response. Full single cell transcriptomes coupled with quantification of 188 cell surface proteins, and T and B lymphocyte antigen receptor repertoires have provided several insights into COVID-19: 1. a new non-classical monocyte state that sequesters platelets and replenishes the alveolar macrophage pool; 2. platelet activation accompanied by early priming towards megakaryopoiesis in immature haematopoietic stem/progenitor cells and expansion of megakaryocyte-primed progenitors; 3. increased clonally expanded CD8+ effector:effector memory T cells, and proliferating CD4+ and CD8+ T cells in patients with more severe disease; and 4. relative increase of IgA plasmablasts in asymptomatic stages that switches to expansion of IgG plasmablasts and plasma cells, accompanied with higher incidence of BCR sharing, as disease severity increases. All data and analysis results are available for interrogation and data mining through an intuitive web portal. 
Together, these data detail the cellular processes present in peripheral blood during an acute immune response to COVID-19, and serve as a template for multi-omic single cell data integration across multiple centers to rapidly build powerful resources to help combat diseases such as COVID-19.",
-                "project_short_name": "Covid19PBMC",
+                "project_short_name": "Covid19PBMC_a",
                 "project_title": "The cellular immune response to COVID-19 deciphered by single cell multi-omics across three UK centres"
             },
             "provenance": {
Index: test/indexer/data/3ac62c33-93e1-56b4-b857-59497f5d942d.dss.hca.json
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/data/3ac62c33-93e1-56b4-b857-59497f5d942d.dss.hca.json b/test/indexer/data/3ac62c33-93e1-56b4-b857-59497f5d942d.dss.hca.json
--- a/test/indexer/data/3ac62c33-93e1-56b4-b857-59497f5d942d.dss.hca.json   (revision f42a38ece0f5a61dac81daa6942492145a9fdf2e)
+++ b/test/indexer/data/3ac62c33-93e1-56b4-b857-59497f5d942d.dss.hca.json   (date 1725910164699)
@@ -107,7 +107,7 @@
             ],
             "project_core": {
                 "project_description": "The COVID-19 pandemic, caused by SARS coronavirus 2 (SARS-CoV-2), has resulted in excess morbidity and mortality as well as economic decline. To characterise the systemic host immune response to SARS-CoV-2, we performed single-cell RNA-sequencing coupled with analysis of cell surface proteins, providing molecular profiling of over 800,000 peripheral blood mononuclear cells from a cohort of 130 patients with COVID-19. Our cohort, from three UK centres, spans the spectrum of clinical presentations and disease severities ranging from asymptomatic to critical. Three control groups were included: healthy volunteers, patients suffering from a non-COVID-19 severe respiratory illness and healthy individuals administered with intravenous lipopolysaccharide to model an acute inflammatory response. Full single cell transcriptomes coupled with quantification of 188 cell surface proteins, and T and B lymphocyte antigen receptor repertoires have provided several insights into COVID-19: 1. a new non-classical monocyte state that sequesters platelets and replenishes the alveolar macrophage pool; 2. platelet activation accompanied by early priming towards megakaryopoiesis in immature haematopoietic stem/progenitor cells and expansion of megakaryocyte-primed progenitors; 3. increased clonally expanded CD8+ effector:effector memory T cells, and proliferating CD4+ and CD8+ T cells in patients with more severe disease; and 4. relative increase of IgA plasmablasts in asymptomatic stages that switches to expansion of IgG plasmablasts and plasma cells, accompanied with higher incidence of BCR sharing, as disease severity increases. All data and analysis results are available for interrogation and data mining through an intuitive web portal. 
Together, these data detail the cellular processes present in peripheral blood during an acute immune response to COVID-19, and serve as a template for multi-omic single cell data integration across multiple centers to rapidly build powerful resources to help combat diseases such as COVID-19.",
-                "project_short_name": "Covid19PBMC",
+                "project_short_name": "Covid19PBMC_b",
                 "project_title": "The cellular immune response to COVID-19 deciphered by single cell multi-omics across three UK centres"
             },
             "provenance": {
Index: test/indexer/data/4da04038-adab-59a9-b6c4-3a61242cc972.dss.hca.json
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/data/4da04038-adab-59a9-b6c4-3a61242cc972.dss.hca.json b/test/indexer/data/4da04038-adab-59a9-b6c4-3a61242cc972.dss.hca.json
--- a/test/indexer/data/4da04038-adab-59a9-b6c4-3a61242cc972.dss.hca.json   (revision f42a38ece0f5a61dac81daa6942492145a9fdf2e)
+++ b/test/indexer/data/4da04038-adab-59a9-b6c4-3a61242cc972.dss.hca.json   (date 1725910164700)
@@ -107,7 +107,7 @@
             ],
             "project_core": {
                 "project_description": "The COVID-19 pandemic, caused by SARS coronavirus 2 (SARS-CoV-2), has resulted in excess morbidity and mortality as well as economic decline. To characterise the systemic host immune response to SARS-CoV-2, we performed single-cell RNA-sequencing coupled with analysis of cell surface proteins, providing molecular profiling of over 800,000 peripheral blood mononuclear cells from a cohort of 130 patients with COVID-19. Our cohort, from three UK centres, spans the spectrum of clinical presentations and disease severities ranging from asymptomatic to critical. Three control groups were included: healthy volunteers, patients suffering from a non-COVID-19 severe respiratory illness and healthy individuals administered with intravenous lipopolysaccharide to model an acute inflammatory response. Full single cell transcriptomes coupled with quantification of 188 cell surface proteins, and T and B lymphocyte antigen receptor repertoires have provided several insights into COVID-19: 1. a new non-classical monocyte state that sequesters platelets and replenishes the alveolar macrophage pool; 2. platelet activation accompanied by early priming towards megakaryopoiesis in immature haematopoietic stem/progenitor cells and expansion of megakaryocyte-primed progenitors; 3. increased clonally expanded CD8+ effector:effector memory T cells, and proliferating CD4+ and CD8+ T cells in patients with more severe disease; and 4. relative increase of IgA plasmablasts in asymptomatic stages that switches to expansion of IgG plasmablasts and plasma cells, accompanied with higher incidence of BCR sharing, as disease severity increases. All data and analysis results are available for interrogation and data mining through an intuitive web portal. 
Together, these data detail the cellular processes present in peripheral blood during an acute immune response to COVID-19, and serve as a template for multi-omic single cell data integration across multiple centers to rapidly build powerful resources to help combat diseases such as COVID-19.",
-                "project_short_name": "Covid19PBMC",
+                "project_short_name": "Covid19PBMC_c",
                 "project_title": "The cellular immune response to COVID-19 deciphered by single cell multi-omics across three UK centres"
             },
             "provenance": {

Console log:

2024-09-09 15:22:50,584    INFO Thread-1 test.app_test_case: Serving on http://127.0.0.1:53714
2024-09-09 15:22:50,586    INFO Thread-2 (process_request_thread) azul.chalice: Received GET request for '/health/basic', with {"query": null, "headers": {"host": "127.0.0.1:53714", "user-agent": "python-requests/2.32.2", "accept-encoding": "gzip, deflate, br", "accept": "*/*", "connection": "keep-alive"}}.
2024-09-09 15:22:50,586    INFO Thread-2 (process_request_thread) azul.chalice: Did not authenticate request.
2024-09-09 15:22:50,586   DEBUG Thread-2 (process_request_thread) azul.chalice: Returning 200 response with headers {"Access-Control-Allow-Origin": "*", "Access-Control-Allow-Headers": "Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key", "Strict-Transport-Security": "max-age=31536000; includeSubDomains", "X-Content-Type-Options": "nosniff", "X-Frame-Options": "DENY", "Cache-Control": "no-store"}. See next line for the first 1024 characters of the body.
{"up": true}
127.0.0.1 - - [09/Sep/2024 15:22:50] "GET /health/basic HTTP/1.1" 200 -
2024-09-09 15:22:50,587    INFO Thread-3 (process_request_thread) azul.chalice: Received GET request for '/index/files', with {"query": {"catalog": "test"}, "headers": {"host": "127.0.0.1:53714", "user-agent": "python-requests/2.32.2", "accept-encoding": "gzip, deflate, br", "accept": "*/*", "connection": "keep-alive"}}.
2024-09-09 15:22:50,587    INFO Thread-3 (process_request_thread) azul.chalice: Did not authenticate request.
2024-09-09 15:22:50,590    INFO Thread-3 (process_request_thread) elasticsearch: Making POST request to http://127.0.0.1:53704/azul_v2_dummy_test_files_aggregate/_search
2024-09-09 15:22:50,590    INFO Thread-3 (process_request_thread) elasticsearch: … with request body b'{"post_filter":{"bool":{"must":[{"constant_score":{"filter":{"terms":{"sources.id.keyword":["42848d8f-ecdc-5b32-a667-a7b5aedf...'
2024-09-09 15:22:50,635    INFO Thread-3 (process_request_thread) elasticsearch: Got 200 response after 0.045s from POST to http://127.0.0.1:53704/azul_v2_dummy_test_files_aggregate/_search
2024-09-09 15:22:50,635    INFO Thread-3 (process_request_thread) elasticsearch: … with response body '{"took":37,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":10,"relation"…'
2024-09-09 15:22:50,639   DEBUG Thread-3 (process_request_thread) azul.chalice: Returning 200 response with headers {"Access-Control-Allow-Origin": "*", "Access-Control-Allow-Headers": "Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key", "Strict-Transport-Security": "max-age=31536000; includeSubDomains", "X-Content-Type-Options": "nosniff", "X-Frame-Options": "DENY", "Cache-Control": "no-store"}. See next line for the first 1024 characters of the body.
{"pagination": {"count": 10, "total": 10, "size": 10, "next": null, "previous": null, "pages": 1, "sort": "fileName", "order": "asc"}, "termFacets": {"organ": {"terms": [{"term": "blood", "count": 10}], "total": 10, "type": "terms"}, "sampleEntityType": {"terms": [{"term": "specimens", "count": 10}], "total": 10, "type": "terms"}, "project": {"terms": [{"term": "Covid19PBMC", "count": 7, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}, {"term": "Covid19PBMC_a", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}, {"term": "Covid19PBMC_b", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}, {"term": "Covid19PBMC_c", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}], "total": 10, "type": "terms"}, "sampleDisease": {"terms": [{"term": "COVID-19", "count": 10}], "total": 10, "type": "terms"}, "nucleicAcidSource": {"terms": [{"term": "single cell", "count": 5}, {"term": "single nucleus", "count": 4}, {"term": null, "count": 1}], "total": 10, "type": "terms"},
127.0.0.1 - - [09/Sep/2024 15:22:50] "GET /index/files?catalog=test HTTP/1.1" 200 -
SubTest failure: Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.11.9/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/Users/daniel/.pyenv/versions/3.11.9/lib/python3.11/unittest/case.py", line 538, in subTest
    yield
  File "/Users/daniel/repo/azul1/test/service/test_response.py", line 2505, in test_project_cell_count
    self.assertEqual(project['projectShortname'],
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pycharm/teamcity/diff_tools.py", line 33, in _patched_equals
    old(self, first, second, msg)
AssertionError: Lists differ: ['Covid19PBMC'] != ['Covid19PBMC', 'Covid19PBMC_a', 'Covid19PBMC_b', 'Covid19PBMC_c']

Second list contains 3 additional elements.
First extra element 1:
'Covid19PBMC_a'

- ['Covid19PBMC']
+ ['Covid19PBMC', 'Covid19PBMC_a', 'Covid19PBMC_b', 'Covid19PBMC_c']

2024-09-09 15:22:50,644    INFO Thread-4 (process_request_thread) azul.chalice: Received GET request for '/index/samples', with {"query": {"catalog": "test"}, "headers": {"host": "127.0.0.1:53714", "user-agent": "python-requests/2.32.2", "accept-encoding": "gzip, deflate, br", "accept": "*/*", "connection": "keep-alive"}}.
2024-09-09 15:22:50,644    INFO Thread-4 (process_request_thread) azul.chalice: Did not authenticate request.
2024-09-09 15:22:50,646    INFO Thread-4 (process_request_thread) elasticsearch: Making POST request to http://127.0.0.1:53704/azul_v2_dummy_test_samples_aggregate/_search
2024-09-09 15:22:50,646    INFO Thread-4 (process_request_thread) elasticsearch: … with request body b'{"post_filter":{"bool":{"must":[{"constant_score":{"filter":{"terms":{"sources.id.keyword":["42848d8f-ecdc-5b32-a667-a7b5aedf...'
2024-09-09 15:22:50,667    INFO Thread-4 (process_request_thread) elasticsearch: Got 200 response after 0.021s from POST to http://127.0.0.1:53704/azul_v2_dummy_test_samples_aggregate/_search
2024-09-09 15:22:50,667    INFO Thread-4 (process_request_thread) elasticsearch: … with response body '{"took":15,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":3,"relation":…'
2024-09-09 15:22:50,668   DEBUG Thread-4 (process_request_thread) azul.chalice: Returning 200 response with headers {"Access-Control-Allow-Origin": "*", "Access-Control-Allow-Headers": "Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key", "Strict-Transport-Security": "max-age=31536000; includeSubDomains", "X-Content-Type-Options": "nosniff", "X-Frame-Options": "DENY", "Cache-Control": "no-store"}. See next line for the first 1024 characters of the body.
{"pagination": {"count": 3, "total": 3, "size": 10, "next": null, "previous": null, "pages": 1, "sort": "sampleId", "order": "asc"}, "termFacets": {"organ": {"terms": [{"term": "blood", "count": 3}], "total": 3, "type": "terms"}, "sampleEntityType": {"terms": [{"term": "specimens", "count": 3}], "total": 3, "type": "terms"}, "project": {"terms": [{"term": "Covid19PBMC", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}, {"term": "Covid19PBMC_a", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}, {"term": "Covid19PBMC_c", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}], "total": 3, "type": "terms"}, "sampleDisease": {"terms": [{"term": "COVID-19", "count": 3}], "total": 3, "type": "terms"}, "nucleicAcidSource": {"terms": [{"term": "single cell", "count": 2}, {"term": "single nucleus", "count": 1}], "total": 3, "type": "terms"}, "assayType": {"terms": [{"term": null, "count": 3}], "total": 0, "type": "terms"}, "instrumentManufacturerModel": {"terms": [{"term
127.0.0.1 - - [09/Sep/2024 15:22:50] "GET /index/samples?catalog=test HTTP/1.1" 200 -
SubTest failure: Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.11.9/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/Users/daniel/.pyenv/versions/3.11.9/lib/python3.11/unittest/case.py", line 538, in subTest
    yield
  File "/Users/daniel/repo/azul1/test/service/test_response.py", line 2505, in test_project_cell_count
    self.assertEqual(project['projectShortname'],
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pycharm/teamcity/diff_tools.py", line 33, in _patched_equals
    old(self, first, second, msg)
AssertionError: Lists differ: ['Covid19PBMC_c'] != ['Covid19PBMC', 'Covid19PBMC_a', 'Covid19PBMC_b', 'Covid19PBMC_c']

First differing element 0:
'Covid19PBMC_c'
'Covid19PBMC'

Second list contains 3 additional elements.
First extra element 1:
'Covid19PBMC_a'

- ['Covid19PBMC_c']
+ ['Covid19PBMC', 'Covid19PBMC_a', 'Covid19PBMC_b', 'Covid19PBMC_c']

2024-09-09 15:22:50,670    INFO Thread-5 (process_request_thread) azul.chalice: Received GET request for '/index/projects', with {"query": {"catalog": "test"}, "headers": {"host": "127.0.0.1:53714", "user-agent": "python-requests/2.32.2", "accept-encoding": "gzip, deflate, br", "accept": "*/*", "connection": "keep-alive"}}.
2024-09-09 15:22:50,670    INFO Thread-5 (process_request_thread) azul.chalice: Did not authenticate request.
2024-09-09 15:22:50,672    INFO Thread-5 (process_request_thread) elasticsearch: Making POST request to http://127.0.0.1:53704/azul_v2_dummy_test_projects_aggregate/_search
2024-09-09 15:22:50,672    INFO Thread-5 (process_request_thread) elasticsearch: … with request body b'{"post_filter":{"bool":{}},"aggs":{"sourceId":{"filter":{"bool":{}},"aggs":{"myTerms":{"terms":{"field":"sources.id.keyword",...'
2024-09-09 15:22:50,685    INFO Thread-5 (process_request_thread) elasticsearch: Got 200 response after 0.013s from POST to http://127.0.0.1:53704/azul_v2_dummy_test_projects_aggregate/_search
2024-09-09 15:22:50,685    INFO Thread-5 (process_request_thread) elasticsearch: … with response body '{"took":9,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"…'
2024-09-09 15:22:50,686   DEBUG Thread-5 (process_request_thread) azul.chalice: Returning 200 response with headers {"Access-Control-Allow-Origin": "*", "Access-Control-Allow-Headers": "Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key", "Strict-Transport-Security": "max-age=31536000; includeSubDomains", "X-Content-Type-Options": "nosniff", "X-Frame-Options": "DENY", "Cache-Control": "no-store"}. See next line for the first 1024 characters of the body.
{"pagination": {"count": 1, "total": 1, "size": 10, "next": null, "previous": null, "pages": 1, "sort": "projectTitle", "order": "asc"}, "termFacets": {"organ": {"terms": [{"term": "blood", "count": 1}], "total": 1, "type": "terms"}, "sampleEntityType": {"terms": [{"term": "specimens", "count": 1}], "total": 1, "type": "terms"}, "project": {"terms": [{"term": "Covid19PBMC", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}], "total": 1, "type": "terms"}, "sampleDisease": {"terms": [{"term": "COVID-19", "count": 1}], "total": 1, "type": "terms"}, "nucleicAcidSource": {"terms": [{"term": "single cell", "count": 1}, {"term": "single nucleus", "count": 1}], "total": 1, "type": "terms"}, "assayType": {"terms": [{"term": null, "count": 1}], "total": 0, "type": "terms"}, "instrumentManufacturerModel": {"terms": [{"term": "EFO_0008637", "count": 1}], "total": 1, "type": "terms"}, "institution": {"terms": [{"term": "Newcastle University", "count": 1}], "total": 1, "type": "terms"}, "donorDisease": {"t
127.0.0.1 - - [09/Sep/2024 15:22:50] "GET /index/projects?catalog=test HTTP/1.1" 200 -
SubTest failure: Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.11.9/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/Users/daniel/.pyenv/versions/3.11.9/lib/python3.11/unittest/case.py", line 538, in subTest
    yield
  File "/Users/daniel/repo/azul1/test/service/test_response.py", line 2505, in test_project_cell_count
    self.assertEqual(project['projectShortname'],
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pycharm/teamcity/diff_tools.py", line 33, in _patched_equals
    old(self, first, second, msg)
AssertionError: 'Covid19PBMC' != ['Covid19PBMC', 'Covid19PBMC_a', 'Covid19PBMC_b', 'Covid19PBMC_c']

2024-09-09 15:22:50,689    INFO Thread-6 (process_request_thread) azul.chalice: Received GET request for '/index/bundles', with {"query": {"catalog": "test"}, "headers": {"host": "127.0.0.1:53714", "user-agent": "python-requests/2.32.2", "accept-encoding": "gzip, deflate, br", "accept": "*/*", "connection": "keep-alive"}}.
2024-09-09 15:22:50,689    INFO Thread-6 (process_request_thread) azul.chalice: Did not authenticate request.
2024-09-09 15:22:50,692    INFO Thread-6 (process_request_thread) elasticsearch: Making POST request to http://127.0.0.1:53704/azul_v2_dummy_test_bundles_aggregate/_search
2024-09-09 15:22:50,692    INFO Thread-6 (process_request_thread) elasticsearch: … with request body b'{"post_filter":{"bool":{"must":[{"constant_score":{"filter":{"terms":{"sources.id.keyword":["42848d8f-ecdc-5b32-a667-a7b5aedf...'
2024-09-09 15:22:50,708    INFO Thread-6 (process_request_thread) elasticsearch: Got 200 response after 0.016s from POST to http://127.0.0.1:53704/azul_v2_dummy_test_bundles_aggregate/_search
2024-09-09 15:22:50,708    INFO Thread-6 (process_request_thread) elasticsearch: … with response body '{"took":12,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":7,"relation":…'
2024-09-09 15:22:50,711   DEBUG Thread-6 (process_request_thread) azul.chalice: Returning 200 response with headers {"Access-Control-Allow-Origin": "*", "Access-Control-Allow-Headers": "Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key", "Strict-Transport-Security": "max-age=31536000; includeSubDomains", "X-Content-Type-Options": "nosniff", "X-Frame-Options": "DENY", "Cache-Control": "no-store"}. See next line for the first 1024 characters of the body.
{"pagination": {"count": 7, "total": 7, "size": 10, "next": null, "previous": null, "pages": 1, "sort": "bundleVersion", "order": "desc"}, "termFacets": {"organ": {"terms": [{"term": "blood", "count": 7}], "total": 7, "type": "terms"}, "sampleEntityType": {"terms": [{"term": "specimens", "count": 7}], "total": 7, "type": "terms"}, "project": {"terms": [{"term": "Covid19PBMC", "count": 4, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}, {"term": "Covid19PBMC_a", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}, {"term": "Covid19PBMC_b", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}, {"term": "Covid19PBMC_c", "count": 1, "projectId": ["90bf705c-d891-5ce2-aa54-094488b445c6"]}], "total": 7, "type": "terms"}, "sampleDisease": {"terms": [{"term": "COVID-19", "count": 7}], "total": 7, "type": "terms"}, "nucleicAcidSource": {"terms": [{"term": "single cell", "count": 4}, {"term": "single nucleus", "count": 2}, {"term": null, "count": 1}], "total": 7, "type": "terms"}, "as
127.0.0.1 - - [09/Sep/2024 15:22:50] "GET /index/bundles?catalog=test HTTP/1.1" 200 -
SubTest failure: Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.11.9/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/Users/daniel/.pyenv/versions/3.11.9/lib/python3.11/unittest/case.py", line 538, in subTest
    yield
  File "/Users/daniel/repo/azul1/test/service/test_response.py", line 2505, in test_project_cell_count
    self.assertEqual(project['projectShortname'],
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pycharm/teamcity/diff_tools.py", line 33, in _patched_equals
    old(self, first, second, msg)
AssertionError: Lists differ: ['Covid19PBMC'] != ['Covid19PBMC', 'Covid19PBMC_a', 'Covid19PBMC_b', 'Covid19PBMC_c']

Second list contains 3 additional elements.
First extra element 1:
'Covid19PBMC_a'

- ['Covid19PBMC']
+ ['Covid19PBMC', 'Covid19PBMC_a', 'Covid19PBMC_b', 'Covid19PBMC_c']

2024-09-09 15:22:50,712   DEBUG MainThread test.app_test_case: Tearing down server thread …
2024-09-09 15:22:51,196    INFO MainThread elasticsearch: Making HEAD request to http://127.0.0.1:53704/azul_v2_dummy_test_files
2024-09-09 15:22:51,196    INFO MainThread elasticsearch: … without request body

One or more subtests failed
Failed subtests list: (entity_type='files'), (entity_type='samples'), (entity_type='projects'), (entity_type='bundles')

hannes-ucsc commented 2 months ago

Each bundle contributes different files. The project inner entities in those contributions are inconsistent, but there is only one contribution to each file. Azul only considers the contributions to a given outer entity (a file, in this case), so there are no inconsistencies among the contributions to that file, since there is only one such contribution.
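
The reasoning above can be sketched with a simplified model. This is not how Azul represents contributions: the `Contribution` tuple, its field names, and the bundle, file and project identifiers are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Contribution(NamedTuple):
    # Hypothetical, simplified model of one bundle's contribution to an entity
    bundle: str
    outer_entity: tuple  # (entity_type, entity_id)
    project_short_name: str  # value carried by the project inner entity


contributions = [
    # Each file receives exactly one contribution
    Contribution('bundle_1', ('files', 'file_1'), 'Covid19PBMC_a'),
    Contribution('bundle_2', ('files', 'file_2'), 'Covid19PBMC_b'),
    # The project receives one contribution per bundle
    Contribution('bundle_1', ('projects', 'project_1'), 'Covid19PBMC_a'),
    Contribution('bundle_2', ('projects', 'project_1'), 'Covid19PBMC_b'),
]

# Group the inner entity values by the outer entity they were contributed to
values_by_outer = defaultdict(set)
for c in contributions:
    values_by_outer[c.outer_entity].add(c.project_short_name)

# An outer entity is inconsistent only if its own contributions disagree
inconsistent = sorted(
    outer for outer, values in values_by_outer.items() if len(values) > 1
)
```

In this model only the project outer entity ends up in `inconsistent`. Each file sees a single value, even though the values differ between files.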

Each bundle also contributes to the project outer entity, and there are inconsistencies among the latest contributions to that entity, so the indexer should detect that and fail. It seems that this is a bug in reconcile_inner_entities and we should test that assumption by applying the patch and stepping through that method. It's also possible that that method isn't even invoked when the inner entity type equals that of the outer entity. That would also constitute a bug.
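
For illustration only, the kind of check the indexer is expected to perform could look like the sketch below. It is not the body of reconcile_inner_entities, whose actual signature and logic are not shown in this issue; the function name `reconcile_or_fail` and the plain-dict representation of inner entities are assumptions.

```python
def reconcile_or_fail(entity_id, inner_entities):
    """
    Hypothetical check: return the single agreed-upon inner entity, or raise
    if the contributions to the given outer entity disagree.
    """
    if not inner_entities:
        raise ValueError(f'No contributions for {entity_id!r}')
    first, *rest = inner_entities
    for other in rest:
        if other != first:
            raise ValueError(
                f'Inconsistent inner entities for {entity_id!r}: '
                f'{first!r} != {other!r}'
            )
    return first


# Consistent contributions reconcile to a single inner entity
consistent = reconcile_or_fail('project_1', [
    {'project_short_name': 'Covid19PBMC'},
    {'project_short_name': 'Covid19PBMC'},
])

# Inconsistent contributions make the check fail
try:
    reconcile_or_fail('project_1', [
        {'project_short_name': 'Covid19PBMC_a'},
        {'project_short_name': 'Covid19PBMC_b'},
    ])
except ValueError as e:
    error = str(e)
else:
    error = None
```

Stepping through the real method with the patch applied would show whether a comparable check is reached when the inner entity type equals the outer entity type.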

At the moment we don't observe these inconsistencies in the wild, only in canned bundles that we modified inconsistently, so this is lower priority.