DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Non-determinisms in service responses #3891

Open hannes-ucsc opened 2 years ago

hannes-ucsc commented 2 years ago

Comparing prod and prod2 as part of #3782, @achave11 found some insignificant differences in the /index/project and /index/summary endpoint responses for otherwise identical catalogs. I think it was the ordering of contributors. To be confirmed.

melainalegaspi commented 2 years ago

Spike to identify the insignificant differences between the dumps for dcp12 from prod and prod2.

hannes-ucsc commented 2 years ago

The summary responses are identical for dcp1 and dcp12. For dcp13, there are significant differences in the summary response AND insignificant ones. So it seems that the non-determinism in the summary response is conditional upon the presence of significant differences. @achave11 will post the summary dumps for dcp13 to illustrate.

achave11-ucsc commented 2 years ago

prod-dcp13-summary-dump.json.zip prod2-dcp13-summary-dump.json.zip

achave11-ucsc commented 2 years ago

Another cause of the non-determinism was the missing/required projects, as addressed by #3896.

melainalegaspi commented 2 years ago

@achave11 to spike again and post index/projects dumps for dcp12 in prod and prod2.

achave11-ucsc commented 2 years ago

Project dumps for catalog dcp12 in both prod and prod2 … projects-dcp12-prod.json.zip projects-dcp12-prod2.json.zip … as Hannes identified, the order of contributors is not identical.

$ diff projects-dcp12-prod.json projects-dcp12-prod2.json | head -n 38
31,32c31,32
<           "sourceId": "a71a8575-1ba0-4895-b8b9-3685a8c056d1",
<           "sourceSpec": "tdr:tdr-fp-fea71bda:snapshot/hca_prod_20201120_dcp2___20211213_dcp12:/2"
---
>           "sourceId": "dee17f6d-8c5c-4f0f-b692-1277be521c91",
>           "sourceSpec": "tdr:datarepo-a1c89fba:snapshot/hca_prod_005d611a14d54fbf846e571a1f874f70__20220111_dcp2_20220113_dcp12:/0"
61,66c61,66
<               "institution": "EMBL-EBI",
<               "contactName": "Mallory,Ann,Freeberg",
<               "projectRole": "data curator",
<               "laboratory": "Human Cell Atlas Data Coordination Platform",
<               "correspondingContributor": false,
<               "email": "mfreeberg@ebi.ac.uk"
---
>               "institution": "Max Planck Institute for Evolutionary Anthropology",
>               "contactName": "Barbara,,Treutlein",
>               "projectRole": "principal investigator",
>               "laboratory": null,
>               "correspondingContributor": true,
>               "email": "barbara_treutlein@eva.mpg.de"
76a77,84
>               "institution": "EMBL-EBI",
>               "contactName": "Mallory,Ann,Freeberg",
>               "projectRole": "data curator",
>               "laboratory": "Human Cell Atlas Data Coordination Platform",
>               "correspondingContributor": false,
>               "email": "mfreeberg@ebi.ac.uk"
>             },
>             {
83,90d90
<             },
<             {
<               "institution": "Max Planck Institute for Evolutionary Anthropology",
<               "contactName": "Barbara,,Treutlein",
<               "projectRole": "principal investigator",
<               "laboratory": null,
<               "correspondingContributor": true,
<               "email": "barbara_treutlein@eva.mpg.de"

achave11-ucsc commented 2 years ago

Running …

diff projects-dcp12-prod.json projects-dcp12-prod2.json | grep -v -e "source*" -e "service.azul*" | less -S

… removes the expected deterministic differences and exposes the non-deterministic ones.
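
As a complement to the diff/grep approach above, here is a minimal Python sketch that checks whether two dumps differ only in the ordering of list elements. It is an illustration, not part of Azul: the helper names `canonical` and `same_up_to_order` are hypothetical, and it assumes that list order is never significant in the dumps, which would also hide a real ordering difference (for example, in sorted hits).

```python
import json


def canonical(value):
    """Return a copy of a JSON value in which every list is sorted.

    Lists are sorted by the serialized form of their (already canonical)
    elements, so two documents that differ only in list order end up equal.
    Assumption: list order carries no meaning in the compared dumps.
    """
    if isinstance(value, dict):
        return {k: canonical(v) for k, v in value.items()}
    if isinstance(value, list):
        items = [canonical(v) for v in value]
        return sorted(items, key=lambda v: json.dumps(v, sort_keys=True))
    return value


def same_up_to_order(a, b):
    """True if the two JSON values are equal once list order is ignored."""
    return canonical(a) == canonical(b)


if __name__ == "__main__":
    # Hypothetical miniature of the contributor ordering difference
    prod = {"contributors": [{"contactName": "Barbara,,Treutlein"},
                             {"contactName": "Sabina,,Kanton"}]}
    prod2 = {"contributors": [{"contactName": "Sabina,,Kanton"},
                              {"contactName": "Barbara,,Treutlein"}]}
    print(prod == prod2)                   # plain comparison sees a difference
    print(same_up_to_order(prod, prod2))   # order-insensitive comparison does not
```

If `same_up_to_order` returns true for two dumps after the expected source and service URL fields are excluded, the remaining diff consists of ordering differences only.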

melainalegaspi commented 2 years ago

@hannes-ucsc: "Extending spike to come up with list of suspected non-determinisms."

achave11-ucsc commented 2 years ago

Project dumps for catalog dcp1 in both prod and prod2 … projects-dcp1-prod.json.zip projects-dcp1-prod2.json.zip … contain a non-determinism in the ordering of the contributors, like dcp12. The following shows a portion of the diff; the same fields differ in the same way throughout the file.

$ diff projects-dcp1-prod.json projects-dcp1-prod2.json | grep -v -e "source*" -e "service.azul*" | less -S | head -n 25 | tail -n23
63,64c63,64
<               "contactName": "Barbara,,Treutlein",
<               "projectRole": "principal investigator",
---
>               "contactName": "Sabina,,Kanton",
>               "projectRole": null,
66,67c66,67
<               "correspondingContributor": true,
<               "email": "barbara_treutlein@eva.mpg.de"
---
>               "correspondingContributor": false,
>               "email": "sabina_kanton@eva.mpg.de"
79,80c79,80
<               "contactName": "Sabina,,Kanton",
<               "projectRole": null,
---
>               "contactName": "Barbara,,Treutlein",
>               "projectRole": "principal investigator",
82,83c82,83
<               "correspondingContributor": false,
<               "email": "sabina_kanton@eva.mpg.de"
---
>               "correspondingContributor": true,

Project dumps for catalog dcp13 in both prod and prod2 … projects-dcp13-prod.zip projects-dcp13-prod2.zip … contain a series of non-determinisms in the ordering of contributors, publications, samples.id, specimens.id, and protocols. Contributors…

$ diff projects-dcp13-1-prod.json projects-dcp13-1-prod2.json | grep -v -e "source*" -e "service.azul*" | head -n 21 | tail -n 11
69,72c69,72
<               "institution": "Max Planck Institute for Evolutionary Anthropology",
<               "contactName": "Sabina,,Kanton",
<               "projectRole": null,
<               "laboratory": null,
---
>               "institution": "EMBL-EBI",
>               "contactName": "Mallory,Ann,Freeberg",
>               "projectRole": "data curator",
>               "laboratory": "Human Cell Atlas Data Coordination Platform",
74c74

Publications…

$ diff projects-dcp13-1-prod.json projects-dcp13-1-prod2.json | grep -v -e "source*" -e "service.azul*" | head -n 170 | tail -n 14
1257,1262d1256
<               "publicationTitle": "Single-Cell RNA-Seq Reveals Lineage and X Chromosome Dynamics in Human Preimplantation Embryos",
<               "officialHcaPublication": false,
<               "publicationUrl": "https://www.cell.com/fulltext/S0092-8674(16)30280-X#secsectitle0115",
<               "doi": "https://doi.org/10.1016/j.cell.2016.03.023"
<             },
<             {
1266a1261,1266
>             },
>             {
>               "publicationTitle": "Single-Cell RNA-Seq Reveals Lineage and X Chromosome Dynamics in Human Preimplantation Embryos",
>               "officialHcaPublication": false,
>               "publicationUrl": "https://www.cell.com/fulltext/S0092-8674(16)30280-X#secsectitle0115",
>               "doi": "https://doi.org/10.1016/j.cell.2016.03.023"

Samples.Id…

$ diff projects-dcp13-1-prod.json projects-dcp13-1-prod2.json | grep -v -e "source*" -e "service.azul*" | head -n 5824 | tail -n 9
<             "S_hESC_passage2310__Cell2313",
48545,48547c48545,48547
<             "S_hESC_passage2310__Cell2316",
<             "S_hESC_passage2310__Cell2321",
<             "S_hESC_passage2310__Cell233"
---
>             "S_hESC_passage2310__Cell2319",
>             "S_hESC_passage2310__Cell232",
>             "S_hESC_passage2310__Cell2321"

Specimens.Id…

$ diff projects-dcp13-1-prod.json projects-dcp13-1-prod2.json | grep -v -e "source*" -e "service.azul*" | head -n 9351 | tail -n 11
79323c79323
<             "sample144",
---
>             "sample145",
79325d79324
<             "sample158",
79328,79329c79327
<             "sample168",
<             "sample174",
---
>             "sample17",

Protocols…

$ diff projects-dcp13-1-prod.json projects-dcp13-1-prod2.json | grep -v -e "source*" -e "service.azul*" | head -n 3336 | tail -n 6
30233,30234c30233,30234
<           "lastModifiedDate": "2022-01-19T13:45:17.000000Z",
<           "submissionDate": "2022-01-19T13:45:17.000000Z",
---
>           "lastModifiedDate": "2021-11-15T19:11:22.000000Z",
>           "submissionDate": "2021-11-15T19:11:22.000000Z",

achave11-ucsc commented 2 years ago

@hannes-ucsc: "The protocol diff is worrying because it seems to indicate that the dates are chosen arbitrarily from one of the inner entities."
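
To illustrate the concern, here is a minimal Python sketch of picking aggregate dates so that they do not depend on the order in which inner entities are visited. This is an assumption for illustration only, not Azul's implementation or an agreed fix: the function name `aggregate_dates`, the input shape, and the choice of earliest submission and latest modification are all hypothetical. It relies on the timestamps being fixed-width UTC strings, as in the diff above, so that string comparison matches chronological order.

```python
def aggregate_dates(entities):
    """Choose dates from inner entities independently of their order.

    Assumption: every entity is a dict carrying both date fields as
    fixed-width ISO 8601 UTC strings, which sort chronologically.
    """
    return {
        # Earliest submission among the inner entities
        "submissionDate": min(e["submissionDate"] for e in entities),
        # Latest modification among the inner entities
        "lastModifiedDate": max(e["lastModifiedDate"] for e in entities),
    }


if __name__ == "__main__":
    # The two timestamps that appear in the protocol diff above
    a = {"submissionDate": "2022-01-19T13:45:17.000000Z",
         "lastModifiedDate": "2022-01-19T13:45:17.000000Z"}
    b = {"submissionDate": "2021-11-15T19:11:22.000000Z",
         "lastModifiedDate": "2021-11-15T19:11:22.000000Z"}
    # Same result regardless of the order of the inner entities
    print(aggregate_dates([a, b]) == aggregate_dates([b, a]))
```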