HumanCellAtlas / dcp2

Shared artifacts concerning the Human Cell Atlas (HCA) Data Coordination Platform (DCP)

TDR's enumerateSnapshots response lacks name of Google project #54

Closed hannes-ucsc closed 2 years ago

hannes-ucsc commented 2 years ago

The response to a request to the enumerateSnapshots endpoint does not include the name of the Google project that hosts the BQ tables:

$ curl -X GET "https://jade-terra.datarepo-prod.broadinstitute.org/api/repository/v1/snapshots?limit=1" -H "accept: application/json" -H "authorization: Bearer REDACTED" | jq
{
  "total": 22,
  "filteredTotal": 22,
  "items": [
    {
      "id": "ec40aa9f-43aa-4839-98e3-6362c96a0bee",
      "name": "hca_prod_20201120_dcp2___20201124",
      "description": "Create snapshot hca_prod_20201120_dcp2___20201124",
      "createdDate": "2020-11-24T19:41:37.611318Z",
      "profileId": "db61c343-6dfe-4d14-84e9-60ddf97ea73f",
      "storage": [
        {
          "region": "us",
          "cloudResource": "bigquery",
          "cloudPlatform": "gcp"
        },
        {
          "region": "us-central1",
          "cloudResource": "bucket",
          "cloudPlatform": "gcp"
        },
        {
          "region": "us-central1",
          "cloudResource": "firestore",
          "cloudPlatform": "gcp"
        }
      ]
    }
  ]
}

The retrieveSnapshot response does (note the dataProject property):

$ curl -X GET "https://jade-terra.datarepo-prod.broadinstitute.org/api/repository/v1/snapshots/ec40aa9f-43aa-4839-98e3-6362c96a0bee?include=SOURCES&include=DATA_PROJECT" -H "accept: application/json" -H "authorization: Bearer REDACTED" | jq
{
  "id": "ec40aa9f-43aa-4839-98e3-6362c96a0bee",
  "name": "hca_prod_20201120_dcp2___20201124",
  "description": "Create snapshot hca_prod_20201120_dcp2___20201124",
  "createdDate": "2020-11-24T19:41:37.611318Z",
  "source": [
    {
      "dataset": {
        "id": "d30e68f8-c826-4639-88f3-ae35f00d4185",
        "name": "hca_prod_20201120_dcp2",
        "description": "Human Cell Atlas",
        "defaultProfileId": "db61c343-6dfe-4d14-84e9-60ddf97ea73f",
        "createdDate": "2020-11-20T19:46:28.951142Z",
        "storage": [
          {
            "region": "us",
            "cloudResource": "bigquery",
            "cloudPlatform": "gcp"
          },
          {
            "region": "us-central1",
            "cloudResource": "bucket",
            "cloudPlatform": "gcp"
          },
          {
            "region": "us-central1",
            "cloudResource": "firestore",
            "cloudPlatform": "gcp"
          }
        ]
      },
      "asset": null
    }
  ],
  "tables": null,
  "relationships": null,
  "profileId": null,
  "dataProject": "broad-datarepo-terra-prod-hca2",
  "accessInformation": null
}

Azul needs the Google project name to compose BQ queries against the tables in a snapshot. We also prefer the enumerateSnapshots endpoint because it efficiently returns information about multiple snapshots at once, but the lack of the Google project name in its response forces us to also hit the retrieveSnapshot endpoint for each snapshot individually. So we currently need to make N + 1 requests instead of 1. This is aggravated by the fact that N is now large (>100) since we intend to create one snapshot per HCA project.

It seems that it should be relatively easy to add the dataProject property to the enumerateSnapshots response. Doing so would greatly reduce the latency of certain Azul requests, enhancing the overall user experience and reducing complexity in the Azul code base.
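
To make the request counts concrete, here is a minimal sketch of both access patterns. It is not Azul's actual code: `get_json` is a hypothetical callable that performs an authenticated GET against TDR and returns the parsed JSON body, and the `limit` default is an arbitrary assumption. Only the endpoint paths and the `items`, `id`, `name` and `dataProject` properties are taken from the responses shown above.

```python
from typing import Callable, Dict

# Hypothetical transport: takes a URL path, returns the parsed JSON body.
GetJson = Callable[[str], dict]

BASE = '/api/repository/v1/snapshots'


def projects_via_retrieve(get_json: GetJson, limit: int = 1000) -> Dict[str, str]:
    """Current workaround: one enumerate request plus one retrieve request
    per snapshot, i.e. N + 1 requests for N snapshots."""
    listing = get_json(f'{BASE}?limit={limit}')
    return {
        item['name']: get_json(f"{BASE}/{item['id']}?include=DATA_PROJECT")['dataProject']
        for item in listing['items']
    }


def projects_via_enumerate(get_json: GetJson, limit: int = 1000) -> Dict[str, str]:
    """Requested behavior: a single enumerate request suffices, assuming
    each item in the response carries a dataProject property."""
    listing = get_json(f'{BASE}?limit={limit}')
    return {item['name']: item['dataProject'] for item in listing['items']}
```

Both functions return a mapping from snapshot name to Google project name; the second one only works once the enumerate response includes the property.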

melainalegaspi commented 2 years ago

@hannes-ucsc: "I pinged the TDR team on Slack 12/1/2021. Waiting for a response."

hannes-ucsc commented 2 years ago

Unclear whether this is planned and, if so, for when. Making this a blocker of DataBiosphere/Azul#3572.

theathorn commented 2 years ago

Slack thread.

hannes-ucsc commented 2 years ago

Broad informed us that they have fixed the issue and deployed it to dev:

https://humancellatlas.slack.com/archives/C01360XN04S/p1642791059015600?thread_ts=1638394750.175200&cid=C01360XN04S

nadove-ucsc commented 2 years ago

Confirmed that the element is present on dev, although it is pluralized as "dataProjects" instead of the expected "dataProject".

$ curl -s "https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots?direction=asc&limit=10&offset=0&sort=created_date" -H "authorization: Bearer $auth_token" | jq '.items[].dataProjects'
"broad-jade-dev-data"
"broad-jade-dev-data"
"broad-jade-dev-data"
"broad-jade-dev-data"
"broad-jade-dev-data"
"broad-jade-dev-data"
"broad-jade-dev-data"
"broad-jade-dev-data"
"broad-jade-dev-data"
"broad-jade-dev-data"
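
Since the property name differs between the two endpoints on dev, a client could tolerate either spelling. The helper below is only a sketch under that assumption; `data_project` is a hypothetical name, and it treats the value as a plain string because that is what the jq output above shows.

```python
from typing import Mapping, Optional


def data_project(snapshot: Mapping[str, object]) -> Optional[str]:
    """Return the Google project name from a snapshot JSON object, or None.

    Accepts the singular key seen in the retrieveSnapshot response as well
    as the plural key observed in the enumerateSnapshots response on dev.
    """
    for key in ('dataProject', 'dataProjects'):
        value = snapshot.get(key)
        if isinstance(value, str) and value:
            return value
    return None
```

A `None` result then signals a TDR instance that does not yet expose the property in the enumerate response, in which case the caller would have to fall back to retrieveSnapshot.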

theathorn commented 2 years ago

From Nicolas Malfroy-Camine: "Changes just went live on Prod (data.terra.bio)".

nadove-ucsc commented 2 years ago

https://data.terra.bio is used by Azul's prod2 instance, but the changes are not observable on https://jade-terra.datarepo-prod.broadinstitute.org, which is used by Azul's prod instance.

melainalegaspi commented 2 years ago

@hannes-ucsc: "Assuming that the Broad is not going to deploy this to TDR old prod, #3572 needs to wait until we switch to TDR new prod. We created #3782 and made it a blocker of #3572."