DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Support for Terra on Azure #5610

Open bvizzier-ucsc opened 11 months ago

bvizzier-ucsc commented 11 months ago

The Broad is in the process of developing support for Terra running in the Azure cloud environment. AnVIL will be one of the ecosystems using Terra/Azure and Azul will need to index TDR datasets hosted in Terra/Azure.

The preliminary scoping document can be found here.

Edit: Moving over limited information from azul/#5952 since this issue has relevant content.

There have been several discussions about this and the Phase 1 plans are for the Data Explorer to:

Index the data hosted on Azure by importing the Azure Synapse Parquet files into BigQuery, and then use BigQuery to explore data hosted on both Azure and GCP.
Hand off references to the user-selected data to Terra/Azure for analysis.
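The import step in the first bullet could be prototyped with the BigQuery client library. A minimal sketch, assuming the Parquet files exported from Azure Synapse have first been staged in a GCS bucket; the bucket path, project, and dataset names are hypothetical, not Azul's actual configuration:

```python
def table_id_for_export(project: str, dataset: str, table: str) -> str:
    """Build the fully qualified BigQuery table ID for a re-imported Synapse table."""
    return f'{project}.{dataset}.{table}'

def load_parquet_export(gcs_uri_glob: str, table_id: str) -> None:
    """Load Parquet files staged in GCS (after export from Azure Synapse) into BigQuery."""
    # Deferred import; requires the google-cloud-bigquery package and GCP credentials.
    from google.cloud import bigquery
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    # Blocks until the load job completes or raises on failure.
    client.load_table_from_uri(gcs_uri_glob, table_id, job_config=job_config).result()

# Hypothetical usage:
# load_parquet_export('gs://staging-bucket/synapse-export/*.parquet',
#                     table_id_for_export('my-project', 'anvil', 'file'))
```

Schema mapping between the Synapse export and the existing TDR/BigQuery table layout would still need to be verified; this only covers the mechanical load.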

A second, later phase will involve using Synapse directly to process the data on Azure.

For AnVIL, a small subset of the data will continue to be co-hosted on GCP in addition to Azure. This is relevant because users wanting to do analysis on a specific platform may incur ingress and egress charges. Since some of the data will be replicated, the user will need to be able to view and filter by the cloud service where the data resides.
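One way to model the filter-by-cloud requirement is for each indexed file to carry the set of clouds it is hosted on. A minimal sketch; the `clouds` field and record shape are assumptions for illustration, not Azul's actual index schema:

```python
from typing import Iterable

def filter_by_cloud(files: Iterable[dict], cloud: str) -> list[dict]:
    """Keep only the files hosted on the given cloud (e.g. 'gcp' or 'azure')."""
    return [f for f in files if cloud in f.get('clouds', ())]

# Hypothetical index documents: one replicated file, one Azure-only file.
files = [
    {'name': 'sample1.cram', 'clouds': ['azure', 'gcp']},
    {'name': 'sample2.cram', 'clouds': ['azure']},
]
```

In Azul's actual Elasticsearch-backed index this would presumably become a facet on a keyword field rather than an in-memory filter, but the document-level modeling is the same.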

achave11-ucsc commented 11 months ago

@hannes-ucsc: "Hannes to study documentation on Parquet files, Azure Synapse Analytics, and Azure Active Directory B2C. @bvizzier-ucsc to follow up with the Broad on obtaining documentation about SAM and TDR support for Azure. Benedict, Rachel and Ben will determine if, at least for a transitional period, all metadata can be kept in BigQuery on GCP, even for datasets whose data files are hosted only on Azure. This will impact how much work will be required from us before the April 2024 deadline."

bvizzier-ucsc commented 6 months ago

@hannes-ucsc Can we get an estimate of when this work can be done?

As previously discussed but not captured here, exporting the parquet files from Azure and importing them into BigQuery is acceptable as a near-term solution.

hannes-ucsc commented 6 months ago

We would first need to prototype the Parquet export and re-import into BigQuery before we can give any meaningful estimate of how much work that is. As you know, the team is fully loaded at the moment with compliance work, the upcoming compliance assessment, the MA pilot for HCA, taking over the browser for HCA production, and the verbatim handover. The next challenge would therefore be to determine what work we can forfeit in order to squeeze the prototype in.

I don't understand the ticket structure. The epic and this child seem to be about the same thing. Lastly, it would really help to have all the relevant information (docs, Slack threads) linked from one place, ideally the description of the epic.

dsotirho-ucsc commented 6 months ago

Assignee to clean up ticket structure and collect all relevant documentation in the description of the top-level epic.

bvizzier-ucsc commented 5 months ago

I recommend promoting this issue to an epic and closing https://github.com/DataBiosphere/azul/issues/5952 as a duplicate.

dsotirho-ucsc commented 5 months ago

Assignee to consider next steps.