chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.
MIT License

Bandwidth-related performance of Census data access is enhanced. #453

Closed pablo-gar closed 9 months ago

pablo-gar commented 1 year ago

The Census API's primary goal is to provide access patterns to the Census TileDB-SOMA object. In Census V1, the priority was to provide a stable, working API and data hosting strategy.

For Census V2, a main focus is to continue our efforts to enable data access and analysis at scale. Performance improvements can enable and accelerate these journeys.

This epic relates specifically to bandwidth issues.

Context

Challenges

Stories

  1. I am a computational biologist who would like to apply a lightweight machine learning method to the Census data. My model needs to make several passes over all Census data, and downloading the data is the slowest part of my pipeline.
  2. I am a computational biologist in Australia with strong bandwidth capacity, and I would like to get medium to large slices of data on a consistent basis for downstream analysis. However, accessing the US west coast Census location introduces a bandwidth cap.

Product requirements

  1. The Census API can access mirrors of the Census Data across the major worldwide locations.
     1. Suggested: US west coast, US east coast, Europe, East Asia, Australia.
  2. The Census Data can be easily downloaded and accessed locally.
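
To make requirement 1 concrete, here is a minimal sketch of how a client could choose among mirrors by measured latency. It is not the Census API's implementation: the `Mirror` class, the `MIRRORS` directory, the `pick_mirror` helper, and every URI below are hypothetical names invented for illustration, and the latency probe is passed in as a plain function so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Mirror:
    """A hypothetical mirror entry; names and URIs are illustrative only."""
    name: str
    region: str
    base_uri: str


# Hypothetical directory covering the suggested locations (no real endpoints).
MIRRORS: Dict[str, Mirror] = {
    "us-west": Mirror("us-west", "US west coast", "s3://example-census-us-west/"),
    "us-east": Mirror("us-east", "US east coast", "s3://example-census-us-east/"),
    "europe": Mirror("europe", "Europe", "s3://example-census-europe/"),
    "east-asia": Mirror("east-asia", "East Asia", "s3://example-census-east-asia/"),
    "australia": Mirror("australia", "Australia", "s3://example-census-australia/"),
}


def pick_mirror(
    mirrors: Dict[str, Mirror],
    measure_latency_ms: Callable[[Mirror], Optional[float]],
    default: str = "us-west",
) -> Mirror:
    """Return the mirror with the lowest measured latency.

    `measure_latency_ms` returns None when a mirror cannot be reached;
    if no mirror responds, the `default` mirror is returned.
    """
    best: Optional[Mirror] = None
    best_latency: Optional[float] = None
    for mirror in mirrors.values():
        latency = measure_latency_ms(mirror)
        if latency is None:
            continue  # unreachable mirror, skip it
        if best_latency is None or latency < best_latency:
            best, best_latency = mirror, latency
    return best if best is not None else mirrors[default]


# Usage with made-up latencies, as a user in Australia might observe them.
fake_latencies = {
    "us-west": 160.0,
    "us-east": 210.0,
    "europe": 290.0,
    "east-asia": 120.0,
    "australia": 15.0,
}
chosen = pick_mirror(MIRRORS, lambda m: fake_latencies[m.name])
print(chosen.name, chosen.base_uri)
```

Latency is only a proxy for throughput; a real client might instead let the user name a mirror explicitly, which would also cover the case where the closest mirror is not the fastest one.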

Anticipated resources needed

brianraymor commented 1 year ago

RE: "2. The Census Data can be easily downloaded and accessed locally." @pablo-gar, can you clarify per the comment thread in your draft Census roadmap?

pablo-gar commented 9 months ago

Functionality is completed; mirrors have not been uploaded due to prioritization.