chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License
64 stars 14 forks source link

RDS and CXG conversions should be independent operations #5660

Closed brianraymor closed 1 year ago

brianraymor commented 1 year ago

Context

See single-cell-data-wrangling.

Recently, RDS conversions have been especially brittle. Currently, a RDS conversion failure will result in the cancellation of a pending CXG conversion although there should be no correlation between potential RDS and CXG failures. As a result, curators and submitters are unable to view the Explorer visualization until the RDS conversion is addressed. This is currently blocking a submitter review.

I do not remember if there was a rationale for this design such as fast failing.

CC: @Bento007 since this issue is based on a personal slack conversation. @atolopko-czi

atolopko-czi commented 1 year ago

I do not remember if there was a rationale for this design such as fast failing.

I'm not aware of any technical reason why fast failing behavior is required here. But we may have to consider the processing status logic to ensure that partial failures (e.g. just RDS) can be retried appropriately.

We actually want to eventually have the CXG generation be moved into a separate pipeline, so it can be owned by the Viz team. It would only be loosely coupled to the dataset submission pipeline via an event notification when a "labeled" h5ad is ready, and not an RDS. This might be worth considering when we take on this issue.

brianraymor commented 1 year ago

@nayib-jose-gloria - added to schema 4 epic for visibility. Please consider.

Bento007 commented 1 year ago

Something like this would work.

danieljhegeman commented 1 year ago

As a result, curators and submitters are unable to view the Explorer visualization until the RDS conversion is addressed.

Noting that, currently, it also blocks publishing.

@brianraymor Can you please advise on whether we should start allowing publishing in cases where seurat conversion failed?

metakuni commented 1 year ago

On a side note, I think I recall from a long time ago that despite being defined as parallel steps in the step function that the Seurat and CXG conversions actually run sequentially.

@ebezzi , is this still the case?

danieljhegeman commented 1 year ago

If the reported TaskStarted times are to be believed in the step function logs, then the cxg and seurat steps/tasks are indeed being launched in parallel rather than in sequence.

ebezzi commented 1 year ago

They indeed run in parallel. My recommendation here is to change the Seurat job to return a success code instead of a failure, so that won't block the step function execution. The failure will still be handled by the processing status management we have in the database.

brianraymor commented 1 year ago

@brianraymor Can you please advise on whether we should start allowing publishing in cases where seurat conversion failed?

It depends. If the seurat conversion failed due to the known R sparse matrix limitation, then publication is allowed. Otherwise, publication is not allowed.

danieljhegeman commented 1 year ago

I will test by forcing seurat failure for one specific test Collection in prod only once we deploy to prod on Thursday 11/2