Closed brianraymor closed 1 year ago
I do not remember if there was a rationale for this design such as fast failing.
I'm not aware of any technical reason why fast failing behavior is required here. But we may have to consider the processing status logic to ensure that partial failures (e.g. just RDS) can be retried appropriately.
We actually want to eventually have the CXG generation be moved into a separate pipeline, so it can be owned by the Viz team. It would only be loosely coupled to the dataset submission pipeline via an event notification when a "labeled" h5ad is ready, and not an RDS. This might be worth considering when we take on this issue.
@nayib-jose-gloria - added to schema 4 epic for visibility. Please consider.
As a result, curators and submitters are unable to view the Explorer visualization until the RDS conversion is addressed.
Noting that, currently, it also blocks publishing.
@brianraymor Can you please advise on whether we should start allowing publishing in cases where seurat conversion failed?
On a side note, I think I recall from a long time ago that despite being defined as parallel steps in the step function that the Seurat and CXG conversions actually run sequentially.
@ebezzi , is this still the case?
If the reported TaskStarted
times are to be believed in the step function logs, then the cxg and seurat steps/tasks are indeed being launched in parallel rather than in sequence.
They indeed run in parallel. My recommendation here is to change the Seurat job to return a success code instead of a failure, so that won't block the step function execution. The failure will still be handled by the processing status management we have in the database.
@brianraymor Can you please advise on whether we should start allowing publishing in cases where seurat conversion failed?
It depends. If the seurat conversion failed due to the known R sparse matrix limitation, then publication is allowed. Otherwise, publication is not allowed.
I will test by forcing seurat failure for one specific test Collection in prod only once we deploy to prod on Thursday 11/2
Context
See single-cell-data-wrangling.
Recently, RDS conversions have been especially brittle. Currently, a RDS conversion failure will result in the cancellation of a pending CXG conversion although there should be no correlation between potential RDS and CXG failures. As a result, curators and submitters are unable to view the Explorer visualization until the RDS conversion is addressed. This is currently blocking a submitter review.
I do not remember if there was a rationale for this design such as fast failing.
CC: @Bento007 since this issue is based on a personal slack conversation. @atolopko-czi