Clinical-Genomics / cg

Glue between Clinical Genomics apps
8 stars 2 forks source link

Not compressible samples are being compressed #2801

Open ivadym opened 9 months ago

ivadym commented 9 months ago

Description

We're compressing validation samples that are not expected to be compressed (for example: ACC10995A1, ACC11273A1, ACC5821A7, etc).

It seems that get_cases_to_compress function does not consider the scenario where multiple cases may share samples. Therefore, when retrieving cases for compression, the function overlooks that samples from a compressible case might also be associated with a non-compressible case.

Suggested solution

Two options:

  1. Modify the get_cases_to_compress function to recognise that a sample associated with a compressible case might also be linked to a non-compressible one.
  2. Move the is_compressible field from the sample to the case level. This would entail:
    • Updating the clean_fastq function so that fastq cleaning occurs at the sample level.
    • Dropping the is_compressible column from cases.
    • Adding a new column at the sample level and updating it accordingly.
  3. Detach validation cases and samples from actual customer-ordered samples

    This can be closed when

    No more validation cases are compressed unexpectedly.

Blocked by

N/A

seallard commented 9 months ago

If a case is marked as not compressible, it will not be returned by get_cases_to_compress. If I understand the problem correctly, it is that samples should be marked as not compressible instead of cases? Maybe we need to rewrite the compression logic to work on a sample basis.

If validation cases are being compressed, which are they and have we marked them as not compressible? If validation samples are associated with other cases than validation cases, why?

ChrOertlin commented 9 months ago

I think it gets the samples associated to a case at some point to decompress/compress them. Maybe we can add a check whether they their is_compressible attribute is true. I think this is the function compress_sample_fastqs_in_cases

islean commented 9 months ago

I agree that it is most likely in compress_sample_fastqs_in_cases, but in an earlier part of that flow, get_cases_to_process is called, where the filtering on whether a case is compressible is done. So the issue is not that the check is not being made, as far as I can tell, rather than the sample belonging to several cases, only some of which are marked as not compressible. Sounds weird though as @seallard has already highlighted. Either that is addressed, or we put the is_compressible on sample level I think, given that fastq/spring files are always in sample bundles either way.

ivadym commented 9 months ago

If a case is marked as not compressible, it will not be returned by get_cases_to_compress. If I understand the problem correctly, it is that samples should be marked as not compressible instead of cases? Maybe we need to rewrite the compression logic to work on a sample basis.

Yes, we need to make the check sample aware and that could be a potential solution

If validation cases are being compressed, which are they and have we marked them as not compressible? If validation samples are associated with other cases than validation cases, why?

For cancer analysis it is usually the normal sample that is being shared between different cases

ivadym commented 9 months ago

We had a bit of a refinement during our Cancer meeting and concluded that it's probably best to move the is_compressible logic to the sample level, given that decompression is also performed there

islean commented 9 months ago

So will that mean that samples will be marked as incompressible or will it mean cases will be marked but we should for each sample check if any connected case is marked?

ivadym commented 9 months ago

We should mark samples as compressible/incompressible

seallard commented 6 months ago

If actual samples are used as validation samples, could we run into a situation where an analysis is re-ordered for the validation sample and would it be a problem that it is marked as not compressible?

ivadym commented 6 months ago

I can't think of a scenario where it could cause problems. But it's a good point, maybe we should create independent cases and samples for the validations.

I opened an issue in Balsamic early this week, but probably it should be extended to all pipelines: https://github.com/Clinical-Genomics/BALSAMIC/issues/1426

Vince-janv commented 6 months ago

Decision:

diitaz93 commented 2 weeks ago

@ChrOertlin has a PR open for this