Open ivadym opened 9 months ago
If a case is marked as not compressible, it will not be returned by get_cases_to_compress
. If I understand the problem correctly, it is that samples should be marked as not compressible instead of cases? Maybe we need to rewrite the compression logic to work on a sample basis.
If validation cases are being compressed, which are they and have we marked them as not compressible? If validation samples are associated with other cases than validation cases, why?
I think it gets the samples associated to a case at some point to decompress/compress them. Maybe we can add a check whether they their is_compressible
attribute is true. I think this is the function compress_sample_fastqs_in_cases
I agree that it is most likely in compress_sample_fastqs_in_cases
, but in an earlier part of that flow, get_cases_to_process
is called, where the filtering on whether a case is compressible is done. So the issue is not that the check is not being made, as far as I can tell, rather than the sample belonging to several cases, only some of which are marked as not compressible. Sounds weird though as @seallard has already highlighted. Either that is addressed, or we put the is_compressible on sample level I think, given that fastq/spring files are always in sample bundles either way.
If a case is marked as not compressible, it will not be returned by
get_cases_to_compress
. If I understand the problem correctly, it is that samples should be marked as not compressible instead of cases? Maybe we need to rewrite the compression logic to work on a sample basis.
Yes, we need to make the check sample aware and that could be a potential solution
If validation cases are being compressed, which are they and have we marked them as not compressible? If validation samples are associated with other cases than validation cases, why?
For cancer analysis it is usually the normal sample that is being shared between different cases
We had a bit of a refinement during our Cancer meeting and concluded that it's probably best to move the is_compressible
logic to the sample level, given that decompression is also performed there
So will that mean that samples will be marked as incompressible or will it mean cases will be marked but we should for each sample check if any connected case is marked?
We should mark samples as compressible/incompressible
If actual samples are used as validation samples, could we run into a situation where an analysis is re-ordered for the validation sample and would it be a problem that it is marked as not compressible?
I can't think of a scenario where it could cause problems. But it's a good point, maybe we should create independent cases and samples for the validations.
I opened an issue in Balsamic early this week, but probably it should be extended to all pipelines: https://github.com/Clinical-Genomics/BALSAMIC/issues/1426
@ChrOertlin has a PR open for this
Description
We're compressing validation samples that are not expected to be compressed (for example:
ACC10995A1
,ACC11273A1
,ACC5821A7
, etc).It seems that
get_cases_to_compress
function does not consider the scenario where multiple cases may share samples. Therefore, when retrieving cases for compression, the function overlooks that samples from a compressible case might also be associated with a non-compressible case.Suggested solution
Two options:
get_cases_to_compress
function to recognise that a sample associated with a compressible case might also be linked to a non-compressible one.is_compressible
field from the sample to the case level. This would entail:clean_fastq
function so that fastq cleaning occurs at the sample level.is_compressible
column from cases.This can be closed when
No more validation cases are compressed unexpectedly.
Blocked by
N/A