cortes-ciriano-lab / savana

Somatic structural variant caller for long-read data
Apache License 2.0

BrokenPipeError: [Errno 32] Broken pipe #40

Closed: ywzhang071394 closed this issue 1 month ago

ywzhang071394 commented 2 months ago

Hi,

SAVANA works well on a small BAM file but reports "BrokenPipeError: [Errno 32] Broken pipe" on a larger BAM file with ~50x depth. Why does this happen?

Also, I do not see a model for PacBio HiFi. Does this mean SAVANA is not compatible with PacBio WGS data?

Thank you!

helrick commented 1 month ago

Hi there! Thanks for using SAVANA. This error could be due to an out-of-memory error on the larger BAM file. Are you specifying the threads argument? By default, SAVANA uses all available threads. For BAMs with higher depth, I would recommend manually lowering this, for example to 8 threads, by adding --threads 8 to your command-line arguments.
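As a rough illustration of capping the thread count, the sketch below takes the smaller of the machine's CPU count and a chosen cap, then builds the `--threads` argument to append to an existing SAVANA command. The helper name `threads_args` and the default cap of 8 are illustrative assumptions; only the `--threads` flag itself comes from the reply above.

```python
import os


def threads_args(cap: int = 8) -> list[str]:
    """Return a ['--threads', 'N'] pair with N capped at `cap`.

    `threads_args` is a hypothetical helper, not part of SAVANA.
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    n = max(1, min(available, cap))  # never go below one thread
    return ["--threads", str(n)]


if __name__ == "__main__":
    # Append the printed pair to whatever SAVANA command you already run.
    print(" ".join(threads_args()))
```

If the error persists with 8 threads, lowering the cap further is a cheap way to check whether memory is the cause.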

For your second question, SAVANA is compatible with PacBio, but there is no model trained on PacBio data available. I would recommend looking into the Classify by Parameter File section in the README to set filtering thresholds for PacBio. For example:

{
        "somatic": {
                "MAX_NORMAL_SUPPORT": 0,
                "MIN_TUMOUR_SUPPORT": 7,
                "MAX_ORIGIN_STARTS_STD_DEV": 50,
                "MAX_END_STARTS_STD_DEV": 50,
                "MIN_ORIGIN_MAPQ_MEAN": 40,
                "MIN_END_MAPQ_MEAN": 40,
                "MAX_ORIGIN_EVENT_SIZE_STD_DEV": 60,
                "MAX_END_EVENT_SIZE_STD_DEV": 60
        }
}
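One way to avoid hand-editing mistakes is to generate the parameter file from a script. The sketch below writes the thresholds shown above to a JSON file and reads them back as a validity check. The function name `write_params` and the filename `pacbio_params.json` are hypothetical; how the file is passed to SAVANA is described in the README section mentioned above and is not shown here.

```python
import json
from pathlib import Path

# Thresholds copied from the example above.
SOMATIC_PARAMS = {
    "somatic": {
        "MAX_NORMAL_SUPPORT": 0,
        "MIN_TUMOUR_SUPPORT": 7,
        "MAX_ORIGIN_STARTS_STD_DEV": 50,
        "MAX_END_STARTS_STD_DEV": 50,
        "MIN_ORIGIN_MAPQ_MEAN": 40,
        "MIN_END_MAPQ_MEAN": 40,
        "MAX_ORIGIN_EVENT_SIZE_STD_DEV": 60,
        "MAX_END_EVENT_SIZE_STD_DEV": 60,
    }
}


def write_params(path: Path, params: dict = SOMATIC_PARAMS) -> Path:
    """Write the thresholds as JSON, then re-read them to confirm validity.

    `write_params` is a hypothetical helper, not part of SAVANA.
    """
    path.write_text(json.dumps(params, indent=4) + "\n")
    # Round-trip check: fails loudly if the file is not valid JSON.
    assert json.loads(path.read_text()) == params
    return path


if __name__ == "__main__":
    # "pacbio_params.json" is a hypothetical filename.
    print(write_params(Path("pacbio_params.json")))
```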

In the next version of SAVANA there will be a --pb flag that automatically applies filters we've found to work well on our PacBio data.

ywzhang071394 commented 1 month ago

Thank you for the helpful reply! Hopefully SAVANA will keep improving with your efforts!

ywzhang071394 commented 1 month ago

Hi,

I tried using custom parameters to reclassify VCF files and found that this option is not allowed together with --model. I therefore suppose that somatic calling on PacBio data (as you suggested above) relies on hard cutoffs. Is this true? If so, have you evaluated the performance? I am curious whether the PacBio parameters can achieve performance comparable to the model used for ONT.
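To make "hard cutoffs" concrete, here is a minimal sketch of threshold-based filtering using the values from the parameter file earlier in the thread. It is not SAVANA's implementation. It assumes that a `MIN_X` key means feature `X` must be at least the given value and a `MAX_X` key means at most the given value. The function `passes_cutoffs`, the feature names (the keys with their prefix stripped), and the candidate values are all invented for illustration.

```python
# Thresholds from the parameter file example earlier in the thread.
THRESHOLDS = {
    "MAX_NORMAL_SUPPORT": 0,
    "MIN_TUMOUR_SUPPORT": 7,
    "MAX_ORIGIN_STARTS_STD_DEV": 50,
    "MAX_END_STARTS_STD_DEV": 50,
    "MIN_ORIGIN_MAPQ_MEAN": 40,
    "MIN_END_MAPQ_MEAN": 40,
    "MAX_ORIGIN_EVENT_SIZE_STD_DEV": 60,
    "MAX_END_EVENT_SIZE_STD_DEV": 60,
}


def passes_cutoffs(features: dict, thresholds: dict = THRESHOLDS) -> bool:
    """Return True if every feature satisfies its MIN_/MAX_ threshold.

    Assumption (not confirmed by the thread): MIN_X requires X >= value,
    MAX_X requires X <= value.
    """
    for key, limit in thresholds.items():
        bound, _, name = key.partition("_")  # "MAX_FOO" -> ("MAX", "_", "FOO")
        value = features[name]
        if bound == "MIN" and value < limit:
            return False
        if bound == "MAX" and value > limit:
            return False
    return True


if __name__ == "__main__":
    # Invented feature values for a single candidate variant.
    candidate = {
        "NORMAL_SUPPORT": 0,
        "TUMOUR_SUPPORT": 9,
        "ORIGIN_STARTS_STD_DEV": 12.0,
        "END_STARTS_STD_DEV": 20.5,
        "ORIGIN_MAPQ_MEAN": 58.0,
        "END_MAPQ_MEAN": 55.0,
        "ORIGIN_EVENT_SIZE_STD_DEV": 10.0,
        "END_EVENT_SIZE_STD_DEV": 8.0,
    }
    print(passes_cutoffs(candidate))
    # One supporting read in the normal sample breaks MAX_NORMAL_SUPPORT.
    print(passes_cutoffs({**candidate, "NORMAL_SUPPORT": 1}))
```

Unlike a trained model, a filter of this kind makes an all-or-nothing decision per feature, which is why how it compares with the ONT model is an open question in this thread.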

Thanks!