Sydney-Informatics-Hub / Germline-StructuralV-nf

Germline structural variant calling pipeline for short read WGS datasets
GNU General Public License v3.0

Pipeline feedback #30

Closed gavinmonahan closed 12 months ago

gavinmonahan commented 1 year ago

Hi! I'm running this pipeline on Setonix and I'm enjoying it so far 😀 Not so much an issue, but I had a few things I wanted to mention/request regarding the pipeline/AnnotSV output, mostly based on my experience using Manta + AnnotSV:

  1. Gene annotations are set to full - this removes the 'split' lines containing transcript specific information which is needed for prioritising variants, in particular the frameshift column. The drawback is it will add at least one extra line per entry.
  2. Merging all of the cohort samples before annotating would be beneficial, as this would make filtering based on pedigree much easier. I think this is something you are already looking at, but I generally just do this in Excel with the merged cohort AnnotSV outputs. A better way of merging and filtering these could also help to find compound het variants. This would also make it easier to remove false positive SVs and result in a smaller overall output and a shorter AnnotSV runtime. JasmineSV could be useful here.
  3. Similar to above, I think it could also be beneficial to run Manta on the whole cohort rather than individually, as it can then genotype each family/cohort member at each variant and provide quality scores for these, which are missing if they are all called individually. It may also reduce some false positives. I'm not sure if it's possible to do this with other callers? I previously ran Manta using --bam (joint diploid analysis) rather than --normalBam, which I believe is for tumor analysis.
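The joint diploid setup described in point 3 can be sketched as a small command builder. This is a minimal illustration, not part of the pipeline: the function name and the file names are hypothetical, and it only assembles the `configManta.py` configuration call (one `--bam` per family member) without running it.

```python
import shlex


def manta_joint_config_cmd(bams, reference_fasta, run_dir):
    """Build a configManta.py command for joint diploid calling.

    Hypothetical helper: it only assembles the argument list.
    """
    if not bams:
        raise ValueError("at least one BAM is required")
    cmd = ["configManta.py"]
    for bam in bams:
        # Repeating --bam for each family member requests joint germline analysis
        cmd += ["--bam", str(bam)]
    cmd += ["--referenceFasta", str(reference_fasta), "--runDir", str(run_dir)]
    return cmd


# Hypothetical trio; file names are placeholders
trio = ["proband.bam", "mother.bam", "father.bam"]
cmd = manta_joint_config_cmd(trio, "ref.fasta", "manta_trio")
print(shlex.join(cmd))
```

Building the command as a list keeps each path as a separate argument, so it can be passed to `subprocess.run` without shell quoting issues.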

Thanks for making such a fast, comprehensive, and easy to use pipeline! Cheers, Gavin 😊

gavinmonahan commented 1 year ago

There are also a few SVs below 50bp, mostly >40bp, in my first batch. Is it possible to include an optional -SVminSize flag for AnnotSV?
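As a rough sketch of how the two requested options (annotation mode and minimum SV size) might be exposed together, the snippet below assembles an AnnotSV command line. The function name, defaults, and file names are assumptions for illustration only; how the pipeline actually wires these parameters is not shown in this thread.

```python
def annotsv_cmd(vcf, out_tsv, annotation_mode="both", sv_min_size=50,
                genome_build="GRCh38"):
    """Assemble an AnnotSV command (hypothetical helper, does not run it)."""
    allowed_modes = {"both", "full", "split"}
    if annotation_mode not in allowed_modes:
        raise ValueError(f"annotation_mode must be one of {sorted(allowed_modes)}")
    if sv_min_size < 1:
        raise ValueError("sv_min_size must be a positive number of base pairs")
    return [
        "AnnotSV",
        "-SVinputFile", str(vcf),
        "-outputFile", str(out_tsv),
        "-genomeBuild", genome_build,
        "-annotationMode", annotation_mode,  # full, split, or both
        "-SVminSize", str(sv_min_size),      # lower this to keep SVs under 50 bp
    ]


# Placeholder file names; 40 bp keeps the >40 bp calls mentioned above
cmd = annotsv_cmd("sample_merged.vcf", "sample_annotsv.tsv",
                  annotation_mode="split", sv_min_size=40)
print(" ".join(cmd))
```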

georgiesamaha commented 1 year ago

Hi @gavinmonahan

Thank you for all the feedback 😄

In response to your points:

  1. This was an oversight on my part; I forgot to come back to it. Sorry! I will add an optional parameter to allow you to choose either full, split, or both. I found that specifying both resulted in bulky and hard-to-read files, given the scale of annotations provided by AnnotSV. Do you think it would be worthwhile to have split and full annotations in separate files, or is it better to leave them in one?

  2. I have investigated this (and Jasmine), but in our attempt to maximise sensitivity for rare traits we’ve limited our ability to merge multiple samples effectively, as a trade-off. Currently, we provide a merged VCF for each sample with 3 genotype/sample columns, so you are able to explore edge cases where there’s no consensus between callers. Each of those sample columns represents the genotype info from one caller. This makes merging multiple individuals’ VCFs tricky. The only way around this that I see is to merge VCFs at the caller level with Jasmine, which would create a very bloated cohort VCF, depending on the size of your cohort. We were going for broad application with the first iteration of this workflow and were aware that, for data processing, users wouldn’t necessarily be running the workflow on distinct cohorts, but would rather run it progressively on individual samples and/or in batches of varying sizes and numbers of cohorts. So we decided to leave things like cohort merging and filtering to downstream work. We are discussing a downstream workflow focused on the cohort level that would handle filtering and prioritisation. Let’s chat about this, I’ll email you.

  3. Same as the point above. The workflow is focused on the sample level for the sake of sensitivity and broad application. Running Manta at the cohort level would only give you Manta variants and exclude the other callers. It is also important to note that there's a lack of standardisation among SV caller developers about VCF file formatting. That makes merging very challenging (hence the need for tools like Jasmine).

gavinmonahan commented 1 year ago

Thanks Georgie!

All very good points and I would be happy to chat about it soon 😀 I can see how having too many samples with so many callers will make the cohort VCF really large. A happy medium could be running it on a per-family basis; for example, we usually have singletons or trios. Previously, for AnnotSV, I have found the split can be confusing without the full for large (multigene) SVs, so I used 'both' to keep them together before filtering them down. I agree that the outputs are way too bloated, so having them as separate files could be useful too, or just the split annotation alone. I forked the repo last week and made some of those changes, including for AnnotSV. Although my experience with Nextflow is non-existent, it seemed to work, so let me know if you want to merge it back to main.
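One way to get separate full and split files from a single 'both' run is to partition the AnnotSV table after the fact. The sketch below is illustrative only: it assumes the output is tab-separated and has an `Annotation_mode` column holding `full` or `split` (the column name may differ between AnnotSV versions), and the demo rows are made up.

```python
import csv
import io


def split_by_annotation_mode(tsv_text, mode_column="Annotation_mode"):
    """Partition AnnotSV rows into 'full' and 'split' groups.

    Rows with any other value in mode_column are ignored.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = {"full": [], "split": []}
    for row in reader:
        mode = row[mode_column]
        if mode in rows:
            rows[mode].append(row)
    return reader.fieldnames, rows


def to_tsv(fieldnames, rows):
    """Serialise rows back to tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up example: one multigene deletion with its two split lines
demo = (
    "AnnotSV_ID\tAnnotation_mode\tGene_name\n"
    "1_100_500_DEL_1\tfull\tGENE_A;GENE_B\n"
    "1_100_500_DEL_1\tsplit\tGENE_A\n"
    "1_100_500_DEL_1\tsplit\tGENE_B\n"
)
fields, by_mode = split_by_annotation_mode(demo)
full_tsv = to_tsv(fields, by_mode["full"])
split_tsv = to_tsv(fields, by_mode["split"])
```

Because both files keep the `AnnotSV_ID` column, the split lines can still be traced back to their full line when a large multigene SV needs context.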

georgiesamaha commented 1 year ago

Hi @gavinmonahan,

Made a few changes following your feedback/suggestions:

Want to give them a go and let me know what you think? 👀

gavinmonahan commented 1 year ago

Hi @georgiesamaha,

That looks great! I haven't had a chance to run it yet with these changes but I think they are very useful changes. I'll let you know if I have any comments/issues ASAP 😊