Pipeline adaptation to improve results for African genomes

melnel000 commented 2 years ago

Dear ClinSV team

I would like to know if it is possible to integrate custom control data into the ClinSV pipeline. I work with Southern African population groups which are completely absent from 1000Genomes and gnomAD. I have yet to try out the ClinSV pipeline but I suspect that I am going to pick up a lot of uncatalogued structural variation which may erroneously be flagged as rare when it may just be benign variation. I do have access to a large control dataset of 700 Southern African genomes and it would be great if there was some way to use the information from this dataset for filtering.

Thanks, Melissa

J-Bradlee commented 2 years ago

I'm not exactly sure how to adapt ClinSV to incorporate another control dataset. However, I know ClinSV is currently hard-coded to use the reference genomes b37 and b38. I imagine incorporating another dataset would require changing some of the hard-coded parameters on lines 145 to 219 of the main clinsv Perl script. Perhaps @drmjc can provide some insight if he's around.

melnel000 commented 2 years ago

Thanks for the feedback. I managed to get the pipeline to run successfully on my first sample. I may be able to try out your suggestion to add an appropriate African control dataset.

For the mapping distance distribution on the QC report, all the values are flagged as out of range (!!!)

Concordant size min (green): 117 [ 44, 4.9, z 15] !!!
Concordant size max (green): 583 [ 943, 64, z -5.6] !!!
Mean mapping distance size (blue): 306 [ 444, 26, z -5.3] !!!
Stdev mapping distance size: 65 [ 110, 7.4, z -6.1] !!!

Is this due to differences in read length between my sample (150bp) and the control samples used in the pipeline?

drmjc commented 2 years ago

Hi, thanks for your interest in ClinSV. Adding another population annotation database would be quite difficult, but not impossible. I'd start by chasing down all the code that refers to 'MGRB'^, which could then be swapped for your dataset. It'd need some scripts to create the supporting files in the right format. IIRC these annotation files capture the coverage stdev, CNV calls, and locations of split and spanning reads.

^ MGRB is the population db of healthy older Australians that we used to fine-tune ClinSV.

Re the mapping size warnings: the control data (again MGRB) was sequenced using 150bp PE WGS on HiSeq X10 using the NanoSeq HT library prep. It certainly sounds like the library prep and/or sequencing platform you've used are quite different from those used for the control data. This can be pretty important for SV detection, as the tools are quite sensitive to insert sizes.
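As a quick sanity check on those warnings, here is a minimal Python sketch that parses the QC lines quoted earlier in the thread and recomputes the z-scores. It assumes each bracket holds the control mean, the control standard deviation and the z-score, in that order. That layout is inferred from the numbers rather than from ClinSV documentation, and the names `QC_ENTRY` and `recompute_z` are made up for illustration.

```python
import re

# Assumed layout of one QC entry (inferred, not documented):
#   "<label>: <observed> [ <control mean>, <control sd>, z <z-score>]"
QC_ENTRY = re.compile(
    r"(?P<label>[^:!]+):\s*(?P<obs>-?[\d.]+)\s*"
    r"\[\s*(?P<mean>-?[\d.]+),\s*(?P<sd>-?[\d.]+),\s*z\s*(?P<z>-?[\d.]+)\s*\]"
)


def recompute_z(qc_text):
    """Return (label, reported z, recomputed z) for each QC entry found."""
    rows = []
    for m in QC_ENTRY.finditer(qc_text):
        obs, mean, sd = (float(m.group(k)) for k in ("obs", "mean", "sd"))
        # Standard z-score: distance from the control mean in units of control sd.
        rows.append((m.group("label").strip(), float(m.group("z")), (obs - mean) / sd))
    return rows


qc_text = (
    "Concordant size min (green): 117 [ 44, 4.9, z 15] !!! "
    "Concordant size max (green): 583 [ 943, 64, z -5.6] !!! "
    "Mean mapping distance size (blue): 306 [ 444, 26, z -5.3] !!! "
    "Stdev mapping distance size: 65 [ 110, 7.4, z -6.1] !!!"
)

for label, reported, recomputed in recompute_z(qc_text):
    print(f"{label}: reported z = {reported}, recomputed z = {recomputed:.1f}")
```

If the recomputed values agree with the reported ones to within rounding, the assumed layout holds, and each "!!!" marks a sample value that is several control standard deviations from the control mean.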

700 genomes is certainly large enough to consider creating a custom database, I'm just not sure how easy this would be. @MinocheAE, are you able to comment on how easy or hard this may be?

MinocheAE commented 2 years ago

Hi Melissa,

For ClinSV, three different types of allele frequency were derived from the MGRB data.

One was based on the sum of DP and SR counts (referred to as PAFSU), another was based on the normalized read depth (referred to as PAFDRA), and the third was based on the actual Pass and High confidence variants (referred to as PAFV).

I would say it could take several months for a single bioinformatician to create a population allele reference annotation analogous to what we did for MGRB. All the annotation we derived was a product of running ClinSV and subsequently merging the raw evidence files and final variant calls.

Also search for PAFSU, PAFDRA and PAFV in the Perl code.
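One way to carry out that search is a small Python sketch like the one below, which walks a checkout of the code and lists every line mentioning MGRB, PAFSU, PAFDRA or PAFV. The function name `find_references` and the list of file extensions are assumptions for illustration, so adjust them to whatever the repository actually contains.

```python
import os
import re

# Identifiers named in this thread; matched case-sensitively as substrings.
PATTERN = re.compile(r"MGRB|PAFSU|PAFDRA|PAFV")


def find_references(root, extensions=(".pl", ".pm", ".sh", ".R", ".py")):
    """Yield (path, line number, line) for each line mentioning an identifier.

    The default extension list is a guess, not taken from the ClinSV repository.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on files with odd encodings.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if PATTERN.search(line):
                        yield path, lineno, line.rstrip("\n")
```

Calling `list(find_references("path/to/checkout"))`, where the path is a placeholder for your local copy, returns the hits as tuples that can be sorted or grouped by file to see how widely each annotation is used.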

Instead, you might want to summarise and annotate the variants yourself and use ClinSV as a reference or source of inspiration.

Merging structural variant calls remains a tricky task, even when using available software. The results need to be carefully checked, as different variant callers and sequencing library properties (read length and insert size) can lead to unexpected behaviour.
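To give an idea of what annotating the variants yourself could look like, here is a minimal Python sketch that computes the fraction of control samples carrying a call that matches a query SV. It uses a 50% reciprocal-overlap rule, which is a common convention and not how ClinSV derives PAFSU, PAFDRA or PAFV. The `SV` class, the function names, the threshold and the toy calls are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    sample: str
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL" or "DUP"


def reciprocal_overlap(a, b):
    """Smaller of the two fractions of each interval covered by the shared segment."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def carrier_frequency(query, control_calls, n_controls, min_overlap=0.5):
    """Fraction of control samples with a same-type call matching the query.

    Each sample is counted once, however many of its calls match.
    """
    carriers = {
        c.sample
        for c in control_calls
        if c.svtype == query.svtype and reciprocal_overlap(query, c) >= min_overlap
    }
    return len(carriers) / n_controls


# Toy data, invented purely to exercise the functions.
query = SV("proband", "1", 1000, 2000, "DEL")
controls = [
    SV("s1", "1", 1100, 2100, "DEL"),  # 90% reciprocal overlap: counts
    SV("s1", "1", 1000, 2000, "DEL"),  # same sample again: not counted twice
    SV("s2", "1", 1900, 5000, "DEL"),  # too little overlap: ignored
    SV("s3", "1", 1000, 2000, "DUP"),  # different type: ignored
]
print(carrier_frequency(query, controls, n_controls=4))
```

This is a carrier frequency rather than an allele frequency, since it ignores genotype. Reciprocal overlap also only suits interval-type calls such as deletions and duplications, and the linear scan over all control calls would need an interval index to scale to 700 genomes.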

KCCG / ClinSV

Pipeline adaptation to improve results for African genomes #30