Clinical-Genomics-Lund / nextflow_wgs


Reproducibility issue within same sample/group due to order of inputs to SVDB_merge #172

Open alkc opened 6 months ago

alkc commented 6 months ago

The order of the inputs to SVDB_merge when run in trio mode affects the annotations of merged SVs, such as the SVLEN, QUAL and ID of the merged variant. The final SVLEN depends on the first variant in the merge list, which can change which record the variant is matched against when matching against artifacts in loqusdb. This results in swings in the allele frequency score in the SV rank model when the trio is rerun.

The suggested fix is to sort the input VCFs so that they always follow a proband-mother-father order before the svdb merge inputs are constructed:

https://github.com/Clinical-Genomics-Lund/nextflow_wgs/blob/a7f1317301391710ae68310cf9a435c6a29bdf59/main.nf#L3237-L3265
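To illustrate the idea, here is a minimal sketch in Python rather than Nextflow. It assumes each VCF arrives paired with a role label; the labels `proband`, `mother` and `father`, the function name and the file names are hypothetical and not taken from main.nf.

```python
# Hypothetical role ranking; these labels are assumptions, not pipeline values.
ROLE_ORDER = {"proband": 0, "mother": 1, "father": 2}


def sort_trio_vcfs(vcfs_by_role):
    """Return VCF paths in proband, mother, father order.

    vcfs_by_role: iterable of (role, vcf_path) tuples in any order.
    Raises KeyError if a role is not one of the three expected labels.
    """
    ranked = sorted(vcfs_by_role, key=lambda pair: ROLE_ORDER[pair[0]])
    return [path for _, path in ranked]


# The same trio in two different arrival orders gives one merge order.
run_a = [("father", "F.vcf"), ("proband", "P.vcf"), ("mother", "M.vcf")]
run_b = [("mother", "M.vcf"), ("father", "F.vcf"), ("proband", "P.vcf")]
print(sort_trio_vcfs(run_a))
print(sort_trio_vcfs(run_b))
```

Because the order is derived from the role rather than from arrival order, the first variant in the merge list would always come from the proband.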

A quick but not entirely reliable fix could also be to sort the inputs by filename. Since the filenames are, in the vast majority of cases, prefixed with the individual sample id, sometimes followed by a suffix, this would solve the issue for anything started with bjorn, for instance.
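A rough sketch of the filename-based variant, again in Python rather than Nextflow. The function name and the paths are made up for illustration. Note that this only makes the order stable between reruns; it does not guarantee that the proband comes first.

```python
from pathlib import PurePosixPath


def sort_by_filename(vcf_paths):
    """Sort VCF paths by file name only, ignoring the directory part.

    Work directories typically differ between runs, so sorting on the
    full path would not be stable; the file name (assumed here to start
    with the sample id) is.
    """
    return sorted(vcf_paths, key=lambda p: PurePosixPath(p).name)


# Hypothetical paths: directory names differ, file names do not.
paths = ["/work/b2/S2.manta.vcf", "/work/a9/S1.manta.vcf", "/work/c1/S3.manta.vcf"]
print(sort_by_filename(paths))
```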

alkc commented 6 months ago

It might be worth checking whether #118 fixes this.