epi2me-labs / wf-human-variation

False positive over-segmentation of sex chromosomes during CNV workflow #95

Closed Tintest closed 6 months ago

Tintest commented 9 months ago

Operating System

Other Linux (please specify below)

Other Linux

Debian GNU/Linux 11 (bullseye)

Workflow Version

v1.7.2

Workflow Execution

Command line

EPI2ME Version

No response

CLI command run

./nextflow-23.04.2-all run wf-human-variation/main.nf \
-w wf-human-variation \
-profile singularity \
--ref REF/GRCh38.p13.primary_assembly.decoy.fa \
--bam results/sample010/sample010.bam \
--sample_name sample010 \
--bam_min_coverage 10 \
--cnv \
--str \
--tr_bed wf-human-variation/data/clinical_repeats.bed \
--bin_size 5 \
--sex male \
-resume

Workflow Execution - CLI Execution Profile

singularity

What happened?

Hello,

I'm writing because I've run the wf-human-variation CNV workflow on 3 PromethION-sequenced samples. For all 3, whatever the bin size, entire sex chromosomes end up called as CNVs. The number of segments goes up or down with the bin size, even though I have specified the sex (though I suspect it isn't used by the CNV workflow). I haven't masked the PAR regions, but the problem seems to extend well beyond them. Samples 10 and 11 are male; sample 12 is female.

Is there anything I can do to avoid this? Is this a bug, or did I miss something?

Regards.

Relevant log output

for i in */qdna*5/*.seg ; do echo $i ; grep "X\|Y" $i ; done
sample010/qdna_seq_bin5/sample010_segs.seg
sample010_segs.seg      X       10001   2770000 474     -14
sample010_segs.seg      X       3830001 3920000 18      -9.6
sample010_segs.seg      X       37085001        37100000        3       -1022
sample010_segs.seg      X       52670001        52775000        21      -5.61
sample010_segs.seg      X       52885001        52965000        16      -2.09
sample010_segs.seg      X       55450001        55510000        12      -1.23
sample010_segs.seg      X       63130001        63235000        21      -4.34
sample010_segs.seg      X       71685001        71785000        20      -5.25
sample010_segs.seg      X       72740001        72965000        45      -6.28
sample010_segs.seg      X       102200001       102325000       23      -11.05
sample010_segs.seg      X       102350001       102475000       24      -9.22
sample010_segs.seg      X       103955001       103975000       4       -5.17
sample010_segs.seg      X       103975001       104050000       15      0.43
sample010_segs.seg      X       104050001       104080000       6       -2.72
sample010_segs.seg      X       120040001       120065000       5       -3.53
sample010_segs.seg      X       120150001       120195000       8       -1.69
sample010_segs.seg      X       135215001       135240000       5       -7.68
sample010_segs.seg      X       135760001       135875000       13      -7.91
sample010_segs.seg      X       141005001       141100000       19      -2.2
sample010_segs.seg      X       141475001       141580000       20      -1.11
sample010_segs.seg      X       144120001       144175000       11      -1.33
sample010_segs.seg      X       149545001       149790000       49      -0.53
sample010_segs.seg      X       152680001       152790000       22      -0.99
sample010_segs.seg      X       153105001       153135000       6       -5.87
sample010_segs.seg      X       154150001       154280000       26      -4.65
sample010_segs.seg      X       154555001       154635000       16      -1.63
sample010_segs.seg      X       155335001       155375000       8       -4.85
sample010_segs.seg      X       155700001       156035000       66      -9.54
sample010_segs.seg      Y       2780001 3835000 210     -18
sample010_segs.seg      Y       3835001 3855000 4       -1.85
sample010_segs.seg      Y       3855001 26675000        4214    -19.17
sample011/qdna_seq_bin5/sample011_segs.seg
sample011_segs.seg      X       10001   2765000 473     -19.58
sample011_segs.seg      X       2765001 3830000 211     -1.05
sample011_segs.seg      X       3830001 3925000 19      -9.02
sample011_segs.seg      X       3925001 52175000        9493    -1.05
sample011_segs.seg      X       52175001        52195000        4       -1022
sample011_segs.seg      X       52195001        52670000        95      -1.26
sample011_segs.seg      X       52670001        52780000        22      -5.43
sample011_segs.seg      X       52780001        63125000        1097    -1.07
sample011_segs.seg      X       63125001        63235000        22      -4.81
sample011_segs.seg      X       63235001        71685000        1666    -1.06
sample011_segs.seg      X       71685001        71780000        19      -8.15
sample011_segs.seg      X       71780001        72745000        192     -1.14
sample011_segs.seg      X       72745001        72965000        44      -9.19
sample011_segs.seg      X       72970001        102200000       5778    -1.06
sample011_segs.seg      X       102200001       102470000       51      -7.33
sample011_segs.seg      X       102470001       103945000       286     -1.08
sample011_segs.seg      X       103945001       103975000       6       -5.99
sample011_segs.seg      X       104050001       104080000       6       -2.68
sample011_segs.seg      X       104080001       120040000       3116    -1.05
sample011_segs.seg      X       120040001       120070000       6       -8.92
sample011_segs.seg      X       120070001       135215000       2975    -1.05
sample011_segs.seg      X       135215001       135240000       5       -8.37
sample011_segs.seg      X       135240001       135760000       103     -1.12
sample011_segs.seg      X       135760001       135860000       12      -1022
sample011_segs.seg      X       135870001       141050000       1029    -1.05
sample011_segs.seg      X       141050001       141080000       6       -6.94
sample011_segs.seg      X       141080001       141495000       83      -1.2
sample011_segs.seg      X       141495001       141565000       14      -4.17
sample011_segs.seg      X       141570001       153110000       2288    -1.08
sample011_segs.seg      X       153110001       153130000       4       -1022
sample011_segs.seg      X       153130001       154150000       204     -0.91
sample011_segs.seg      X       154150001       154280000       26      -5.9
sample011_segs.seg      X       154280001       155700000       282     -1.22
sample011_segs.seg      X       155700001       156035000       66      -10.37
sample011_segs.seg      Y       2780001 6285000 692     -1.12
sample011_segs.seg      Y       6285001 6465000 36      -5.17
sample011_segs.seg      Y       6465001 9050000 512     -1.07
sample011_segs.seg      Y       9055001 9120000 4       1.11
sample011_segs.seg      Y       9120001 9335000 42      -0.94
sample011_segs.seg      Y       9340001 9495000 13      -4.31
sample011_segs.seg      Y       9500001 9690000 36      -1.06
sample011_segs.seg      Y       9690001 9875000 37      -1022
sample011_segs.seg      Y       9875001 10650000        67      -0.96
sample011_segs.seg      Y       11115001        13985000        518     -1.02
sample011_segs.seg      Y       13985001        14045000        11      -3.81
sample011_segs.seg      Y       14045001        16160000        421     -1.02
sample011_segs.seg      Y       16160001        16255000        19      -10.29
sample011_segs.seg      Y       16255001        16315000        12      -1.54
sample011_segs.seg      Y       16315001        16410000        19      -12.24
sample011_segs.seg      Y       16410001        17460000        210     -0.96
sample011_segs.seg      Y       17460001        18620000        228     -12.41
sample011_segs.seg      Y       18620001        18680000        12      -1.04
sample011_segs.seg      Y       18680001        18855000        35      -1022
sample011_segs.seg      Y       18855001        21365000        463     -1.1
sample011_segs.seg      Y       21500001        21555000        10      -7.93
sample011_segs.seg      Y       21555001        21715000        32      -0.99
sample011_segs.seg      Y       21715001        22190000        90      -2.44
sample011_segs.seg      Y       22190001        22380000        37      -1.09
sample011_segs.seg      Y       22380001        22715000        65      -15.9
sample011_segs.seg      Y       22715001        22745000        6       -0.74
sample011_segs.seg      Y       22745001        23125000        74      -12.69
sample011_segs.seg      Y       23125001        23180000        6       -1.55
sample011_segs.seg      Y       23180001        26295000        574     -12.21
sample011_segs.seg      Y       26295001        26675000        75      -0.76
sample012/qdna_seq_bin5/sample012_segs.seg
sample012_segs.seg      X       10001   2765000 473     -15.07
sample012_segs.seg      X       3830001 3920000 18      -8.6
sample012_segs.seg      X       37085001        37100000        3       -1022
sample012_segs.seg      X       48365001        48430000        13      -1.18
sample012_segs.seg      X       51670001        51725000        11      -1.35
sample012_segs.seg      X       52040001        52065000        5       -5.14
sample012_segs.seg      X       52670001        52770000        20      -4.31
sample012_segs.seg      X       52885001        52965000        16      -2.12
sample012_segs.seg      X       63130001        63235000        21      -4.39
sample012_segs.seg      X       71685001        71780000        19      -5.16
sample012_segs.seg      X       72740001        72965000        45      -6.67
sample012_segs.seg      X       102200001       102325000       23      -7.64
sample012_segs.seg      X       102350001       102475000       24      -10.52
sample012_segs.seg      X       104050001       104080000       6       -2.9
sample012_segs.seg      X       120035001       120070000       7       -2.64
sample012_segs.seg      X       120145001       120195000       9       -1.85
sample012_segs.seg      X       135215001       135240000       5       -8.49
sample012_segs.seg      X       135760001       135875000       13      -9.21
sample012_segs.seg      X       141005001       141100000       19      -1.38
sample012_segs.seg      X       141475001       141565000       18      -1.87
sample012_segs.seg      X       144080001       144200000       24      -1.27
sample012_segs.seg      X       152745001       152765000       4       -5.87
sample012_segs.seg      X       153105001       153135000       6       -7.36
sample012_segs.seg      X       154155001       154280000       25      -4.93
sample012_segs.seg      X       155335001       155370000       7       -5.96
sample012_segs.seg      X       155700001       156035000       66      -10.29
sample012_segs.seg      Y       2780001 3830000 209     -1022
sample012_segs.seg      Y       3830001 3850000 4       -1.42
sample012_segs.seg      Y       3850001 11080000        1265    -19.55
sample012_segs.seg      Y       11080001        11115000        7       0.59
sample012_segs.seg      Y       11115001        26675000        2943    -13.55
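
To put a number on how much of each sex chromosome these rows cover, the listing above can be summarised per sample and chromosome. This is a minimal sketch, assuming each row is whitespace-separated with six fields (sample, chromosome, start, end, bin count, segment value); `summarise_segments` is a hypothetical helper, not part of the workflow.

```python
from collections import defaultdict

def summarise_segments(lines):
    """Return {(sample, chrom): (n_segments, total_bp)} for .seg-style rows."""
    summary = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank lines and the file-name lines printed by echo
        sample, chrom, start, end, _bins, _value = fields
        try:
            start, end = int(start), int(end)
        except ValueError:
            continue  # skip a header row, if one is present
        entry = summary[(sample, chrom)]
        entry[0] += 1                 # one more segment
        entry[1] += end - start + 1   # coordinates are treated as inclusive
    return {key: tuple(value) for key, value in summary.items()}

# The three chromosome Y rows reported for sample010 above.
rows = [
    "sample010_segs.seg      Y       2780001 3835000 210     -18",
    "sample010_segs.seg      Y       3835001 3855000 4       -1.85",
    "sample010_segs.seg      Y       3855001 26675000        4214    -19.17",
]
print(summarise_segments(rows))
```

Feeding it the full output of the `grep` loop would give one entry per sample and sex chromosome, which makes it easier to compare runs with different bin sizes.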

Application activity log entry

No response

TBradley27 commented 9 months ago

I am just another user, but for what it's worth, I do not observe this over-segmentation when running the CNV module on an hg19-aligned DNA sample with 30 kb bins.

Tintest commented 9 months ago

Hello @TBradley27, thank you for your reply.

Have you done anything specific to your reference genome that might explain this difference, such as masking certain regions? As the BAM files were produced with wf-human-variation, the only difference I suspect is at the genome level, but maybe I'm forgetting something. I see the same problem with 30 kb bins.

Thanks for your help, regards.

TBradley27 commented 9 months ago

Hello @Tintest,

I was using the hs37d5 build of hg19, which uses the standard build along with some viral and decoy sequences (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz)

It may be worth aligning your reads to this sequence, or to another hg19 build, to see whether the same problem appears, especially since the hg38 bin annotations are not canonical, in the sense that they were not released by the QDNAseq authors.

Tintest commented 9 months ago

Hello @TBradley27, thank you for your reply.

I'll try with your genome build ASAP.

Thanks for your help, regards.

vlshesketh commented 9 months ago

Hi @Tintest, we aren't aware of any specific build 38 issues, but please do let us know if @TBradley27's advice to try hg19 shows the same problem. You are correct, the --sex parameter is not used by the CNV sub-workflow.
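
Since `--sex` is not used by the CNV sub-workflow, one possible post-hoc check is to compare each segment against a sex-aware expectation: relative to a diploid baseline, a single-copy X or Y in a male sample would sit near log2(1/2) = -1. The sketch below assumes the last column of the `.seg` rows is a log2 ratio, which this thread does not confirm; `expected_log2`, `deviates` and the `tolerance` default are hypothetical, and PAR regions are ignored.

```python
import math

def expected_log2(chrom, sex):
    """Expected log2 ratio against a diploid baseline (PAR regions ignored).

    Any `sex` value other than "male" is treated as female in this sketch.
    """
    chrom = chrom[3:] if chrom.startswith("chr") else chrom
    if chrom not in ("X", "Y"):
        return 0.0                 # autosomes: two copies expected
    if sex == "male":
        return math.log2(1 / 2)    # one copy of X and of Y: -1.0
    if chrom == "X":
        return 0.0                 # female X: two copies expected
    return None                    # female Y: no copies, so no finite ratio

def deviates(chrom, log2_ratio, sex, tolerance=0.3):
    """True when a segment departs from the sex-aware expectation."""
    expected = expected_log2(chrom, sex)
    if expected is None:
        return False               # nothing meaningful to compare against
    return abs(log2_ratio - expected) > tolerance

print(deviates("X", -1.05, "male"))   # a long chromosome X segment from sample011
print(deviates("X", -9.02, "male"))   # a short, strongly negative segment
```

Under that assumption, the long sample011 segments near -1.05 would be consistent with a single-copy X rather than a deletion, while the short, strongly negative segments would still stand out.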

Tintest commented 8 months ago

I've run the pipeline with the hs37d5.fa.gz genome indicated by @TBradley27, and also with a version of the GRCh38 genome with some regions masked (including the PAR regions). It hasn't changed anything; I still have the same problem. Do you have any ideas?
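
To see which reported segments touch the PAR regions at all, an interval overlap test is enough. This is a sketch only: the coordinates in `PAR_GRCH38` are the GRCh38 PAR boundaries as commonly cited and should be verified against the documentation for the assembly in use, and `overlaps_par` is a hypothetical helper.

```python
# Assumed GRCh38 PAR1/PAR2 boundaries (1-based, inclusive); verify before use.
PAR_GRCH38 = {
    "X": [(10001, 2781479), (155701383, 156030895)],
    "Y": [(10001, 2781479), (56887903, 57217415)],
}

def overlaps_par(chrom, start, end, par=PAR_GRCH38):
    """True when the inclusive interval [start, end] overlaps a PAR on chrom."""
    return any(
        start <= par_end and end >= par_start
        for par_start, par_end in par.get(chrom, [])
    )

# Segments taken from the sample010 listing in the original report.
print(overlaps_par("X", 10001, 2770000))      # first chromosome X segment
print(overlaps_par("X", 3830001, 3920000))    # a segment further along X
```

With these boundaries, the first and last chromosome X segments in each sample fall on PAR1 and PAR2, while most of the other listed segments lie outside them.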

Regards.

vlshesketh commented 7 months ago

Hi @Tintest, sorry to hear that you're still experiencing difficulties. Currently, the only parameter that can be adjusted for QDNAseq is the bin size. Unfortunately, there are no additional modifications that we can make at this moment. However, it's worth mentioning that we are actively exploring alternative CNV callers, so I encourage you to stay tuned and keep an eye on our repository for any forthcoming updates.