PacificBiosciences / pb-human-wgs-workflow-wdl

BSD 3-Clause Clear License

Implement kmer consistency #61

Closed vsmalladi closed 2 years ago

vsmalladi commented 2 years ago

The kmer consistency check was never implemented in the WDL workflow, so none of the modimer outputs are currently used.

For reference: https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake/blob/main/rules/sample_kmer_consistency.smk

This is a pairwise comparison of the modimers.tsv files from each movie, after subtracting the reference modimers.tsv. Essentially:

movie1 modimers - reference modimers -> non-reference movie1 modimers

movie2 modimers - reference modimers -> non-reference movie2 modimers

count(subtract(non-reference movie1 modimers, non-reference movie2 modimers)) + count(subtract(non-reference movie2 modimers, non-reference movie1 modimers)) -> count of unique modimers from pairwise comparison

We report a metric representing the proportion of total non-reference modimers that are unique to one movie. If this proportion is above some threshold, it's likely that the two movies (SMRT Cells) were loaded with different samples.
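To make the arithmetic concrete, here is a minimal Python sketch of the pairwise comparison described above. It is not the code from the Snakemake rule. The function names are hypothetical, modimers are treated as plain hashable values rather than parsed from `modimers.tsv`, and the denominator ("total non-reference modimers") is assumed to be the union of the two non-reference sets, which the description above does not spell out.

```python
def non_reference(movie_modimers, reference_modimers):
    """Return the modimers seen in a movie that are absent from the reference."""
    return set(movie_modimers) - set(reference_modimers)


def unique_modimer_proportion(movie1, movie2, reference):
    """Proportion of non-reference modimers that appear in only one movie.

    Assumption: the denominator is the union of both non-reference sets.
    """
    nonref1 = non_reference(movie1, reference)
    nonref2 = non_reference(movie2, reference)

    # count(subtract(a, b)) + count(subtract(b, a))
    unique_count = len(nonref1 - nonref2) + len(nonref2 - nonref1)

    total = len(nonref1 | nonref2)
    if total == 0:
        # No non-reference modimers in either movie; nothing to compare.
        return 0.0
    return unique_count / total


# Toy example with made-up modimer labels.
reference = {"A", "B"}
movie1 = {"A", "B", "C", "D"}   # non-reference: C, D
movie2 = {"A", "C", "E"}        # non-reference: C, E
proportion = unique_modimer_proportion(movie1, movie2, reference)
```

In the toy example, D and E are each unique to one movie and C is shared, so the proportion is 2 out of 3. A real check would compare this value against a threshold chosen for the data; no particular threshold is implied here.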