UPHL-BioNGS / Cecret

Reference-based consensus creation
MIT License

Allow masking of repeat regions #118

Closed erinyoung closed 6 months ago

erinyoung commented 1 year ago

From a Slack thread by Andrew Rambaut on September 13, 2022:

Dear all, We have seen a number of issues with calls to reference and other likely pipeline causing unexpectedly divergent genomes. I have a couple of suggestions:

  1. Use NC_063383|MPXV-M5312_HM12_Rivers as the reference genome for assembly/consensus calling. Inappropriate calls for reference are then more noticeable (by using a B.1 reference these will not be).
  2. For genomic epidemiology, align to reference (NC_063383) and mask out repeat regions and long homopolymeric runs. There has been interest in using variation in repeats as markers but these are often inaccurately sequenced and can induce erroneous SNPs in certain pipelines/sequencing platforms. Also phylogenetic models for repeats are not widely implemented or well developed. Also mask the 3' ITR to avoid double counting shared SNPs with the 5' ITR. We have a list of maskings that we are using here: https://github.com/aineniamh/squirrel/blob/main/squirrel/data/to_mask.csv Perhaps it would be a good idea to settle on a consistent list.

Cecret already uses NC_063383 as the default reference, but point 2 is important.

These are the regions suggested for masking: https://github.com/aineniamh/squirrel/blob/main/squirrel/data/to_mask.csv

These are the regions Nextclade masks: https://github.com/nextstrain/monkeypox/blob/master/config/mask.bed
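For discussion, here is a rough sketch of what the masking step could look like. It is not Cecret's implementation. It assumes the mask list is in standard BED layout (chromosome, 0-based start, exclusive end) and that the consensus is already in reference (NC_063383) coordinates, i.e. the same length as the reference with no unaligned insertions. The function names `parse_bed` and `mask_sequence` and the toy coordinates are made up for illustration; they are not taken from either list above.

```python
def parse_bed(text):
    """Parse BED text into (chrom, start, end) tuples (0-based, half-open)."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def mask_sequence(seq, regions, chrom, mask_char="N"):
    """Replace every base of seq that falls inside a region on chrom with mask_char."""
    bases = list(seq)
    for region_chrom, start, end in regions:
        if region_chrom != chrom:
            continue
        # Clamp to the sequence bounds so an over-long interval cannot raise.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = mask_char
    return "".join(bases)


# Toy example with invented coordinates (NOT the real MPXV mask regions).
toy_bed = "ref\t2\t5\nref\t9\t12\n"
toy_consensus = "ACGTACGTACGT"
masked = mask_sequence(toy_consensus, parse_bed(toy_bed), chrom="ref")
print(masked)
```

Masking with N after consensus calling keeps the sequence length unchanged, so downstream alignments stay in reference coordinates. The squirrel CSV would need its own parser, since its column layout and coordinate convention are not assumed here.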