From a Slack thread by Andrew Rambaut on September 13, 2022:
Dear all,
We have seen a number of issues with calls to reference and other likely pipeline artifacts causing unexpectedly divergent genomes. I have a couple of suggestions:
1. Use NC_063383|MPXV-M5312_HM12_Rivers as the reference genome for assembly/consensus calling. Inappropriate calls for reference are then more noticeable (by using a B.1 reference these will not be).
2. For genomic epidemiology, align to reference (NC_063383) and mask out repeat regions and long homopolymeric runs. There has been interest in using variation in repeats as markers but these are often inaccurately sequenced and can induce erroneous SNPs in certain pipelines/sequencing platforms. Also phylogenetic models for repeats are not widely implemented or well developed. Also mask the 3' ITR to avoid double counting shared SNPs with the 5' ITR. We have a list of maskings that we are using here: https://github.com/aineniamh/squirrel/blob/main/squirrel/data/to_mask.csv Perhaps it would be a good idea to settle on a consistent list.
Cecret already uses NC_063383 as the default reference, but point 2 (masking) is still important.
These are the regions suggested for masking: https://github.com/aineniamh/squirrel/blob/main/squirrel/data/to_mask.csv
These are the regions Nextclade masks: https://github.com/nextstrain/monkeypox/blob/master/config/mask.bed
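As a rough illustration of what masking means in practice, here is a minimal Python sketch that replaces listed regions of a consensus sequence with N. It is not taken from squirrel, Nextclade, or Cecret. The helper names parse_bed and mask_sequence are hypothetical, and the toy coordinates are made up. It assumes the mask list is in standard BED layout (0-based start, end-exclusive) and that the sequence is already in NC_063383 coordinates, i.e. aligned to the reference with insertions removed. The column layout of to_mask.csv is not assumed here, so that file would need its own parser.

```python
def parse_bed(lines):
    """Return (start, end) pairs from BED-style lines (0-based, end-exclusive).

    Comment lines, track/browser lines, and any line whose 2nd and 3rd
    fields are not integers (e.g. a column header) are skipped.
    """
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        regions.append((int(fields[1]), int(fields[2])))
    return regions


def mask_sequence(seq, regions, mask_char="N"):
    """Return seq with every position inside a region replaced by mask_char.

    Assumes seq is in reference coordinates, so that index i in seq
    corresponds to reference position i (0-based).
    """
    chars = list(seq)
    for start, end in regions:
        # Clamp to the sequence bounds so short sequences do not raise.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    # Toy data only: real coordinates come from the mask lists linked above.
    toy_bed = ["# toy mask list", "NC_063383\t2\t5"]
    print(mask_sequence("ACGTACGTAC", parse_bed(toy_bed)))
```

Masking with N rather than deleting the positions keeps the alignment length unchanged, so downstream tree-building tools still see every sequence in the same coordinate system.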