freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License

Loci skipped with SourceSeq mapping. #65

Open rajwanir opened 6 months ago

rajwanir commented 6 months ago

A separate issue from the results in #64 is that 0.5-0.8% of markers could not be mapped from hg19 to hg38. I expect some percentage will be skipped regardless, but do you have any suggestions for reducing the number of skipped loci?

I think one of the issues might be that mappings are made to the no-ALT version (GCA_000001405.15_GRCh38_no_alt_analysis_set.fna), which is necessary but could filter out the SNPs on ALT contigs. A possible workaround could be a two-step alignment: first align to the chromosomal assembly (no_alt), then map the sequences that don't align to the primary assembly. Is this something you think might be feasible with the current version of gtc2vcf, just by piping or a parameter change?

The second option might be to use an alt-aware aligner (https://github.com/lh3/bwa/blob/master/README-alt.md).

Wanted to check if you have any thoughts or suggestions on this. Thanks.

freeseek commented 6 months ago

I would advise looking at some examples of markers that do not map to hg38. It is entirely possible that they map to sequence that is missing from hg38 but present in hg19 (for example in 7q). That is also why I advise against using liftover: some markers might have been designed for hg18, have no mapping in hg19, but have mappings in hg38, and liftover from hg19 would not recover these. The way BCFtools/gtc2vcf maps markers is by aligning both the reference and the alternate flanking sequences with BWA/mem and picking the alignment with the better mapping. If both sequences align equally well to different loci, the marker is dropped.
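
To make the selection rule above concrete, here is a minimal sketch in Python. It is not the gtc2vcf code: the `Alignment` class, the `pick_locus` function and the single integer `score` are hypothetical names introduced for illustration, and the handling of a flank that fails to align is an assumption the description above does not cover. It also assumes each alignment has already been converted to the coordinate of the marker itself, so that two alignments of the same marker can be compared directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one flanking-sequence alignment."""
    contig: str
    pos: int     # marker coordinate implied by the alignment (assumed precomputed)
    score: int   # hypothetical quality measure, higher is better


def pick_locus(ref_aln: Optional[Alignment],
               alt_aln: Optional[Alignment]) -> Optional[Alignment]:
    """Return the chosen alignment, or None if the marker should be dropped."""
    # Assumption: if only one flank aligned, use it; if neither did, drop.
    if ref_aln is None or alt_aln is None:
        return ref_aln if ref_aln is not None else alt_aln

    # Pick the alignment with the better mapping.
    if ref_aln.score > alt_aln.score:
        return ref_aln
    if alt_aln.score > ref_aln.score:
        return alt_aln

    # Equal quality: keep the marker only if both flanks agree on the locus.
    if (ref_aln.contig, ref_aln.pos) == (alt_aln.contig, alt_aln.pos):
        return ref_aln
    return None  # equally good alignments at different loci: ambiguous, drop
```

Under these assumptions, a marker whose two flanks score equally on `chr7` and on an unrelated contig would return `None`, which corresponds to a skipped locus. Listing such cases is one way to inspect the examples that fail to map.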