* in snp is not allowed in MASH pipeline `rds_to_vcf`

rfeng2023 commented 1 year ago

There are some "*" (which means a deletion) in snp file will cause the error

Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0), : key 42 (char '*') not in lookup table

That may be caused by Biostrings::DNAStringSet(nea), which only allow the input chr as "ACTG".

gaow commented 1 year ago

Well, other than a more technical solution on this, a general question is @hsun3163 when we "harmonize " deletion data, is there a way to harmonize it as eg

GC C

as opposed to:

G *

?

hsun3163 commented 1 year ago

Well, other than a more technical solution on this, a general question is @hsun3163 when we "harmonize " deletion data, is there a way to harmonize it as eg
GC C
as opposed to:
G *
?

One immediate challenge for this is that, the C for GC C is not readily available when all we have is G *

gaow commented 1 year ago

As discussed with @rfeng2023 I think we are going to use the standard N symbol for "any basepair". ...

rfeng2023 commented 1 year ago

As discussed with @rfeng2023 I think we are going to use the standard N symbol for "any basepair". ...

should we replace it in the rds_to_vcf process or in the input file (which may need to be fixed in merged.sumstat.vcf)?

gaow commented 1 year ago

This is also a good question. @hsun3163 did you invent the G * convention or bcftools had it? If this is bcftools behavior then we better leave it as is. If it is @hsun3163 's choice then yes we may want to use GN N instead of G * because * is not a standard symbol for sequence data.

hsun3163 commented 1 year ago

They were not introduced by me, instead, they were inherited from the raw vcf.gz file, as indicated below:

hs3163@node96:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/output/data_preprocessing/genotype$ zcat DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.xqtl_protocol_data.add_chr.add_chr.leftnorm.filtered.vcf.gz | grep "*"  | cut -f 1,2,3,4,5,6 | head
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##bcftools_filterCommand=filter -i 'GT="hom" | TYPE="snp" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= 0.15 | TYPE="indel" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= 0.2'; Date=Wed Oct 19 15:56:29 2022
chr21   9540614 chr21:9540614:G:*       G       *       3256.49
chr21   9550890 chr21:9550890:A:*       A       *       11424.9
chr21   9553296 chr21:9553296:G:*       G       *       3665.16

Changing the * to n for only mash output may make the future comparisons between mash output and the output of other parts of our analysis pipeline difficult.

cumc / xqtl-protocol

* in snp is not allowed in MASH pipeline `rds_to_vcf` #551