Open rfeng2023 opened 1 year ago
Well, other than a more technical solution on this, a general question is @hsun3163 when we "harmonize " deletion data, is there a way to harmonize it as eg
GC C
as opposed to:
G *
?
Well, other than a more technical solution on this, a general question is @hsun3163 when we "harmonize " deletion data, is there a way to harmonize it as eg
GC C
as opposed to:
G *
?
One immediate challenge for this is that, the C
for GC C
is not readily available when all we have is G *
As discussed with @rfeng2023 I think we are going to use the standard N
symbol for "any basepair". ...
As discussed with @rfeng2023 I think we are going to use the standard
N
symbol for "any basepair". ...
should we replace it in the rds_to_vcf
process or in the input file (which may need to be fixed in merged.sumstat.vcf
)?
This is also a good question. @hsun3163 did you invent the G *
convention or bcftools had it? If this is bcftools behavior then we better leave it as is. If it is @hsun3163 's choice then yes we may want to use GN N
instead of G *
because *
is not a standard symbol for sequence data.
They were not introduced by me, instead, they were inherited from the raw vcf.gz file, as indicated below:
hs3163@node96:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/output/data_preprocessing/genotype$ zcat DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.xqtl_protocol_data.add_chr.add_chr.leftnorm.filtered.vcf.gz | grep "*" | cut -f 1,2,3,4,5,6 | head
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##bcftools_filterCommand=filter -i 'GT="hom" | TYPE="snp" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= 0.15 | TYPE="indel" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= 0.2'; Date=Wed Oct 19 15:56:29 2022
chr21 9540614 chr21:9540614:G:* G * 3256.49
chr21 9550890 chr21:9550890:A:* A * 11424.9
chr21 9553296 chr21:9553296:G:* G * 3665.16
Changing the * to n for only mash output may make the future comparisons between mash output and the output of other parts of our analysis pipeline difficult.
There are some "*" (which means a deletion) in snp file will cause the error
Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0), : key 42 (char '*') not in lookup table
That may be caused by
Biostrings::DNAStringSet(nea)
, which only allow the input chr as "ACTG".