Closed usevani closed 2 years ago
Hi, and thanks for your questions.
Thank you for the response. I want to run the tool on samples that are on GRCh38 build so I created a reference panel containing ~83K markers using 1000 genomes data lifted over to GRCh38. I used angsd to generate the beagle file with the variant calls and genotype likelihoods. But when i run fastNGSAdmix it's not finding any overlapping sites between the beagle input and reference panel because it thinks the values in the reference panel are either invalid or below the MAF threshold. I am running with the default MAF threshold of 0. I am not sure what I am doing wrong. Any help with debugging will be greatly appreciated. thanks!
fastNGSAdmix output.
Setup: -seed 1638569193 -method 1
Ploidy of 2 has been chosen
The accelerated EM has been chosen
The adjusted method has been chosen
Convergence: -maxIter 2000 -tol 0.00000010
The following number of bootstraps have been chosen: 0
Input has this many sites without missing data 1664
Ref has this many sites 1693
This many sites in ref are either not-valid-number or below maf in any of the chosen pops 1693
No overlapping sites where found!!
I am attaching the beagle file and reference panel subsetted to chr22. I am also attached the file with the number of individuals per population. NA12878.1kg.chr22.angsd.beagle.gz refPanel_1kG_phase3_GRCh38.chr22.txt refPanel_1kG_phase3_GRCh38.nindv.txt
fastNGSAdmix command line.
${fastngsadmix} \
-likes ${outdir}/${sample}.angsd.beagle.gz \
-fname ${ref} \
-Nname ${nind} \
-whichPops all \
-out ${outdir}/${sample}.fastngsadmix
angsd command line to generate the beagle output.
${angsd} \
-i ${bam} \
-GL 2 \
-sites ${sites} \
-rf ${chroms} \
-doMajorMinor 4 \
-doGlf 2 \
-minMapQ 30 \
-minQ 20 \
-doDepth 1 \
-ref ${grch38_ref_fasta} \
-nThreads ${threads} \
-out ${outdir}/${sample}.angsd \
-doCounts 1
Hi, I am currently not on a linux computer and therefore I cannot unpack the .beagle.gz file. However can I get you to post the first few lines of the .beagle file and the .refPanel file respectively?
The problem might be that the way the chromosome or the alleles are coded differ between those two files.
The program matches the files based on chr, position and alleles.
Here are the first few lines from the two files. Do you mean the allele encoded as numerical in the beagle whereas string (A,T,..) in the refPanel? I went by how they are encoded in the example that are pointed to in the documentation. .beagle.gz
marker allele1 allele2 Ind0 Ind0 Ind0
chr22_16407911 3 1 0.999994 0.000006 0.000000
chr22_16416169 1 0 1.000000 0.000000 0.000000
chr22_16441882 2 0 0.999997 0.000003 0.000000
chr22_16506429 1 0 0.969648 0.030352 0.000000
chr22_16529497 1 3 0.000000 0.000008 0.999992
chr22_16554606 2 0 1.000000 0.000000 0.000000
chr22_16556218 0 1 0.999999 0.000001 0.000000
chr22_16595988 1 0 1.000000 0.000000 0.000000
chr22_16661695 0 1 1.000000 0.000000 0.000000
.refPanel
id chr pos name A0_freq A1 EUR EAS AMR AFR SAS
chr22_16407911 chr22 16407911 chr22_16407911_C_T C T 0.250497 0.34623 0.288184 0.624054 0.431493
chr22_16416169 chr22 16416169 chr22_16416169_T_C T C 0.116302 0.108135 0.0763689 0.295764 0.118609
chr22_16441882 chr22 16441882 chr22_16441882_A_G A G 0.138171 0.12996 0.15562 0.176248 0.127812
chr22_16506429 chr22 16506429 chr22_16506429_G_C G C 0.486762 0.43456 0.460294 0.349242 0.475
chr22_16529497 chr22 16529497 chr22_16529497_T_C T C 0.323062 0.404762 0.368876 0.400908 0.403885
chr22_16554606 chr22 16554606 chr22_16554606_A_G A G 0.17495 0.175595 0.102305 0.120272 0.187117
chr22_16556218 chr22 16556218 chr22_16556218_G_A G A 0.0526839 0.122024 0.0907781 0.316944 0.104294
chr22_16595988 chr22 16595988 chr22_16595988_T_C T C 0.196819 0.0585317 0.103746 0.00832073 0.125767
chr22_16661695 chr22 16661695 chr22_16661695_G_A G A 0.351889 0.356151 0.177233 0.0922844 0.507157
It is because the alleles are coded as letters in the .refPanel file but as numbers in the .beagle file (0=A, 1=C, 2=G, 3=T). I do not remember if there is a option in ANGSD to have it output alleles as letters, sorry.
For more see the ANGSD documentation about the .beagle file format: http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format
I looked through the ANGSD documentation and couldn't find the option to code bases as letters. I tried to replace the bases in my refPanel with numbers but fastNGSAdmix still failed. Anything else I could try? refPanel
id chr pos name A0_freq A1 EUR EAS AMR AFR SAS
chr1_88169 chr1 88169 chr1_88169_T_C 3 1 0.154179 0.353 0.473856 0.444976 0.265589
chr1_286747 chr1 286747 chr1_286747_G_A 2 0 0.172485 0 0.0915916 0.00987842 0.138075
chr1_596629 chr1 596629 chr1_596629_C_T 1 3 0.00102249 0.146939 0.0232558 0.188725 0.024109
chr1_597733 chr1 597733 chr1_597733_G_A 2 0 0 0.115132 0.0178042 0.214447 0.0243619
chr1_597818 chr1 597818 chr1_597818_T_C 3 1 0 0.0257271 0.0104478 0.0743243 0.0102975
chr1_611267 chr1 611267 chr1_611267_A_G 0 2 0.0470348 0.0355603 0.0392442 0.259016 0.116379
chr1_732994 chr1 732994 chr1_732994_A_G 0 2 0.00695825 0.266467 0.0432277 0 0.152352
chr1_825042 chr1 825042 chr1_825042_T_C 3 1 0.000994036 0.127976 0.0706052 0.102874 0.0327198
chr1_826940 chr1 826940 chr1_826940_T_C 3 1 0.000994036 0.128968 0.0720461 0.104387 0.0327198
Hi, you have to replace the alleles in the .beagle file. So translate the numbers there into bases/letters.
It can be done using awk: awk '{if ($2=="0"){$2="A"} else if ($2=="1"){$2="C"} else if ($2=="2"){$2="G"} else if ($2=="3"){$2="T"} print }' file.beagle > file2.beagle > awk '{if ($3=="0"){$3="A"} else if ($3=="1"){$3="C"} else if ($3=="2"){$3="G"} else if ($3=="3"){$3="T"} print }' file2.beagle > file3.beagle
Encoding the bases as string in the beagle file did not change anything. By removing the "chr" prefix from the chromosome seems to have fixed not-valid-number issue. The tool is not recognizing the sites and overlapping correctly but it thinks the beagle nucleotides are not valid. I tried using the original beagle file with both 0,1,2,3 and A,T,G,C encoding but none of them worked.
Setup: -seed 1638911831 -method 1
Ploidy of 2 has been chosen
The accelerated EM has been chosen
The adjusted method has been chosen
Convergence: -maxIter 2000 -tol 0.00000010
The following number of bootstraps have been chosen: 0
Input has this many sites without missing data 1664
Ref has this many sites 1693
This many sites in ref are either not-valid-number or below maf in any of the chosen pops 0
Overlap: of 1145 sites between input and ref
Chosen pop EAS
Chosen pop AMR
Chosen pop AFR
Chosen pop SAS
Beagle nucleotide not valid must be A,C,G,T or 0,1,2,3
Hi there, so there are several issues. The refPanel file has CRLF line break or Windows line breaks ('\r' instead of '\n'). You have to code the chr as an integer. And you might have to translate the alleles in the .beagle file (not sure about this).
But I did the following modifications of the files, and now it is working with fastNGSadmix:
awk '{if ($2=="0"){$2="A"} else if ($2=="1"){$2="C"} else if ($2=="2"){$2="G"} else if ($2=="3"){$2="T"} print }' NA12878.1kg.chr22.angsd.beagle > tmp.beagle awk '{if ($3=="0"){$3="A"} else if ($3=="1"){$3="C"} else if ($3=="2"){$3="G"} else if ($3=="3"){$3="T"} print }' tmp.beagle > NA12878.1kg.chr22.angsd.withAlleles.beagle rm tmp.beagle
sed -e 's/chr2/2/g' NA12878.1kg.chr22.angsd.withAlleles.beagle > NA12878.1kg.chr22.angsd.withAlleles.chrFixed.beagle sed -e 's/chr2/2/g' refPanel_1kG_phase3_GRCh38.chr22.txt > refPanel_1kG_phase3_GRCh38.chr22.noChr.txt
cat refPanel_1kG_phase3_GRCh38.chr22.noChr.txt | sed 's/\r$//' > refPanel_1kG_phase3_GRCh38.chr22.noChr.lineEnds.txt
Hi did this solve the issue? Can I close the issue?
Sorry, I didn't get a chance to look. Just got back from vacation and will try it this week.
Sorry about the late response, I wanted to confirm the root cause before commenting. From all the testing I did it's clear that the root cause of the issue was the CRLF ('\r') line breaks in refPanel text file. None of the other changes like encoding the bases as integer or trimming the "chr" prefix of the chromosome name was needed. I also looked through the code and can confirm that beagle base encoding is converted to "char" within the program. In case you are curious, this happened because I was using the python csv library to generate the refPanel and the default line terminator is '\r\n'. Thank you for all your help in debugging this issue.
Ok, cool thanks for the feedback, I will try and fix this!
I assume I can close this issue?
Yes. Thanks!
Hi, I have a couple of questions regarding the alleles that are reported in the reference panel and their allele frequency.