allele frequency in the reference panel

usevani commented 2 years ago

Hi, I have a couple of questions regarding the alleles that are reported in the reference panel and their allele frequency.

Does the allele that is reported in the A0_freq column has to be the major allele? How is major allele defined? Is this the base that is reported in the reference fasta?
The documentation says "The frequencies have to be in direction of the A0_freq allele", does this mean I am reporting the allele frequencies of the base that is in the column? Thanks!

e-jorsboe commented 2 years ago

Hi, and thanks for your questions.

No it does not have to be the major allele. And it is a bit tricky to define the major allele when you are dealing with population specific frequencies, as they will vary across populations. And one allele might be the major allele in one population, but the minor allele in another population.
Yes. That is what it means, the frequency for each population, has to be the frequency of the "A0_freq" allele.

usevani commented 2 years ago

Thank you for the response. I want to run the tool on samples that are on GRCh38 build so I created a reference panel containing ~83K markers using 1000 genomes data lifted over to GRCh38. I used angsd to generate the beagle file with the variant calls and genotype likelihoods. But when i run fastNGSAdmix it's not finding any overlapping sites between the beagle input and reference panel because it thinks the values in the reference panel are either invalid or below the MAF threshold. I am running with the default MAF threshold of 0. I am not sure what I am doing wrong. Any help with debugging will be greatly appreciated. thanks!

fastNGSAdmix output.

Setup: -seed 1638569193 -method 1
Ploidy of 2 has been chosen

The accelerated EM has been chosen
The adjusted method has been chosen
Convergence: -maxIter 2000 -tol 0.00000010
The following number of bootstraps have been chosen: 0
Input has this many sites without missing data 1664
Ref has this many sites 1693
This many sites in ref are either not-valid-number or below maf in any of the chosen pops 1693
No overlapping sites where found!!

I am attaching the beagle file and reference panel subsetted to chr22. I am also attached the file with the number of individuals per population. NA12878.1kg.chr22.angsd.beagle.gz refPanel_1kG_phase3_GRCh38.chr22.txt refPanel_1kG_phase3_GRCh38.nindv.txt

fastNGSAdmix command line.

${fastngsadmix}  \
    -likes ${outdir}/${sample}.angsd.beagle.gz \
    -fname ${ref} \
    -Nname ${nind} \
    -whichPops all \
    -out  ${outdir}/${sample}.fastngsadmix

angsd command line to generate the beagle output.

${angsd} \
    -i ${bam} \
    -GL 2 \
    -sites ${sites} \
    -rf ${chroms} \
    -doMajorMinor 4 \
    -doGlf 2 \
    -minMapQ 30 \
    -minQ 20 \
    -doDepth 1 \
    -ref ${grch38_ref_fasta} \
    -nThreads ${threads} \
    -out ${outdir}/${sample}.angsd \
    -doCounts 1

e-jorsboe commented 2 years ago

Hi, I am currently not on a linux computer and therefore I cannot unpack the .beagle.gz file. However can I get you to post the first few lines of the .beagle file and the .refPanel file respectively?

The problem might be that the way the chromosome or the alleles are coded differ between those two files.

The program matches the files based on chr, position and alleles.

usevani commented 2 years ago

Here are the first few lines from the two files. Do you mean the allele encoded as numerical in the beagle whereas string (A,T,..) in the refPanel? I went by how they are encoded in the example that are pointed to in the documentation. .beagle.gz

marker  allele1 allele2 Ind0    Ind0    Ind0
chr22_16407911  3   1   0.999994    0.000006    0.000000
chr22_16416169  1   0   1.000000    0.000000    0.000000
chr22_16441882  2   0   0.999997    0.000003    0.000000
chr22_16506429  1   0   0.969648    0.030352    0.000000
chr22_16529497  1   3   0.000000    0.000008    0.999992
chr22_16554606  2   0   1.000000    0.000000    0.000000
chr22_16556218  0   1   0.999999    0.000001    0.000000
chr22_16595988  1   0   1.000000    0.000000    0.000000
chr22_16661695  0   1   1.000000    0.000000    0.000000

.refPanel

id chr pos name A0_freq A1 EUR EAS AMR AFR SAS
chr22_16407911 chr22 16407911 chr22_16407911_C_T C T 0.250497 0.34623 0.288184 0.624054 0.431493
chr22_16416169 chr22 16416169 chr22_16416169_T_C T C 0.116302 0.108135 0.0763689 0.295764 0.118609
chr22_16441882 chr22 16441882 chr22_16441882_A_G A G 0.138171 0.12996 0.15562 0.176248 0.127812
chr22_16506429 chr22 16506429 chr22_16506429_G_C G C 0.486762 0.43456 0.460294 0.349242 0.475
chr22_16529497 chr22 16529497 chr22_16529497_T_C T C 0.323062 0.404762 0.368876 0.400908 0.403885
chr22_16554606 chr22 16554606 chr22_16554606_A_G A G 0.17495 0.175595 0.102305 0.120272 0.187117
chr22_16556218 chr22 16556218 chr22_16556218_G_A G A 0.0526839 0.122024 0.0907781 0.316944 0.104294
chr22_16595988 chr22 16595988 chr22_16595988_T_C T C 0.196819 0.0585317 0.103746 0.00832073 0.125767
chr22_16661695 chr22 16661695 chr22_16661695_G_A G A 0.351889 0.356151 0.177233 0.0922844 0.507157

e-jorsboe commented 2 years ago

It is because the alleles are coded as letters in the .refPanel file but as numbers in the .beagle file (0=A, 1=C, 2=G, 3=T). I do not remember if there is a option in ANGSD to have it output alleles as letters, sorry.

For more see the ANGSD documentation about the .beagle file format: http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods#Beagle_format

usevani commented 2 years ago

I looked through the ANGSD documentation and couldn't find the option to code bases as letters. I tried to replace the bases in my refPanel with numbers but fastNGSAdmix still failed. Anything else I could try? refPanel

id  chr pos name    A0_freq A1  EUR EAS AMR AFR SAS
chr1_88169  chr1    88169   chr1_88169_T_C  3   1   0.154179    0.353   0.473856    0.444976    0.265589
chr1_286747 chr1    286747  chr1_286747_G_A 2   0   0.172485    0   0.0915916   0.00987842  0.138075
chr1_596629 chr1    596629  chr1_596629_C_T 1   3   0.00102249  0.146939    0.0232558   0.188725    0.024109
chr1_597733 chr1    597733  chr1_597733_G_A 2   0   0   0.115132    0.0178042   0.214447    0.0243619
chr1_597818 chr1    597818  chr1_597818_T_C 3   1   0   0.0257271   0.0104478   0.0743243   0.0102975
chr1_611267 chr1    611267  chr1_611267_A_G 0   2   0.0470348   0.0355603   0.0392442   0.259016    0.116379
chr1_732994 chr1    732994  chr1_732994_A_G 0   2   0.00695825  0.266467    0.0432277   0   0.152352
chr1_825042 chr1    825042  chr1_825042_T_C 3   1   0.000994036 0.127976    0.0706052   0.102874    0.0327198
chr1_826940 chr1    826940  chr1_826940_T_C 3   1   0.000994036 0.128968    0.0720461   0.104387    0.0327198

e-jorsboe commented 2 years ago

Hi, you have to replace the alleles in the .beagle file. So translate the numbers there into bases/letters.

It can be done using awk: awk '{if ($2=="0"){$2="A"} else if ($2=="1"){$2="C"} else if ($2=="2"){$2="G"} else if ($2=="3"){$2="T"} print }' file.beagle > file2.beagle > awk '{if ($3=="0"){$3="A"} else if ($3=="1"){$3="C"} else if ($3=="2"){$3="G"} else if ($3=="3"){$3="T"} print }' file2.beagle > file3.beagle

usevani commented 2 years ago

Encoding the bases as string in the beagle file did not change anything. By removing the "chr" prefix from the chromosome seems to have fixed not-valid-number issue. The tool is not recognizing the sites and overlapping correctly but it thinks the beagle nucleotides are not valid. I tried using the original beagle file with both 0,1,2,3 and A,T,G,C encoding but none of them worked.

Setup: -seed 1638911831 -method 1
Ploidy of 2 has been chosen

The accelerated EM has been chosen
The adjusted method has been chosen
Convergence: -maxIter 2000 -tol 0.00000010
The following number of bootstraps have been chosen: 0
Input has this many sites without missing data 1664
Ref has this many sites 1693
This many sites in ref are either not-valid-number or below maf in any of the chosen pops 0
Overlap: of 1145 sites between input and ref
Chosen pop EAS
Chosen pop AMR
Chosen pop AFR
Chosen pop SAS
Beagle nucleotide not valid must be A,C,G,T or 0,1,2,3

e-jorsboe commented 2 years ago

Hi there, so there are several issues. The refPanel file has CRLF line break or Windows line breaks ('\r' instead of '\n'). You have to code the chr as an integer. And you might have to translate the alleles in the .beagle file (not sure about this).

But I did the following modifications of the files, and now it is working with fastNGSadmix:

awk '{if ($2=="0"){$2="A"} else if ($2=="1"){$2="C"} else if ($2=="2"){$2="G"} else if ($2=="3"){$2="T"} print }' NA12878.1kg.chr22.angsd.beagle > tmp.beagle awk '{if ($3=="0"){$3="A"} else if ($3=="1"){$3="C"} else if ($3=="2"){$3="G"} else if ($3=="3"){$3="T"} print }' tmp.beagle > NA12878.1kg.chr22.angsd.withAlleles.beagle rm tmp.beagle

sed -e 's/chr2/2/g' NA12878.1kg.chr22.angsd.withAlleles.beagle > NA12878.1kg.chr22.angsd.withAlleles.chrFixed.beagle sed -e 's/chr2/2/g' refPanel_1kG_phase3_GRCh38.chr22.txt > refPanel_1kG_phase3_GRCh38.chr22.noChr.txt

cat refPanel_1kG_phase3_GRCh38.chr22.noChr.txt | sed 's/\r$//' > refPanel_1kG_phase3_GRCh38.chr22.noChr.lineEnds.txt

e-jorsboe commented 2 years ago

Hi did this solve the issue? Can I close the issue?

usevani commented 2 years ago

Sorry, I didn't get a chance to look. Just got back from vacation and will try it this week.

usevani commented 2 years ago

Sorry about the late response, I wanted to confirm the root cause before commenting. From all the testing I did it's clear that the root cause of the issue was the CRLF ('\r') line breaks in refPanel text file. None of the other changes like encoding the bases as integer or trimming the "chr" prefix of the chromosome name was needed. I also looked through the code and can confirm that beagle base encoding is converted to "char" within the program. In case you are curious, this happened because I was using the python csv library to generate the refPanel and the default line terminator is '\r\n'. Thank you for all your help in debugging this issue.

e-jorsboe commented 2 years ago

Ok, cool thanks for the feedback, I will try and fix this!

I assume I can close this issue?

usevani commented 2 years ago

Yes. Thanks!

e-jorsboe / fastNGSadmix

allele frequency in the reference panel #7