fgvieira / ngsF-HMM

Estimation of per-individual inbreeding tracts under a probabilistic framework
GNU General Public License v3.0
13 stars 6 forks source link

ERROR: [__FUNCTION__] invalid allele frequencies in 1st iteration of ngsF-HMM #7

Open wul88 opened 1 year ago

wul88 commented 1 year ago

Hello, fgvieira

I tried to run ngsF-HMM to find IBD of samples from different sub-populations. I used the Beagle GL files computed previously as below:

angsd -bam bam_All118.list -ref ${refSeq} -out ${outDir}/${outMaf} \
    -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -C 50 -baq 1 \
    -minMapQ 30 -minQ 30 -setMinDepth 106 -setMaxDepth 1954 -minmaf 0.05 -doCounts 1 \
    -GL 2 -doMaf 1 -skipTriallelic 1 -doGlf 2 -SNP_pval 1e-6 -doMajorMinor 1 -P 8

I have .beagle.gz and .mafs.gz files. I extracted first two columns from .mafs.gz file and removed the header, then saved as pos file. The nsites was computed by "cat $pos | wc -l " . The command to run ngsF-HMM is as follows:

ngsF-HMM --geno ${glf2} --lkl --pos ${pos} --n_ind $NSAMS --n_sites $NSITES --freq 0.1 --freq_est 2 \
    --indF inF_transitPar_118AveAuto.txt --out ${outDir}/${out} --n_threads 8 --seed ${thisSeed} \
    --min_epsilon 1e-9 --verbose 2 --log 1

WHERE glf2 is the Beagle format glf from angsd run shown at the beginning, and inF_transitPar_118AveAuto.txt is a two-column file: 1st column is the inbreeding coefficient from running ngsF and the second column is 0.1 as transition parameter. one row per individual. ( I also used 0.1, 0.1 as shown in tutorial in different run, received the same error.) AND thisSeed=$RANDOM glf2="all118_MafGlf2_allChrs.beagle.gz"

maf="all118_MafGlf2_allChrs.mafs.gz"

pos="all118_MafGlf2_allChrs_noHeader.pos" NSAMS=118 NSITES=$((cat ${pos} | wc -l))

outDir=ngsFHMM/"rep_"$thisRep out="all118depthSetinFHMM"$thisRep ( thisRep is replicate number. I ran 10 replicates but every one had the same error.)

--- one of the .err files ---->

> Header found! Skipping line...

=====
ERROR: [__FUNCTION__] invalid allele frequencies
=====

    : Numerical argument out of domain

-----  one of the .out file ----->
==> Input Arguments:
    geno: all118_MafGlf2_allChrs.beagle.gz
    pos: all118_MafGlf2_allChrs_noHeader.pos
    lkl: true
    loglkl: false
    n_ind: 118
    n_sites: 2205512
    call_geno: false
    freq: 0.1
    freq_est: 2
    e_prob: 1
    indF: inF_transitPar_118AveAuto.txt
    indF_fixed: false
    alpha_fixed: false
    out: ngsFHMM/rep_2/all118depthSet_inFHMM_2
    log: 1
    log_bin: false
    min_iters: 10
    max_iters: 100
    min_epsilon: 0.0000000010
    n_threads: 8
    verbose: 2
    seed: 18950
    version: 1.1.0 (Jan  4 2023 @ 20:27:07)

==> GZIP input file (not BINARY)
==> Reading data
> Sites coordinates
> GENO data
==> Reading initial inbreeding values from file "inF_transitPar_118AveAuto.txt".
==> Setting initial frequency values to: 0.1
==> Calculating initial emission probabilities
==> Printing current iteration parameters

Iteration 1:
==> Forward Recursion
==> Backward Recursion
==> Marginal probabilities
==> Update inbreeding and transition parameter
==> Estimating allele frequencies and calculating emission probabilities

Could you please give me some advice as what I did wrong? Thank you very much in advance!

fgvieira commented 1 year ago

Could you try running with --freq_est 1?

wul88 commented 1 year ago

Hi, fgvieira,

Thank you for the quick response. The option of "--freq_est 1" did run. It took over 3 days to finish. I am checking the inbreeding coefficient data now. Meanwhile I have some questions on these runs as welll as using ngsF-HMMplot.R.

  1. This time I only ran 3 replicates that used random seeds, but all three replicate runs generated the exact same results (3 output files each run. I used "diff" command to compare the output of different replicate run) Is this a suspicious behavior that indicates something not right?

Here is the command I used:

ngsF-HMM --geno ${glf2} --lkl --pos ${pos} --n_ind $NSAMS --n_sites $NSITES --freq 0.1 --freq_est 1 \
    --indF 0.1,0.1 --out ${outDir}/${out} --n_threads 8 --seed ${thisSeed} \
    --min_epsilon 1e-9 --verbose 2 --log 1

where:

thisSeed=$RANDOM
glf2="all118_MafGlf2_allChrs.beagle.gz"
pos="all118_MafGlf2_allChrs_noHeader.pos"
NSAMS=118
outDir=ngsFHMM/"rep_"$thisRep
thisRep=$rep     # ($rep -  number of replicate)
  1. I noticed that in the ngsF-HMM/examples/test.sh running ngsF-HMMplot.R as follows:
    Rscript ../scripts/ngsF-HMMplot.R --in_file testF-HMM.$ID.$TYPE.ibd --n_ind $N_IND --n_sites $N_SITES --geno testF-HMM.SIM.geno.gz --path testF-HMM.SIM.path.gz --pos testF-HMM.SIM.pos.gz --marg_prob --out testF-HMM.$ID.$TYPE.pdf

    I don't have a .path.gz file in my output. What is that file? How can I produce one? Thank you in advance for your help!

fgvieira commented 1 year ago

Hi @wul88 and sorry for the late reply.

  1. the random seed only has an effect with random starting values
  2. the --path option just adds another track to the plots. I used it to show the "truth", but can also be used to plot some regions of interest. In your case, you can omit it.
wul88 commented 1 year ago

Hi, fgvieira, Thank you for your help! The plot script worked. I have over 100 samples, loading pdf of 15 pages becomes time consuming. Any suggestion on break-up result?

As running ngsF-HMM, I have following concerns. Please advise me.

  1. I have ran with --indF "0.1,0.1" or "r" twice. The results were the same. It seemed that initial values of indF does not have much effect on the final results. I checked the results from all replicate runs, they are the same. The random seeds also has no effect on the algorithm. Is this normal? Or I did something wrong? Here is the command I used:

    nSite=$((`zcat $maf | tail -n+2 | wc -l`))
    ngsF-HMM --geno ${glf2} --lkl --pos ${pos} --n_ind ${nInd} --n_sites ${nSite} --freq e --freq_est 1 \
    --indF r --out ${outDir}/${out} --n_threads 8 --seed ${thisSeed} \
    --min_epsilon 1e-9 --verbose 2 --min_iters 20 --max_iters 400
  2. I noticed that the inbreeding coefficients in the output .indF file, were quite different from those from ngsF. In addition, the transition parameters for individuals in .indF file were all "10". What does that parameter mean? Is "10" reasonable?

  3. with --log 1 did not produce .log file.

fgvieira commented 1 year ago
  1. Can you try using --indF "0.1-0.1"
  2. . The estimates could be a bit different, but it can also be a convergence issue. Have you tried running ngsF-HMM with the estimates from ngsF as starting values? The transition parameter represents how often the state changes between IBD and notIBD; high numbers reflect short IBD fragments. What values of F are you getting from ngsF and ngsF-HMM?
  3. This option can be a bit misleading, but it means that it saves the output every X iterations, but it overwrites the file (it is mostly for debug purposes).