fgvieira / ngsF-HMM

Estimation of per-individual inbreeding tracts under a probabilistic framework
GNU General Public License v3.0
13 stars 6 forks source link

pos file from beagle/mafs #6

Open adeflamingh opened 2 years ago

adeflamingh commented 2 years ago

Hello, I'm trying to run ngs-HMM using GL files that were previously generated with ANGSD (this was the ANGSD command):

angsd -GL 1 -out gl_55 -nThreads 10 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-2 -minInd 28 -b bam.filelist -ref refXX.fasta

I have a beagle.gz and mafs.gz file, and a total number of 20563944 sites.

I saved the first two columns from the mafs file and removed the header to create the .pos file needed (cat X.mafs.gz | awk '{print $1,$3}' > X.pos, then tail -n +1 X.pos > X.nohead.pos)

I checked that the beagle and the no.head.pos files have 20563944 respectively using zcat X.beagle.gz | wc -l (this was 20563945, because of the header). and cat X.nohead.pos | wc -l)

Then I tried to use ngsF-HMM to calculate stats (assigning 260G memory and 8 threads), using:

ngsF-HMM --n_ind 55 --n_sites 20563944 --geno ../gl_55.beagle.gz --lkl --pos ../gl_55.nohead.pos --min_epsilon 1e-7 --out output/file --n_threads 8

I get following error:

> Header found! Skipping line...
==> Input Arguments:
        geno: ../gl_55.beagle.gz
        pos: ../gl_55.nohead.pos
        lkl: true
        loglkl: false
        n_ind: 55
        n_sites: 20563944
        call_geno: false
        freq: r
        freq_est: 1
        e_prob: 1
        indF: 0.01-0.001
        indF_fixed: false
        alpha_fixed: false
        out: output/file
        log: 0
        log_bin: false
        min_iters: 10
        max_iters: 100
        min_epsilon: 0.0000001000
        n_threads: 8
        verbose: 1
        seed: 383
        version: 1.1.0 (Mar 14 2022 @ 11:09:52)

==> GZIP input file (not BINARY)
==> Reading data
> Sites coordinates
> GENO data
==> Setting initial inbreeding values to: 0.01-0.001
==> Using random initial frequency values.
==> Calculating initial emission probabilities

Iteration 1:
==> Forward Recursion
==> Backward Recursion
Ind 0: -20723775.856501504778862        -20723775.854976736009121 (0.001524768769741)

=====
ERROR: [iter_EM] Fw and Bw lkl do not match!
=====

        : Numerical result out of range

Any recommendations on what I might be doing wrong? Thank you very much in advance!

fgvieira commented 2 years ago

I'd try running it again with other seeds and, if that does not work, set --min_epsilon 1e-9.

adeflamingh commented 2 years ago

Thank you very much for your suggestions! I tried a different seed, and also tried --min_epsilon 1e-9 but I get the same error. Any other fixes?

What I used:

"ngsF-HMM --n_ind 55 --n_sites 20563944 --geno ../gl_55.beagle.gz --lkl --pos ../gl_55.nohead.pos --min_epsilon 1e-9 --out output/file --n_threads 8 --seed 8376"

What I get:

Header found! Skipping line... ==> Input Arguments: geno: ../gl_55.beagle.gz pos: ../gl_55.nohead.pos lkl: true loglkl: false n_ind: 55 n_sites: 20563944 call_geno: false freq: r freq_est: 1 e_prob: 1 indF: 0.01-0.001 indF_fixed: false alpha_fixed: false out: output/file log: 0 log_bin: false min_iters: 10 max_iters: 100 min_epsilon: 0.0000000010 n_threads: 8 verbose: 1 seed: 8376 version: 1.1.0 (Mar 14 2022 @ 11:09:52)

==> GZIP input file (not BINARY) ==> Reading data

Sites coordinates GENO data ==> Setting initial inbreeding values to: 0.01-0.001 ==> Using random initial frequency values. ==> Calculating initial emission probabilities

Iteration 1: ==> Forward Recursion ==> Backward Recursion Ind 0: -20723821.353790059685707 -20723821.352281708270311 (0.001508351415396)

===== ERROR: [iter_EM] Fw and Bw lkl do not match!

    : Numerical result out of range
fgvieira commented 2 years ago

Then it looks like a precision issues due to large datasets, which seems to be the case since you have over 20M snps.

You can try to run on just a single chromosome, to check if that is the case.

Alternatively, you can try a "gentle" LD prunning (e.g. using ngsLD) just to remove sites that are in highly linked.

Let me know if it helped.