ANGSD-wrapper / angsd-wrapper

Utilities for analyzing next generation sequencing data.
MIT License
14 stars 4 forks source link

[Problem opening file: '-fold'] --> angst-wrapper SFS ./Site_Frequency_Spectrum_Config in online tutorial #11

Open squisquater opened 2 years ago

squisquater commented 2 years ago

I followed the installation instructions and the tutorial instructions but am running into an error when I try to run the Site Frequency Spectrum.

angsd-wrapper SFS ./Site_Frequency_Spectrum_Config

This is my output and it appears it's failing when trying to find the file needed to fold (or not fold) the spectrum, but it can't.

WRAPPER: Zipping advanced arguments onto basic ones

        -> angsd version: 0.911-44-g1c0ebb6 (htslib: 1.3.1-30-gbb03b02) build(Oct 31 2021 11:04:52)
        -> Reading fasta: /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
        -> Reading fasta: /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
        -> (Using Filipe G Vieira modification of: abcSaf.cpp)
        -> Parsing 11 number of samples
        -> Region lookup 1/1

        -> We have now allocated approximately 10 Megabytes of raw nodes to the nodepool
        -> Printing at chr: 10 pos:17551496 chunknumber 1100
        -> We have now allocated approximately 20 Megabytes of raw nodes to the nodepool
        -> Printing at chr: 10 pos:19386992 chunknumber 2000 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=11
        -> Printing at chr: 10 pos:22395913 chunknumber 3200 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
        -> Printing at chr: 10 pos:24004662 chunknumber 3600 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=11
        -> Printing at chr: 10 pos:24908040 chunknumber 4000
        -> Done reading data waiting for calculations to finish
        -> Done waiting for threads

        -> npools:26 unfreed tnodes before clean:0
        -> Output filenames:
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.arg"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.mafs.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.geno.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.pos.gz"
                ->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.idx"
        -> Sun Oct 31 12:08:56 2021
        -> Arguments and parameters for all analysis are located in .arg file
        [ALL done] cpu-time used =  199.08 sec
        [ALL done] walltime used =  130.00 sec
        -> Version of fname:/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.idx is:2
        -> Assuming .saf.gz file: /mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.gz
        -> Assuming .saf.pos.gz: /mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.pos.gz
        -> Problem opening file: '-fold'

Looking at the wrapper shell script (Site_Frequency_Spectrum.sh) it appears that is failing in the final section of the script in the middle of a series of pipes to the final file which does get output in my scratch directory, it's just empty.

#!/usr/bin/env bash

set -e
set -o pipefail

#   Load variables from supplied config file
source "$1"

#   Are we using Common_Config? If so, source it
if [[ -f "${COMMON}" ]]
then
    source "${COMMON}"
fi

#   Where is angsd-wrapper located?
SOURCE=$2

#   Where is ANGSD?
ANGSD_DIR=${SOURCE}/dependencies/angsd

#   Variables created from transforming other variables
#       The number of individuals in the taxon we are analyzing
N_IND=$(wc -l < "${SAMPLE_LIST}")
#       How many inbreeding coefficients are supplied?
N_F=$(wc -l < "${SAMPLE_INBREEDING}")
#       For ANGSD, the actual sample size is twice the number of individuals, since each individual has two chromosomes.
#       The individual inbreeding coefficents take care of the mismatch between these two numbers

#   Perform a check to see if number of individuals matches number of inbreeding coefficients
if [ "${N_IND}" -ne "${N_F}" ]
then
    echo "Mismatch between number of samples in ${SAMPLE_LIST} and ${SAMPLE_INBREEDING}"
    exit 1
fi

#   Check to see if ancestral state is supplied: If not, polarize samples using
#   the reference sequence and generate folded saf.
if [ ! -f "${ANC_SEQ}" ]
then
    echo "Ancestral state data not found, using reference sequence to polarize alignment data. BAQ will likewise not be calculated."
    if [ ! -f "${REF_SEQ}" ]
    then
        echo "No reference sequence supplied, unable to perform calculations."
        exit 2
    else
        ANC_SEQ=$REF_SEQ
        REF_SEQ=
        BAQ=0
        FOLD=1
    fi
else
    FOLD=0
fi

#   Create outdirectory
OUT="${SCRATCH}"/"${PROJECT}"/SFS
mkdir -p "${OUT}"

#   Now we actually run the command, this creates a binary file that contains the prior SFS
if [[ -f "${OUT}"/"${PROJECT}"_SFSOut.mafs.gz ]] && [ "$OVERRIDE" = "false" ]
then
    echo "WRAPPER:maf already exists and OVERRIDE=false, skipping angsd -bam..."
else
    #   Do we have a regions file?
    if [[ -f "${REGIONS}" ]]
    then
    WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}"\
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -rf "${REGIONS}" \
            -doPost "${DO_POST}")
    #   Are we missing a definiton for regions?
    elif [[ -z "${REGIONS}" ]]
    then
    WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}"\
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -doPost "${DO_POST}")
    #   Assuming a single region was defined in config file
    else
    WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -folded "${FOLD}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}" \
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -doPost "${DO_POST}" \
            -r "${REGIONS}")
    fi
fi
# Check for advanced arguments, and overwrite any overlapping definitions
FINAL_ARGS=($(source "${SOURCE}/Wrappers/Arg_Zipper.sh" "${WRAPPER_ARGS}" "${ADVANCED_ARGS}"))
# DEBUGGING
# echo "Wrapper arguments: ${WRAPPER_ARGS}" 1<&2
# echo -e "Final arguments:" ${FINAL_ARGS} 1<&2

"${ANGSD_DIR}"/angsd "${FINAL_ARGS[@]}"

"${ANGSD_DIR}"/misc/realSFS \
    "${OUT}"/"${PROJECT}"_SFSOut.saf.idx \
    -P "${N_CORES}" \
    -fold "${FOLD}" \
    > "${OUT}"/"${PROJECT}"_DerivedSFS.graph.me`

I can also include my configuration file if helpful (Site_Frequency_Spectrum_Config) which also directs the script to another configuration file in the same directory (Common_Config), but I'm wondering whether anyone else has run into this error while trying to move through this tutorial before. I am trying to figure out if this is a file path issue or if the SFS is not running correctly and there is some other error in the output file I am not identifying correctly.

squisquater commented 2 years ago

UPDATE: I think there is something wrong with my Maize_SFSOut.saf.idx file but I'm not sure what the issue is. Specifically "Only read nSites: 0 will therefore prepare next chromosome (or exit)"

I found some related threads that suggest filtering is too stringent but since this is all based on the angsd-wrapper tutorial that doesn't seem the likely culprit.

/mnt/steelhead/remote/Sophie/scratch/Maize/SFS$ /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/dependencies/angsd/misc/realSFS ./Maize_SFSOut.saf.idx
        -> Version of fname:./Maize_SFSOut.saf.idx is:2
        -> Assuming .saf.gz file: ./Maize_SFSOut.saf.gz
        -> Assuming .saf.pos.gz: ./Maize_SFSOut.saf.pos.gz
        -> args: tole:0.000001 nthreads:4 maxiter:100 nsites:0 start:(null) chr:(null) start:-1 stop:-1 fname:./Maize_SFSOut.saf.idx fstout:(null) oldout:0 seed:1635892830 bootstrap:0
        -> You are printing the optimized SFS to the terminal consider dumping into a file
        -> E.g.: './realSFS ./Maize_SFSOut.saf.idx >sfs.ml.txt'
        -> nSites: 873481
        -> The choice of -nSites will require atleast: 83.301636 megabyte memory, that is at least: 1.05% of total memory
        -> dim(./Maize_SFSOut.saf.idx):23
        -> Dimension of parameter space: 23
        -> Done reading data from chromosome will prepare next chromosome
        -> Is in multi sfs, will now read data from chr:10
        -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
        -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
        -> Will run optimization on nSites: 873481
------------
startlik=-3393452.567606
lik[2]=-225047.521432 diff=3.168405e+06 alpha:1.000000 sr2:9.277772e-01
lik[5]=-225010.243117 diff=3.727831e+01 alpha:1.284184 sr2:1.180093e-07
lik[8]=-225008.303606 diff=1.939511e+00 alpha:3.949081 sr2:1.254134e-09
lik[11]=-225008.300234 diff=3.372127e-03 alpha:1.904168 sr2:4.637274e-12
        -> Breaking EM(sr2) at iter:12, sqrt(sr2):2.292394e-07
likelihood: -225008.300234
------------
837794.500918 0.000000 5871.665062 0.000000 3235.764685 0.000000 1832.887050 0.000000 1379.652548 0.000000 1391.708114 0.000000 1219.220259 0.000000 938.974705 0.000000 881.494241 0.000000 1172.368006 0.000000 1678.848464 0.000000 16083.915949
        -> Only read nSites: 0 will therefore prepare next chromosome (or exit)

        -> NB NB output is no longer log probs of the frequency spectrum!
        -> Output is now simply the expected values!
        -> You can convert to the old format simply with log(norm(x))
carte731 commented 2 years ago

I'll look into it - Your config file might help with troubleshooting.

jrmandel commented 2 years ago

Hi all, I also have the same issue with my own data: Problem opening file: '-fold' after running SFS.

My _SFSOut.saf.idx file also contains:


-> Only read nSites: 0 will therefore prepare next chromosome (or exit)
    -> NB NB output is no longer log probs of the frequency spectrum!
    -> Output is now simply the expected values! 
    -> You can convert to the old format simply with log(norm(x))
wul88 commented 2 years ago

I also encounter the same problem when running the tutorial (Example_Data/Maize). Here is output message: angsd-wrapper running from /gpfs/group/restricted/drc9/restricted/Lan/dotfiles/angsd-wrapper


WRAPPER: Zipping advanced arguments onto basic ones

-> angsd version: 0.911-44-g1c0ebb6 (htslib: 1.3.1-30-gbb03b02) build(Sep 14 2022 09:44:04)
-> Reading fasta: /gpfs/group/restricted/drc9/restricted/Lan/dotfiles/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
-> Reading fasta: /gpfs/group/restricted/drc9/restricted/Lan/dotfiles/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
-> (Using Filipe G Vieira modification of: abcSaf.cpp)
-> Parsing 11 number of samples
-> Region lookup 1/1

-> We have now allocated approximately 10 Megabytes of raw nodes to the nodepool

-> We have now allocated approximately 20 Megabytes of raw nodes to the nodepool
-> Printing at chr: 10 pos:15302669 chunknumber 100
-> Printing at chr: 10 pos:15780341 chunknumber 200
-> Printing at chr: 10 pos:16322598 chunknumber 300
-> Printing at chr: 10 pos:16492300 chunknumber 400
-> Printing at chr: 10 pos:16660947 chunknumber 500
-> Printing at chr: 10 pos:16771873 chunknumber 600
-> Printing at chr: 10 pos:16943362 chunknumber 700
-> Printing at chr: 10 pos:17114919 chunknumber 800
-> Printing at chr: 10 pos:17264910 chunknumber 900
-> Printing at chr: 10 pos:17405281 chunknumber 1000
-> Printing at chr: 10 pos:17551496 chunknumber 1100
-> Printing at chr: 10 pos:17759315 chunknumber 1200
-> Printing at chr: 10 pos:17906277 chunknumber 1300
-> Printing at chr: 10 pos:18074909 chunknumber 1400
-> Printing at chr: 10 pos:18231910 chunknumber 1500
-> Printing at chr: 10 pos:18354120 chunknumber 1600
-> Printing at chr: 10 pos:18713498 chunknumber 1700
-> Printing at chr: 10 pos:18966706 chunknumber 1800
-> Printing at chr: 10 pos:19170751 chunknumber 1900
-> Printing at chr: 10 pos:19386992 chunknumber 2000

[emFrequency_F] caught nan will not exit logLike (3nInd). nInd=11 keepList (nInd) used logLike (3length(keep))=11 -> Printing at chr: 10 pos:19526855 chunknumber 2100 -> Printing at chr: 10 pos:19622754 chunknumber 2200 -> Printing at chr: 10 pos:19787284 chunknumber 2300 -> Printing at chr: 10 pos:19985193 chunknumber 2400 -> Printing at chr: 10 pos:20417329 chunknumber 2500 -> Printing at chr: 10 pos:20706518 chunknumber 2600 -> Printing at chr: 10 pos:20898012 chunknumber 2700 -> Printing at chr: 10 pos:21081629 chunknumber 2800 -> Printing at chr: 10 pos:21574904 chunknumber 2900 -> Printing at chr: 10 pos:21732294 chunknumber 3000 -> Printing at chr: 10 pos:22185942 chunknumber 3100 -> Printing at chr: 10 pos:22395913 chunknumber 3200 [emFrequency_F] caught nan will not exit logLike (3nInd). nInd=11 keepList (nInd) used logLike (3length(keep))=10 [emFrequency_F] caught nan will not exit logLike (3nInd). nInd=11 keepList (nInd) used logLike (3length(keep))=10 [emFrequency_F] caught nan will not exit logLike (3nInd). nInd=11 keepList (nInd) used logLike (3length(keep))=10 -> Printing at chr: 10 pos:22763312 chunknumber 3300 -> Printing at chr: 10 pos:23074752 chunknumber 3400 -> Printing at chr: 10 pos:23410690 chunknumber 3500 -> Printing at chr: 10 pos:24004662 chunknumber 3600 [emFrequency_F] caught nan will not exit logLike (3nInd). nInd=11 keepList (nInd) used logLike (3length(keep))=11 -> Printing at chr: 10 pos:24104091 chunknumber 3700 -> Printing at chr: 10 pos:24449175 chunknumber 3800 -> Printing at chr: 10 pos:24798701 chunknumber 3900 -> Printing at chr: 10 pos:24908040 chunknumber 4000

-> Done reading data waiting for calculations to finish
-> Done waiting for threads

-> npools:27 unfreed tnodes before clean:0
-> Output filenames:
    ->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.arg"
    ->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.mafs.gz"
    ->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.geno.gz"
    ->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.gz"
    ->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.pos.gz"
    ->"/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.idx"
-> Wed Sep 14 17:12:37 2022
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used =  344.06 sec
[ALL done] walltime used =  217.00 sec
-> Version of fname:/gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.idx is:2
-> Assuming .saf.gz file: /gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.gz
-> Assuming .saf.pos.gz: /gpfs/scratch/lxw34/lowCov/angsd_wrapper/Maize/SFS/Maize_SFSOut.saf.pos.gz
-> Problem opening file: '-fold'

SFS/Maize_SFSOut.saf.idx is not right. It only contains one line as follows: safv3

SFS/Maize_DerivedSFS.graph.me is of size of 0. I don't know how to check the other binary files.

Has anyone resolved this problem yet? Please help.

Thanks!