ANGSD / angsd

Program for analysing NGS data.
223 stars 51 forks source link

Problems with unsorted saf file chromoname #626

Open MboiTui opened 1 month ago

MboiTui commented 1 month ago

Hello,

I am using ANGSD and encounter a problem. I managed to produce .saf file successfully, and I managed to use those saf files to produce 1d sfs (with realSFS or winsfs) and 2d sfs (with winsfs, haven't tried with realSFS yet).

But, when I try to estimate Fst with 3 populations, I get the following output:

angsd v0.940 software is available in a singularity container.

Some aliases have been created for your convenience - few researchers
will ever need to use the singularity command directly.  To view the
aliases, use:
    module show angsd/0.940

[persaf::persaf_init] Version of 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LM_unfolded.saf.idx is 3
[persaf::persaf_init] Assuming .saf.gz file is 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LM_unfolded.saf.gz
[persaf::persaf_init] Assuming .saf.pos.gz file is 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LM_unfolded.saf.pos.gz
[persaf::persaf_init] Version of 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LS_unfolded.saf.idx is 3
[persaf::persaf_init] Assuming .saf.gz file is 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LS_unfolded.saf.gz
[persaf::persaf_init] Assuming .saf.pos.gz file is 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LS_unfolded.saf.pos.gz
[persaf::persaf_init] Version of 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LNs_unfolded.saf.idx is 3
[persaf::persaf_init] Assuming .saf.gz file is 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LNs_unfolded.saf.gz
[persaf::persaf_init] Assuming .saf.pos.gz file is 5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LNs_unfolded.saf.pos.gz
    -> args: tole:0.000000 nthreads:4 maxiter:100 nsites(block):0 start:5_4_unfolded/5_4_LMvsLS_unfolded.real.sfs chr:(null) start:-1 stop:-1 fstout:LMvsLSvsLNs.pbs oldout:0 seed:-1 bootstrap:0 resample_chr:0 whichFst:1 fold:0 ref:(null) anc:(null)
    -> nSites: 100000
    -> IMPORTANT: please make sure that your saf files haven't been folded with -fold 1 in -doSaf in angsd
    -> [bhatiaFst] sfs1:8 sfs2:14 dimspace:135 
    -> generating offset remapper lookup
    -> isSame:1 adjusting foldfactors
    -> [bhatiaFst] sfs1:8 sfs2:12 dimspace:117 
    -> generating offset remapper lookup
    -> isSame:1 adjusting foldfactors
    -> [bhatiaFst] sfs1:14 sfs2:12 dimspace:195 
    -> generating offset remapper lookup
    -> isSame:1 adjusting foldfactors
    -> Reading: 5_4_unfolded/5_4_LMvsLS_unfolded.real.sfs assuming counts (will normalize to probs internally)
    -> Reading: 5_4_unfolded/5_4_LMvsLNs_unfolded.real.sfs assuming counts (will normalize to probs internally)
    -> Reading: 5_4_unfolded/5_4_LSvsLNs_unfolded.real.sfs assuming counts (will normalize to probs internally)
    -> Done reading data from chromosome will prepare next chromosome
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Potential big problem, ordering of positions indicates that sites has not been sorted ? 
    -> chromoname: 'scaffold_1' pos[147253517]:271 vs pos[147253517-1]:344416033
    -> Consider running ./realSFS check your.saf.idx for each pop
    -> Potential big problem, ordering of positions indicates that sites has not been sorted ? 
    -> chromoname: 'scaffold_1' pos[294507034]:271 vs pos[294507034-1]:344416033
    -> Consider running ./realSFS check your.saf.idx for each pop
    -> Potential big problem, ordering of positions indicates that sites has not been sorted ? 
    -> chromoname: 'scaffold_1' pos[169716130]:271 vs pos[169716130-1]:344416033
    -> Consider running ./realSFS check your.saf.idx for each pop
    -> Potential big problem, ordering of positions indicates that sites has not been sorted ? 
    -> chromoname: 'scaffold_1' pos[339432260]:271 vs pos[339432260-1]:344416033
    -> Consider running ./realSFS check your.saf.idx for each pop
    -> Potential big problem, ordering of positions indicates that sites has not been sorted ? 
    -> chromoname: 'scaffold_1' pos[149309022]:271 vs pos[149309022-1]:344415640
    -> Consider running ./realSFS check your.saf.idx for each pop
    -> Potential big problem, ordering of positions indicates that sites has not been sorted ? 
    -> chromoname: 'scaffold_1' pos[298618044]:271 vs pos[298618044-1]:344415640
    -> Consider running ./realSFS check your.saf.idx for each pop
    -> Sites to keep[scaffold_1] from pop0: 5170011
    -> Sites to keep[scaffold_1] from pop1: 23449266
[ERROR](realSFS_shared.cpp:130) tsk[i]==tsk[i-1]

When I run realSFS check i get:

-> ./realSFS check your.saf.idx
    -> This will check that the positions are ordered correctly
[persaf::persaf_init] Version of 5_angsd/5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LM_unfolded.saf.idx is 3
[persaf::persaf_init] Assuming .saf.gz file is 5_angsd/5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LM_unfolded.saf.gz
[persaf::persaf_init] Assuming .saf.pos.gz file is 5_angsd/5_2_unfolded/5_2_minQ20_minInd3_rd6-dInd_scaf1Mb_LM_unfolded.saf.pos.gz
    -> args: tole:0.000000 nthreads:4 maxiter:100 nsites(block):0 start:(null) chr:(null) start:-1 stop:-1 fstout:(null) oldout:0 seed:-1 bootstrap:0 resample_chr:0 whichFst:0 fold:0 ref:(null) anc:(null)
    -> problems with unsorted saf file chromoname: 'scaffold_1' pos[147253517]:271 vs posd[147253517-1]:344416033
    -> problems with unsorted saf file chromoname: 'scaffold_1' pos[294507034]:271 vs posd[294507034-1]:344416033
    -> problems with unsorted saf file chromoname: 'scaffold_10' pos[52090370]:6579 vs posd[52090370-1]:118205165
    -> problems with unsorted saf file chromoname: 'scaffold_10' pos[104180740]:6579 vs posd[104180740-1]:118205165

I produced the saf file with the below command:

angsd -bam 5_angsd/5_0_popfiles/${pop}.bamlist \
    -ref ${REF} -anc ${ANC} -out 5_angsd/5_2_unfolded/input \
    -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
    -minMapQ 20 -minQ 20 -minInd 3 -setMinDepth 6 -setMaxDepth ${MAXDEPTH} \
    -GL 2 -doSaf 1 -doCounts 1 -P 4 -rf 5_2_Scaffolds1Mb

the 5_2_Scaffolds1Mb file looks like:

scaffold_1  1   344421834
scaffold_2  1   300751346
scaffold_3  1   273662429
scaffold_4  1   261171726
scaffold_5  1   222744812
scaffold_6  1   198822452
scaffold_7  1   165210402
scaffold_8  1   142255372
scaffold_9  1   124712869
scaffold_10 1   118207257
scaffold_11 1   90476049
scaffold_12 1   73896197
scaffold_13 1   63602140
scaffold_14 1   4739326
scaffold_15 1   3825199
scaffold_16 1   3637416
scaffold_17 1   2971943
scaffold_18 1   2739023
scaffold_19 1   2514448
scaffold_20 1   2502722
scaffold_21 1   2284445
scaffold_22 1   1891888
scaffold_23 1   1783337
scaffold_24 1   1673584
scaffold_25 1   1427651
scaffold_26 1   1262366
scaffold_27 1   1211415
scaffold_28 1   1188748
scaffold_29 1   1097499
scaffold_30 1   1069805
scaffold_31 1   1055864
scaffold_32 1   1052771
scaffold_2153   1   1725348

Which i used to restrict the analysis to scaffolds larger than 1Mb

I am stumped as to how I can fix this. There are no overlapping regions in my rf file. Do i need to sort the scaffold names alphabetically?

MboiTui commented 1 month ago

I ended up filtering the bam files to retain only the chromosomes of interest before running ANGSD, and when I run realSFS check my.saf.idx I get no errors, so I assume the issue was with the -rf option, as reported in other Issues.