ANGSD / angsd

Program for analysing NGS data.
230 stars 51 forks source link

PROBS at positions #570

Open ghost opened 1 year ago

ghost commented 1 year ago

Hello,

I recently encountered an error with "PROBS at: XXXX" I am generating a SAF with --doSAF 1 for a 2D sfs with the following command:

angsd -b "$BAMS"
-r chr01 -P 10 -dosaf 1 \
-anc "$REFERENCE_FASTA" \
-minMapQ 25 -minQ 25 -remove_bads 1 \
-GL 1 -out saf1_out1

I am doing this for 12 populations, with 2 lineages at each populations so I am generating a saf for each lineage at each population (total files = 24). Some populations don't have any "PROBS" while the highest number of "PROBS at:" is approximately 600. I checked the bams using samtools tview and the sites are SNPs with decent coverage (anywhere from 4x to 15x). I noticed some of these sites are triallelic. How does dosaf behave when a site is triallelic?

Any suggestions would be appreciated.

Edit: I also noticed that these sites tend to use the IUPAC Ambiguity Codes for some calls. This might be a coincidence too.

jbruxaux commented 10 months ago

Hi!

I have exactly the same issue, and if I used ANGSD a lot before, it's my first try with the -r option. So I wonder if it's not linked to that... But my reference genome is so big for this project, that I can't really imagine not using it... As an example of one output I have:

[...]
        -> Allocated ~ 70 million nodes to the nodepool, this is not an estimate of the memory usage

        -> Allocated ~ 80 million nodes to the nodepool, this is not an estimate of the memory usage

        -> Allocated ~ 90 million nodes to the nodepool, this is not an estimate of the memory usage
PROBS at: part4 4830
PROBS at: part4 21515
PROBS at: part4 67548
PROBS at: part4 70451
PROBS at: part4 79098
        -> Printing at chr: part4 pos:80138 chunknumber 100 contains 586 sites
PROBS at: part4 97977
PROBS at: part4 115440
PROBS at: part4 119794
PROBS at: part4 125387
PROBS at: part4 148914
PROBS at: part4 154684
PROBS at: part4 163107
PROBS at: part4 168893
PROBS at: part4 184983
        -> Printing at chr: part4 pos:195941 chunknumber 200 contains 956 sites
[...]

Any comment from the developers would be really appreciated! :-)

ANGSD commented 10 months ago

The difference between -r and -sites is that one is using active lookup by using the index for each bam/cram whereas the nsites limits the analyses and printout.

The probs issue is normally caused by the ref allele not being ancestral or derived.

On 10 Jan 2024, at 14.47, jbruxaux @.***> wrote:

Hi!

I have exactly the same issue, and if I used ANGSD a lot before, it's my first try with the -r option. So I wonder if it's not linked to that... But my reference genome is so big for this project, that I can't really imagine not using it... As an example of one output I have:

[...] -> Allocated ~ 70 million nodes to the nodepool, this is not an estimate of the memory usage

    -> Allocated ~ 80 million nodes to the nodepool, this is not an estimate of the memory usage

    -> Allocated ~ 90 million nodes to the nodepool, this is not an estimate of the memory usage

PROBS at: part4 4830 PROBS at: part4 21515 PROBS at: part4 67548 PROBS at: part4 70451 PROBS at: part4 79098 -> Printing at chr: part4 pos:80138 chunknumber 100 contains 586 sites PROBS at: part4 97977 PROBS at: part4 115440 PROBS at: part4 119794 PROBS at: part4 125387 PROBS at: part4 148914 PROBS at: part4 154684 PROBS at: part4 163107 PROBS at: part4 168893 PROBS at: part4 184983 -> Printing at chr: part4 pos:195941 chunknumber 200 contains 956 sites [...] Any comment from the developers would be really appreciated! :-)

— Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/570#issuecomment-1884882923, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQOR3QXKUAYVMRW6MQK5LDYN2LYXAVCNFSM6AAAAAAXSAFCWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBUHA4DEOJSGM. You are receiving this because you are subscribed to this thread.

jbruxaux commented 10 months ago

Thanks a lot for your very fast answer! Since I used the same reference for mapping and as an ancestral genome in the -dosaf analysis, I assume the error must come from the fact that the reference is too distant? And therefore the ref allele (which is the major allele in the dataset) can be different from the ancestral sequence. In that case, I do not have to worry and can continue my analysis. Am I right?

ANGSD commented 10 months ago

yes, i believe you can skip those sites, which is exactly what is happening in the code, it just informs you, maybe it is abit too verbose and maybe the message should be more clear. You are not the first one that has been concerned about it.

Best regards.

On 10 Jan 2024, at 15.35, jbruxaux @.***> wrote:

Thanks a lot for your very fast answer! Since I used the same reference for mapping and as an ancestral genome in the -dosaf analysis, I assume the error must come from the fact that the reference is too distant? And therefore the ref allele (which is the major allele in the dataset) can be different from the ancestral sequence. In that case, I do not have to worry and can continue my analysis. Am I right?

— Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/570#issuecomment-1884966204, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQOR3RHA4PQPCSIAY6W2XTYN2RKLAVCNFSM6AAAAAAXSAFCWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBUHE3DMMRQGQ. You are receiving this because you commented.