ANGSD / angsd

Program for analysing NGS data.
229 stars 50 forks source link

-dopost 3 / -dosaf 3 #64

Closed claudiuskerth closed 7 years ago

claudiuskerth commented 7 years ago

Hi,

is the estimation of genotype probabilties from posterior SAF's (instead of MAF's) already working or still in development? When I run the following command I get a segmentation fault:

angsd -bam bamfile.list -out TEST/ERY -doMajorMinor 1 -GL 1 -sites keep.sites -only_proper_pairs 0 -minmapq 5 -baq 1 -ref Big_Data_ref.fa -anc Big_Data_ref.fa -dopost 3 -dogeno 32 -dosaf 1 -pest TEST/Ery.folded.sfs

claudius

ANGSD commented 7 years ago

Not only should it work. It should work very well. For many samples it is however much more cpuintensive.

claudiuskerth commented 7 years ago

Hmm, could you maybe provide a sequence of commands, as examples, that show how to achieve this?

many thanks

claudius

ANGSD commented 7 years ago

Ok, I actually thought that there were an example on the wiki... Ill see if i have time later today to add some examples. Thanks for bringing this to my attention

On Tue, Dec 13, 2016 at 1:03 PM, Claudius Kerth notifications@github.com wrote:

Hmm, could you maybe provide a sequence of commands, as examples, that show how to achieve this?

many thanks

claudius

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/64#issuecomment-266721568, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7ti4B1kS4Ds4cbazSsOhqs-UJEcmks5rHomPgaJpZM4LLohZ .

claudiuskerth commented 7 years ago

That would be great. Thanks for the speedy reply!

claudius

ANGSD commented 7 years ago

Hey, I made an example on the genotype calling page. http://www.popgen.dk/angsd/index.php/Genotype_calling#Sample_allele_frequency_with_SFS_as_prior

Not sure where if it would make sense to have it on other wiki pages also. Where did you find it?

claudiuskerth commented 7 years ago

Hi Thorfinn,

thanks for that. This runs without error on my data. I also think it makes sense to put it on the page for genotype calling.

However, I am still a bit confused about what is actually going on. What is the SFS a prior for? The population allele frequencies or the sample allele frequencies?

From my (limited) understanding of Nielsen2012 I would have thought the SFS is prior for SAF likelihoods and the posterior SAF's are then turned into population allele frequency estimates in order to use HWE expected genotype frequencies as prior for individual genotype likelihoods. However, your command line with -domaf 1 (and no dosaf 1) suggests to me that the SFS is used directly as a prior for population allele frequency estimates?!

I would be grateful for a few hints in the right direction.

many thanks

claudius

claudiuskerth commented 7 years ago

Continuing from the above, I have just seen that you used the following command for your ANGSD paper:

angsd -sites sites.txt -b ceu31.list -doMajorMinor 4 -GL 1 -doSaf 3 -P 6 -ref hg19.fa -pest sfs.txt

No -domaf 1 in there. I assume -dosaf 3 has meanwhile been replace by -dopost 3.

claudius

ANGSD commented 7 years ago

Hi Claudius, i can see that this is confusing.

This genotype calling follows the nielsen2012 paper.

For the first couple of years all the analysis that relates to the nielsen2012 were put in the abcSaf.cpp class.

Then we decided to do some code refactoring such that the contents of the classes reflects what they are doing instead of which publication it was from.

So all the stuff that does calculate genotype likelihoods is in one class. One class that does sample allele frequencies which is the same one that calculates the genotype post probs.

So if you are calculating the genotype post prob using the sfs approach then you need to 'activate' the abcFreq class with the -doMaf option, even if it is not used in any analysis.

The proper approach here would be to split out the genotype calling/genotype post prob into its own class, then the user wouldn't need to supply -domaf, since it is not used in the genotypetype calling process.

Hope this helps. The short versions is that -domaf 1 is required due to the fact that we haven't had time to move the code to its own class but are piggy backing on the freq class

On Fri, Dec 16, 2016 at 6:08 PM, Claudius Kerth notifications@github.com wrote:

Continuing from the above, I have just seen that you used the following command for your ANGSD paper:

angsd -sites sites.txt -b ceu31.list -doMajorMinor 4 -GL 1 -doSaf 3 -P 6 -ref hg19.fa -pest sfs.txt

No -domaf 1 in there. I assume -dosaf 3 has meanwhile been replace by -dopost 3.

claudius

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/64#issuecomment-267643490, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7iAmtoF1c35lp0YbrUtc1J5obHF3ks5rIsWdgaJpZM4LLohZ .

claudiuskerth commented 7 years ago

Thank you Thorfinn!

That explains the -domaf in the command, but I am still confused about which type of allele frequency is used by this command: SAF or MAF. The command that you show on the wiki does not produce saf files, at least in my test, nor does it read them in, but a SFS only makes sense as prior for SAF's, right? Does it produce the SAF's on the fly, without saving them to disk?

Sorry for being so fussy. I just want to have a general understanding of the command before I use it.

claudius

ANGSD commented 7 years ago

It calculates the safs on the fly.

It is the implementation of equation13 in this paper: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0037558

It is quite cpu intensive. If you want to calculate the genotype probabilities for say 'N' individuals it has to calculate the safs for all N-1 subsets by excluding one individual at a time.

Best

On Mon, Dec 19, 2016 at 8:36 PM, Claudius Kerth notifications@github.com wrote:

Thank you Thorfinn!

That explains the -domaf in the command, but I am still confused about which type of allele frequency is used by this command: SAF or MAF. The command that you show on the wiki does not produce saf files, at least in my test, nor does it read them in, but a SFS only makes sense as prior for SAF's, right? Does it produce the SAF's on the fly, without saving them to disk?

Sorry for being so fussy. I just want to have a general understanding of the command before I use it.

claudius

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/64#issuecomment-268057147, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7r3dFbCXwVrK2WMrndBR0-UtvjmCks5rJtyigaJpZM4LLohZ .

claudiuskerth commented 7 years ago

That makes sense

thank you so much!

claudius

ANGSD commented 7 years ago

Hi Im closing this issue, feel free to reopen if needed