Fst window output shows only weighted Fst, not unweighted.

ANGSD / angsd

Program for analysing NGS data.

230 stars 50 forks source link

Fst window output shows only weighted Fst, not unweighted. #92

Closed johnburley3000 closed 7 years ago

johnburley3000 commented 7 years ago

Hi all,

Wondering if I can have some assistance to extract unweighted Fst in windows from my low-coverage population genomic data. I have followed the realSFS method to estimate Fst values from population pairs, and I've also used the 3 population option to get population branch stats (PBS).

I'll describe the problem using 2 populations: The command, $realSFS fst stats ${OUTDIR}/${P1}_${P2}_${REGIONS}.fst.idx > ${OUTDIR}/${P1}_${P2}_${REGIONS}.fst.txt outputs estimates of unweighted and weighted Fst, the latter being considerably higher (i.e. 0.1 compared to 0.5 for populations that I know are very similar) (the difference between unweighted and weighted Fst are described here: https://github.com/ANGSD/angsd/issues/16). Then, the command: $realSFS fst stats2 ${OUTDIR}/${P1}_${P2}_${REGIONS}.fst.idx -win 25000 -step 25000 > ${OUTDIR}/${P1}_${P2}_${REGIONS}_25kbwinfst.txt only retrieves the weighted Fst in windows. I also tried -whichFst 1.

I would prefer to use the unweighted (lower) estimate of Fst, averaged across windows, but can't figure out how to extract this. The weighted Fst values seem unrealistically high, and so the Fst plots are a noisy mess of points between 0-1.

I think this quote from the instructions on how to get PBS/FST is relevant: "where columns are: chromosome, position, (a), (a+b) values for the three FST comparisons, where FST is defined as a/(a+b). Note that FST on multiple SNPs is calculated as sum(a)/sum(a+b)." - https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md

I am a bit confused by this - is it that the weighted estimate is used whenever there are multiple SNPs per window? This would be the case in every window, and would explain why I only have weighted Fst in the window output.

In conclusion: Is there any way to retrieve unweighted Fst in windows, using ANGSD??

Thanks, John

ANGSD commented 7 years ago

Hi John, as i remember people found that the unweighted was much to noisy, so we removed it from the output. You can uncomment this line: https://github.com/ANGSD/angsd/blob/master/misc/safstat.cpp#L299

Or you might want to copy unweight[i]/(1.0*nObs) into fstW[i]. To use the existing framework etc.

On Wed, Jul 19, 2017 at 2:21 AM, johnburley3000 notifications@github.com wrote:

Hi all,

Wondering if I can have some assistance to extract unweighted Fst in windows from my low-coverage population genomic data. I have followed the realSFS method to estimate Fst values from population pairs, and I've also used the 3 population option to get population branch stats (PBS).

I'll describe the problem using 2 populations: The command, $realSFS fst stats ${OUTDIR}/${P1}${P2}${REGIONS}.fst.idx > ${OUTDIR}/${P1}${P2}${REGIONS}.fst.txt outputs estimates of unweighted and weighted Fst, the latter being considerably higher (i.e. 0.1 compared to 0.5 for populations that I know are very similar) (the difference between unweighted and weighted Fst are described here: #16 https://github.com/ANGSD/angsd/issues/16). Then, the command: $realSFS fst stats2 ${OUTDIR}/${P1}${P2}${REGIONS}.fst.idx -win 25000 -step 25000 > ${OUTDIR}/${P1}${P2}${REGIONS}_25kbwinfst.txt only retrieves the weighted Fst in windows. I also tried -whichFst 1.

I would prefer to use the unweighted (lower) estimate of Fst, averaged across windows, but can't figure out how to extract this. The weighted Fst values seem unrealistically high, and so the Fst plots are a noisy mess of points between 0-1.

I think this quote from the instructions on how to get PBS/FST is relevant: "where columns are: chromosome, position, (a), (a+b) values for the three FST comparisons, where FST is defined as a/(a+b). Note that FST on multiple SNPs is calculated as sum(a)/sum(a+b)." - https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md

I am a bit confused by this - is it that the weighted estimate is used whenever there are multiple SNPs per window? This would be the case in every window, and would explain why I only have weighted Fst in the window output.

In conclusion: Is there any way to retrieve unweighted Fst in windows, using ANGSD??

Thanks, John

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7t-REqkoHh-r0nzbmtAFPealeVpVks5sPVn_gaJpZM4OcHQw .

johnburley3000 commented 7 years ago

Hi, thanks for your help with this. I was able to get the unweighted output following your recommendation. After looking into this some more, I think that the main reason for my noisy Fst scan is actually that I was using small window sizes. It seems there is a just lot of variation in fst across the genome for the paired populations I am examining.

I am now having an issue with -doAbbababa, and I would really appreciate any help, for I can't find other examples online besides in your instructions that I am trying to follow ( http://www.popgen.dk/angsd/index.php/Abbababa).

The .abbababa outputs are not as I expect them to be. I am expecting 5 columns: chr, regionStart, regionEnd, count(ABBA), count(BABA). The first three columns are there, but then there are 2*(number individuals in bamlist) columns filled with 1. Further, I am using -r scaffold_100, and it prints a line for each scaffold from 0 to 100, and then starts to work on every chunk in scaffold_100.

[image: Inline image 1]

My command line is this:

$ANGSD -doAbbababa 1 -bam Code/D_test1.filelist -blockSize 5e4 -anc $ANC -out Results/Abba/D_test1 \ -doCounts 1 -r $SCAF -sites ${SITES} \ -remove_bads 1 -trim 0 -minMapQ 20 -minQ 20 -only_proper_pairs 0

$ANC is the ancestral fasta that I made by mapping reads from an outgroup onto a reference genome, which comes from the ingroup.

Another point of confusion, I don't understand why the example shows a 5-population bamfile. These must include H1, H2, H3, but what are the other two? I've tried it with bamlist containing 5 and 3 individuals

Any help with this will be greatly appreciated! I can't find other examples online for comparison.

Kind regards, John

On Wed, Jul 19, 2017 at 4:43 AM, ANGSD notifications@github.com wrote:

Hi John, as i remember people found that the unweighted was much to noisy, so we removed it from the output. You can uncomment this line: https://github.com/ANGSD/angsd/blob/master/misc/safstat.cpp#L299

Or you might want to copy unweight[i]/(1.0*nObs) into fstW[i]. To use the existing framework etc.

On Wed, Jul 19, 2017 at 2:21 AM, johnburley3000 notifications@github.com wrote:

Hi all,

Wondering if I can have some assistance to extract unweighted Fst in windows from my low-coverage population genomic data. I have followed the realSFS method to estimate Fst values from population pairs, and I've also used the 3 population option to get population branch stats (PBS).

I'll describe the problem using 2 populations: The command, $realSFS fst stats ${OUTDIR}/${P1}${P2}${REGIONS}.fst.idx > ${OUTDIR}/${P1}${P2}${REGIONS}.fst.txt outputs estimates of unweighted and weighted Fst, the latter being considerably higher (i.e. 0.1 compared to 0.5 for populations that I know are very similar) (the difference between unweighted and weighted Fst are described here: #16 https://github.com/ANGSD/angsd/issues/16). Then, the command: $realSFS fst stats2 ${OUTDIR}/${P1}${P2}${REGIONS}.fst.idx -win 25000 -step 25000 > ${OUTDIR}/${P1}${P2}${REGIONS}_25kbwinfst.txt only retrieves the weighted Fst in windows. I also tried -whichFst 1.

I would prefer to use the unweighted (lower) estimate of Fst, averaged across windows, but can't figure out how to extract this. The weighted Fst values seem unrealistically high, and so the Fst plots are a noisy mess of points between 0-1.

I think this quote from the instructions on how to get PBS/FST is relevant: "where columns are: chromosome, position, (a), (a+b) values for the three FST comparisons, where FST is defined as a/(a+b). Note that FST on multiple SNPs is calculated as sum(a)/sum(a+b)." - https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md

I am a bit confused by this - is it that the weighted estimate is used whenever there are multiple SNPs per window? This would be the case in every window, and would explain why I only have weighted Fst in the window output.

In conclusion: Is there any way to retrieve unweighted Fst in windows, using ANGSD??

Thanks, John

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7t-REqkoHh- r0nzbmtAFPealeVpVks5sPVn_gaJpZM4OcHQw .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-316315795, or mute the thread https://github.com/notifications/unsubscribe-auth/ARexMtEhHObFU0fVvUl-M-ZYsZWcMZcCks5sPcGlgaJpZM4OcHQw .

johnburley3000 commented 7 years ago

Hi again,

An update on the problem described above (hopefully before you read the message):

I have now got expected output, however I think there might be a problem with using -r in -doAbbababa. Here is what I have observed:

Previously, I had tried my command with -r scaffold_100, and this produced an .abbababa file full of zeros, as shown in the email above. Today, I tried using -r scaffold_0, and this produced the expected output. The only difference between scaffold_0 and scaffold_0 is 15,575,731 bp and scaffold_100 is 3,104,910 bp. I think it should still work on a smaller scaffold...

To update you, the command I used today is: $ANGSD -doAbbababa 1 -bam Code/D_test1.filelist -out Results/Abba/D_test1 -doCounts 1 -anc $ANC -minMapQ 20 -minQ 20 -blockSize 50000 -r $SCAF -sites $SITES

with SCAF = scaffold_0 and scaffold_100. I have also tried with blocksize as low as 5kb for scaffold_100.

scaffold_100 is a real scaffold on all sequences, and has ~13,733 SNPs according to my -sites file.

John

On Mon, Jul 24, 2017 at 9:51 PM, John Burley john.burley3000@gmail.com wrote:

Hi, thanks for your help with this. I was able to get the unweighted output following your recommendation. After looking into this some more, I think that the main reason for my noisy Fst scan is actually that I was using small window sizes. It seems there is a just lot of variation in fst across the genome for the paired populations I am examining.

I am now having an issue with -doAbbababa, and I would really appreciate any help, for I can't find other examples online besides in your instructions that I am trying to follow (http://www.popgen.dk/angsd/ index.php/Abbababa).

The .abbababa outputs are not as I expect them to be. I am expecting 5 columns: chr, regionStart, regionEnd, count(ABBA), count(BABA). The first three columns are there, but then there are 2*(number individuals in bamlist) columns filled with 1. Further, I am using -r scaffold_100, and it prints a line for each scaffold from 0 to 100, and then starts to work on every chunk in scaffold_100.

[image: Inline image 1]

My command line is this:

$ANGSD -doAbbababa 1 -bam Code/D_test1.filelist -blockSize 5e4 -anc $ANC -out Results/Abba/D_test1 \ -doCounts 1 -r $SCAF -sites ${SITES} \ -remove_bads 1 -trim 0 -minMapQ 20 -minQ 20 -only_proper_pairs 0

$ANC is the ancestral fasta that I made by mapping reads from an outgroup onto a reference genome, which comes from the ingroup.

Another point of confusion, I don't understand why the example shows a 5-population bamfile. These must include H1, H2, H3, but what are the other two? I've tried it with bamlist containing 5 and 3 individuals

Any help with this will be greatly appreciated! I can't find other examples online for comparison.

Kind regards, John

On Wed, Jul 19, 2017 at 4:43 AM, ANGSD notifications@github.com wrote:

Hi John, as i remember people found that the unweighted was much to noisy, so we removed it from the output. You can uncomment this line: https://github.com/ANGSD/angsd/blob/master/misc/safstat.cpp#L299

Or you might want to copy unweight[i]/(1.0*nObs) into fstW[i]. To use the existing framework etc.

On Wed, Jul 19, 2017 at 2:21 AM, johnburley3000 <notifications@github.com

wrote:

Hi all,

Wondering if I can have some assistance to extract unweighted Fst in windows from my low-coverage population genomic data. I have followed the realSFS method to estimate Fst values from population pairs, and I've also used the 3 population option to get population branch stats (PBS).

I'll describe the problem using 2 populations: The command, $realSFS fst stats ${OUTDIR}/${P1}${P2}${REGIONS}.fst.idx > ${OUTDIR}/${P1}${P2}${REGIONS}.fst.txt outputs estimates of unweighted and weighted Fst, the latter being considerably higher (i.e. 0.1 compared to 0.5 for populations that I know are very similar) (the difference between unweighted and weighted Fst are described here: #16 https://github.com/ANGSD/angsd/issues/16). Then, the command: $realSFS fst stats2 ${OUTDIR}/${P1}${P2}${REGIONS}.fst.idx -win 25000 -step 25000 > ${OUTDIR}/${P1}${P2}${REGIONS}_25kbwinfst.txt only retrieves the weighted Fst in windows. I also tried -whichFst 1.

I would prefer to use the unweighted (lower) estimate of Fst, averaged across windows, but can't figure out how to extract this. The weighted Fst values seem unrealistically high, and so the Fst plots are a noisy mess of points between 0-1.

I think this quote from the instructions on how to get PBS/FST is relevant: "where columns are: chromosome, position, (a), (a+b) values for the three FST comparisons, where FST is defined as a/(a+b). Note that FST on multiple SNPs is calculated as sum(a)/sum(a+b)." - https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md

I am a bit confused by this - is it that the weighted estimate is used whenever there are multiple SNPs per window? This would be the case in every window, and would explain why I only have weighted Fst in the window output.

In conclusion: Is there any way to retrieve unweighted Fst in windows, using ANGSD??

Thanks, John

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7t- REqkoHh-r0nzbmtAFPealeVpVks5sPVn_gaJpZM4OcHQw .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-316315795, or mute the thread https://github.com/notifications/unsubscribe-auth/ARexMtEhHObFU0fVvUl-M-ZYsZWcMZcCks5sPcGlgaJpZM4OcHQw .

aalbrechtsen commented 7 years ago

Dear John

If I understand you correctly then the problem is that you don't have any output when using one of your scaffolds?

You will get an output of zeroes if there are no ABBA or BABA sites sampled. Have you looked at your input to verify that you have data from all 3? individuals you are using and that there should be ABBA or BABA sites. You can use -doCounts 1 -dumpCounts 3 to see the number of A/G/C/T for each individual

ps. I would not recommend using a small blockSize unless you are certain the you effective population size is so large that you don't expect LD between markers futher away from each other than the block size.

-Anders

johnburley3000 commented 7 years ago

Hi Anders,

Thanks for the response. I'm currently running -doAbbababa for all of my autosomal scaffolds and z-chromosome separately, and there are no apparent problems. I am now running for 5 individuals from different populations, as in your BMC Bioinformatics 2014 example.

I wanted to tell you about how it had failed with scaffold_100 because it might indicate a bug. I was only doing this for testing purposes - I'm not actually interested in that scaffold particularly.

I tested the smaller blockSize values because I thought perhaps there is a minimum block number, but that is probably not the case.

I will now run for scaffold_100 with -doCounts 1 -dumpCounts 3, as you suggest. I'll let you know whether there should be ABBA or BABA sites.

John

On Tue, Jul 25, 2017 at 10:32 AM, Anders Albrechtsen < notifications@github.com> wrote:

Dear John

If I understand you correctly then the problem is that you don't have any output when using one of your scaffolds?

You will get an output of zeroes if there are no ABBA or BABA sites sampled. Have you looked at your input to verify that you have data from all 3? individuals you are using and that there should be ABBA or BABA sites. You can use -doCounts 1 -dumpCounts 3 to see the number of A/G/C/T for each individual

ps. I would not recommend using a small blockSize unless you are certain the you effective population size is so large that you don't expect LD between markers futher away from each other than the block size.

-Anders

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317756706, or mute the thread https://github.com/notifications/unsubscribe-auth/ARexMiBx6ZOqZvolQq_pmo_Z-i7m4j9uks5sRfx8gaJpZM4OcHQw .

johnburley3000 commented 7 years ago

Hi Anders,

I have found that there is in fact output data for my scaffold_100 attempt (and the -dumpCounts output shows that there are biallelic sites).

The problem, seems to be minor, although you may be interested. The output, when I specify -r scaffold_100, gives empty rows for each scaffold up to scaffold_100, and then it begins to output data as expected. I haven't noticed this for any of the other ANGSD functions, when I use the -r option.

The .abbababa and .args files are attached, if you would like to see, from the run:

$ANGSD -doAbbababa 1 -bam Code/D_test1.filelist -out Results/Abba/D_test2 -doCounts 1 -dumpCounts 3 -anc $ANC -minMapQ 20 -minQ 20 -blockSize 100000 -r scaffold_100 -sites $SITES

On Tue, Jul 25, 2017 at 10:46 AM, John Burley john.burley3000@gmail.com wrote:

Hi Anders,

Thanks for the response. I'm currently running -doAbbababa for all of my autosomal scaffolds and z-chromosome separately, and there are no apparent problems. I am now running for 5 individuals from different populations, as in your BMC Bioinformatics 2014 example.

I wanted to tell you about how it had failed with scaffold_100 because it might indicate a bug. I was only doing this for testing purposes - I'm not actually interested in that scaffold particularly.

I tested the smaller blockSize values because I thought perhaps there is a minimum block number, but that is probably not the case.

I will now run for scaffold_100 with -doCounts 1 -dumpCounts 3, as you suggest. I'll let you know whether there should be ABBA or BABA sites.

John

On Tue, Jul 25, 2017 at 10:32 AM, Anders Albrechtsen < notifications@github.com> wrote:

Dear John

If I understand you correctly then the problem is that you don't have any output when using one of your scaffolds?

You will get an output of zeroes if there are no ABBA or BABA sites sampled. Have you looked at your input to verify that you have data from all 3? individuals you are using and that there should be ABBA or BABA sites. You can use -doCounts 1 -dumpCounts 3 to see the number of A/G/C/T for each individual

ps. I would not recommend using a small blockSize unless you are certain the you effective population size is so large that you don't expect LD between markers futher away from each other than the block size.

-Anders

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317756706, or mute the thread https://github.com/notifications/unsubscribe-auth/ARexMiBx6ZOqZvolQq_pmo_Z-i7m4j9uks5sRfx8gaJpZM4OcHQw .

aalbrechtsen commented 7 years ago

Glad you figured it out.

-Anders

ANGSD commented 7 years ago

Hello John, i agree this is not intuitive. This is infact the most ancient issue with angsd on the github issue lidt. https://github.com/ANGSD/angsd/issues/20, dating back to november 2015

Best

On Tue, Jul 25, 2017 at 5:29 PM, johnburley3000 notifications@github.com wrote:

Hi Anders,

I have found that there is in fact output data for my scaffold_100 attempt (and the -dumpCounts output shows that there are biallelic sites).

The problem, seems to be minor, although you may be interested. The output, when I specify -r scaffold_100, gives empty rows for each scaffold up to scaffold_100, and then it begins to output data as expected. I haven't noticed this for any of the other ANGSD functions, when I use the -r option.

The .abbababa and .args files are attached, if you would like to see, from the run:

$ANGSD -doAbbababa 1 -bam Code/D_test1.filelist -out Results/Abba/D_test2 -doCounts 1 -dumpCounts 3 -anc $ANC -minMapQ 20 -minQ 20 -blockSize 100000 -r scaffold_100 -sites $SITES

On Tue, Jul 25, 2017 at 10:46 AM, John Burley john.burley3000@gmail.com wrote:

Hi Anders,

Thanks for the response. I'm currently running -doAbbababa for all of my autosomal scaffolds and z-chromosome separately, and there are no apparent problems. I am now running for 5 individuals from different populations, as in your BMC Bioinformatics 2014 example.

I wanted to tell you about how it had failed with scaffold_100 because it might indicate a bug. I was only doing this for testing purposes - I'm not actually interested in that scaffold particularly.

I tested the smaller blockSize values because I thought perhaps there is a minimum block number, but that is probably not the case.

I will now run for scaffold_100 with -doCounts 1 -dumpCounts 3, as you suggest. I'll let you know whether there should be ABBA or BABA sites.

John

On Tue, Jul 25, 2017 at 10:32 AM, Anders Albrechtsen < notifications@github.com> wrote:

Dear John

If I understand you correctly then the problem is that you don't have any output when using one of your scaffolds?

You will get an output of zeroes if there are no ABBA or BABA sites sampled. Have you looked at your input to verify that you have data from all 3? individuals you are using and that there should be ABBA or BABA sites. You can use -doCounts 1 -dumpCounts 3 to see the number of A/G/C/T for each individual

ps. I would not recommend using a small blockSize unless you are certain the you effective population size is so large that you don't expect LD between markers futher away from each other than the block size.

-Anders

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317756706, or mute the thread https://github.com/notifications/unsubscribe- auth/ARexMiBx6ZOqZvolQq_pmo_Z-i7m4j9uks5sRfx8gaJpZM4OcHQw .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317774844, or mute the thread https://github.com/notifications/unsubscribe-auth/AGDo7sySuqGV3XPRZ1A49M_W5PB19p2Sks5sRgnvgaJpZM4OcHQw .

johnburley3000 commented 7 years ago

Hi Anders,

Thanks again for your help, and for developing ANGSD - it's great!

John

On Thu, Jul 27, 2017 at 5:03 AM, ANGSD notifications@github.com wrote:

Hello John, i agree this is not intuitive. This is infact the most ancient issue with angsd on the github issue lidt. https://github.com/ANGSD/angsd/issues/20, dating back to november 2015

Best

On Tue, Jul 25, 2017 at 5:29 PM, johnburley3000 notifications@github.com wrote:

Hi Anders,

I have found that there is in fact output data for my scaffold_100 attempt (and the -dumpCounts output shows that there are biallelic sites).

The problem, seems to be minor, although you may be interested. The output, when I specify -r scaffold_100, gives empty rows for each scaffold up to scaffold_100, and then it begins to output data as expected. I haven't noticed this for any of the other ANGSD functions, when I use the -r option.

The .abbababa and .args files are attached, if you would like to see, from the run:

$ANGSD -doAbbababa 1 -bam Code/D_test1.filelist -out Results/Abba/D_test2 -doCounts 1 -dumpCounts 3 -anc $ANC -minMapQ 20 -minQ 20 -blockSize 100000 -r scaffold_100 -sites $SITES

On Tue, Jul 25, 2017 at 10:46 AM, John Burley <john.burley3000@gmail.com

wrote:

Hi Anders,

Thanks for the response. I'm currently running -doAbbababa for all of my autosomal scaffolds and z-chromosome separately, and there are no apparent problems. I am now running for 5 individuals from different populations, as in your BMC Bioinformatics 2014 example.

I wanted to tell you about how it had failed with scaffold_100 because it might indicate a bug. I was only doing this for testing purposes - I'm not actually interested in that scaffold particularly.

I tested the smaller blockSize values because I thought perhaps there is a minimum block number, but that is probably not the case.

I will now run for scaffold_100 with -doCounts 1 -dumpCounts 3, as you suggest. I'll let you know whether there should be ABBA or BABA sites.

John

On Tue, Jul 25, 2017 at 10:32 AM, Anders Albrechtsen < notifications@github.com> wrote:

Dear John

If I understand you correctly then the problem is that you don't have any output when using one of your scaffolds?

You will get an output of zeroes if there are no ABBA or BABA sites sampled. Have you looked at your input to verify that you have data from all 3? individuals you are using and that there should be ABBA or BABA sites. You can use -doCounts 1 -dumpCounts 3 to see the number of A/G/C/T for each individual

ps. I would not recommend using a small blockSize unless you are certain the you effective population size is so large that you don't expect LD between markers futher away from each other than the block size.

-Anders

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317756706, or mute the thread https://github.com/notifications/unsubscribe- auth/ARexMiBx6ZOqZvolQq_pmo_Z-i7m4j9uks5sRfx8gaJpZM4OcHQw .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317774844, or mute the thread https://github.com/notifications/unsubscribe- auth/AGDo7sySuqGV3XPRZ1A49M_W5PB19p2Sks5sRgnvgaJpZM4OcHQw .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-318303662, or mute the thread https://github.com/notifications/unsubscribe-auth/ARexMrqf-ztiWHdbW29UZaKbSBCTdw4yks5sSFJzgaJpZM4OcHQw .

johnburley3000 commented 7 years ago

Hello again:

Two quick notes regarding the R scripts to modify and plot joint SFS:

I've found what might be a typo in ngsTools/ngsTools/Scripts/SFSangsd2tools.R

The final line was:

write.table(matrix(sfs,nrow=((2*n1)+1),ncol=((2*n1)+1)), sep="\t", row.names=F, col.names=F, quote=F)

I think it should be:

write.table(matrix(sfs,nrow=((2*n1)+1),ncol=((2*n2)+1)), sep="\t", row.names=F, col.names=F, quote=F)

Please let me know me if I am mistaken about this.

The code to plot 2D SFS: ngsTools/ngsTools/Scripts/plot2DSFS.R does not work properly for my cases with unequal sample size (although the .sfs file for the lower plot looks fine). See below for examples that work and do not work:

[image: Inline image 1]

[image: Inline image 2]

On Sun, Jul 30, 2017 at 11:38 AM, John Burley john.burley3000@gmail.com wrote:

Hi Anders,

Thanks again for your help, and for developing ANGSD - it's great!

John

On Thu, Jul 27, 2017 at 5:03 AM, ANGSD notifications@github.com wrote:

Hello John, i agree this is not intuitive. This is infact the most ancient issue with angsd on the github issue lidt. https://github.com/ANGSD/angsd/issues/20, dating back to november 2015

Best

On Tue, Jul 25, 2017 at 5:29 PM, johnburley3000 <notifications@github.com

wrote:

Hi Anders,

I have found that there is in fact output data for my scaffold_100 attempt (and the -dumpCounts output shows that there are biallelic sites).

The problem, seems to be minor, although you may be interested. The output, when I specify -r scaffold_100, gives empty rows for each scaffold up to scaffold_100, and then it begins to output data as expected. I haven't noticed this for any of the other ANGSD functions, when I use the -r option.

The .abbababa and .args files are attached, if you would like to see, from the run:

$ANGSD -doAbbababa 1 -bam Code/D_test1.filelist -out Results/Abba/D_test2 -doCounts 1 -dumpCounts 3 -anc $ANC -minMapQ 20 -minQ 20 -blockSize 100000 -r scaffold_100 -sites $SITES

On Tue, Jul 25, 2017 at 10:46 AM, John Burley < john.burley3000@gmail.com> wrote:

Hi Anders,

Thanks for the response. I'm currently running -doAbbababa for all of my autosomal scaffolds and z-chromosome separately, and there are no apparent problems. I am now running for 5 individuals from different populations, as in your BMC Bioinformatics 2014 example.

I wanted to tell you about how it had failed with scaffold_100 because it might indicate a bug. I was only doing this for testing purposes - I'm not actually interested in that scaffold particularly.

I tested the smaller blockSize values because I thought perhaps there is a minimum block number, but that is probably not the case.

I will now run for scaffold_100 with -doCounts 1 -dumpCounts 3, as you suggest. I'll let you know whether there should be ABBA or BABA sites.

John

On Tue, Jul 25, 2017 at 10:32 AM, Anders Albrechtsen < notifications@github.com> wrote:

Dear John

If I understand you correctly then the problem is that you don't have any output when using one of your scaffolds?

You will get an output of zeroes if there are no ABBA or BABA sites sampled. Have you looked at your input to verify that you have data from all 3? individuals you are using and that there should be ABBA or BABA sites. You can use -doCounts 1 -dumpCounts 3 to see the number of A/G/C/T for each individual

ps. I would not recommend using a small blockSize unless you are certain the you effective population size is so large that you don't expect LD between markers futher away from each other than the block size.

-Anders

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317756706, or mute the thread https://github.com/notifications/unsubscribe- auth/ARexMiBx6ZOqZvolQq_pmo_Z-i7m4j9uks5sRfx8gaJpZM4OcHQw .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317774844, or mute the thread https://github.com/notifications/unsubscribe-auth/ AGDo7sySuqGV3XPRZ1A49M_W5PB19p2Sks5sRgnvgaJpZM4OcHQw .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-318303662, or mute the thread https://github.com/notifications/unsubscribe-auth/ARexMrqf-ztiWHdbW29UZaKbSBCTdw4yks5sSFJzgaJpZM4OcHQw .

mfumagalli commented 7 years ago

Hi there,

there is a new script in ngsTools to plot the 2D-SFS and it should work with unequal sample size (see the Tutorial), let me know otherwise.

cheers

Matteo

On 6 August 2017 at 05:55, johnburley3000 notifications@github.com wrote:

Hello again:

Two quick notes regarding the R scripts to modify and plot joint SFS:

I've found what might be a typo in ngsTools/ngsTools/Scripts/SFSangsd2tools.R

The final line was:

write.table(matrix(sfs,nrow=((2*n1)+1),ncol=((2*n1)+1)), sep="\t", row.names=F, col.names=F, quote=F)

I think it should be:

write.table(matrix(sfs,nrow=((2*n1)+1),ncol=((2*n2)+1)), sep="\t", row.names=F, col.names=F, quote=F)

Please let me know me if I am mistaken about this.

The code to plot 2D SFS: ngsTools/ngsTools/Scripts/plot2DSFS.R does not work properly for my cases with unequal sample size (although the .sfs file for the lower plot looks fine). See below for examples that work and do not work:

[image: Inline image 1]

[image: Inline image 2]

On Sun, Jul 30, 2017 at 11:38 AM, John Burley john.burley3000@gmail.com

wrote:

Hi Anders,

Thanks again for your help, and for developing ANGSD - it's great!

John

On Thu, Jul 27, 2017 at 5:03 AM, ANGSD notifications@github.com wrote:

Hello John, i agree this is not intuitive. This is infact the most ancient issue with angsd on the github issue lidt. https://github.com/ANGSD/angsd/issues/20, dating back to november 2015

Best

On Tue, Jul 25, 2017 at 5:29 PM, johnburley3000 < notifications@github.com

wrote:

Hi Anders,

I have found that there is in fact output data for my scaffold_100 attempt (and the -dumpCounts output shows that there are biallelic sites).

The problem, seems to be minor, although you may be interested. The output, when I specify -r scaffold_100, gives empty rows for each scaffold up to scaffold_100, and then it begins to output data as expected. I haven't noticed this for any of the other ANGSD functions, when I use the -r option.

The .abbababa and .args files are attached, if you would like to see, from the run:

$ANGSD -doAbbababa 1 -bam Code/D_test1.filelist -out Results/Abba/D_test2 -doCounts 1 -dumpCounts 3 -anc $ANC -minMapQ 20 -minQ 20 -blockSize 100000 -r scaffold_100 -sites $SITES

On Tue, Jul 25, 2017 at 10:46 AM, John Burley < john.burley3000@gmail.com> wrote:

Hi Anders,

Thanks for the response. I'm currently running -doAbbababa for all of my autosomal scaffolds and z-chromosome separately, and there are no apparent problems. I am now running for 5 individuals from different populations, as in your BMC Bioinformatics 2014 example.

I wanted to tell you about how it had failed with scaffold_100 because it might indicate a bug. I was only doing this for testing purposes - I'm not actually interested in that scaffold particularly.

I tested the smaller blockSize values because I thought perhaps there is a minimum block number, but that is probably not the case.

I will now run for scaffold_100 with -doCounts 1 -dumpCounts 3, as you suggest. I'll let you know whether there should be ABBA or BABA sites.

John

On Tue, Jul 25, 2017 at 10:32 AM, Anders Albrechtsen < notifications@github.com> wrote:

Dear John

If I understand you correctly then the problem is that you don't have any output when using one of your scaffolds?

You will get an output of zeroes if there are no ABBA or BABA sites sampled. Have you looked at your input to verify that you have data from all 3? individuals you are using and that there should be ABBA or BABA sites. You can use -doCounts 1 -dumpCounts 3 to see the number of A/G/C/T for each individual

ps. I would not recommend using a small blockSize unless you are certain the you effective population size is so large that you don't expect LD between markers futher away from each other than the block size.

-Anders

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317756706, or mute the thread https://github.com/notifications/unsubscribe- auth/ARexMiBx6ZOqZvolQq_pmo_Z-i7m4j9uks5sRfx8gaJpZM4OcHQw .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-317774844, or mute the thread https://github.com/notifications/unsubscribe-auth/ AGDo7sySuqGV3XPRZ1A49M_W5PB19p2Sks5sRgnvgaJpZM4OcHQw .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-318303662, or mute the thread https://github.com/notifications/unsubscribe-auth/ARexMrqf- ztiWHdbW29UZaKbSBCTdw4yks5sSFJzgaJpZM4OcHQw .

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/92#issuecomment-320484060, or mute the thread https://github.com/notifications/unsubscribe-auth/ACGvCalGDerN9hIqFwZSUhvqsJiXdACgks5sVTkngaJpZM4OcHQw .