ANGSD / angsd

Program for analysing NGS data.

doFasta misses indels #233

Open PAMorin opened 5 years ago

PAMorin commented 5 years ago

I'm using -doFasta 2 to extract the consensus sequence from bwa alignment (BAM) files for mitochondrial genomes. I've found that if there is a variable-length repeat, ANGSD always generates the longer sequence, even if most of the reads have the shorter repeat (e.g., if most are CCC but a few are CCCC, the consensus is always CCCC). I think this is because ANGSD only counts the number of A, C, G, or T occurrences at each position and ignores the indel "-" in the BAM file. Is there a way to refine the BAM alignment, or otherwise use ANGSD doFasta, so that indels are recognized?
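The behaviour described above can be reproduced with a toy per-column majority vote. This is only an illustrative sketch, not ANGSD's code: the function `column_consensus`, the `count_gaps` switch, and the four hand-aligned reads are all hypothetical, and real pileups work on BAM records rather than padded strings.

```python
from collections import Counter


def column_consensus(rows, count_gaps):
    """Majority-vote consensus over pre-aligned rows of equal length.

    '-' marks a position where a read carries a deletion relative to the
    reference. With count_gaps=False the gaps are ignored, which mimics
    counting only A/C/G/T at each position (an assumption about the
    behaviour reported in this thread, not ANGSD's actual implementation).
    """
    consensus = []
    for column in zip(*rows):
        votes = Counter(column)
        if not count_gaps:
            votes.pop("-", None)  # discard gap votes entirely
        if not votes:
            continue  # column had nothing but gaps
        base, _ = votes.most_common(1)[0]
        if base != "-":  # a winning gap means the position is dropped
            consensus.append(base)
    return "".join(consensus)


# Hypothetical data: three reads with the short repeat (CCC), one with CCCC.
reads = ["ACCC-T", "ACCC-T", "ACCC-T", "ACCCCT"]

print(column_consensus(reads, count_gaps=False))  # ACCCCT (longer repeat wins)
print(column_consensus(reads, count_gaps=True))   # ACCCT  (majority allele)
```

In the last repeat column only one read contributes a base, so a base-only count calls C there even though three of four reads support the deletion.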

aalbrechtsen commented 5 years ago

Dear PAMorin

ANGSD can only generate a FASTA file in the coordinates of the reference genome that the reads were aligned to, so unfortunately it cannot include indels in the FASTA file.

-Anders

ekg commented 5 years ago

I believe you can use bcftools consensus to also handle indels.

PAMorin commented 5 years ago

Thanks. I am starting to use the bcftools approach, as described at: https://samtools.github.io/bcftools/howtos/consensus-sequence.html. It appears to be doing a good job of calling indels around variable repeats.

Phil

--

Phillip A. Morin, Ph.D. Southwest Fisheries Science Center 8901 La Jolla Shores Dr. La Jolla, CA 92037, USA Phone: 858-546-7165 Fax: 858-546-7003 phillip.morin@noaa.gov http://swfsc.noaa.gov/mmtd-mmgenetics

ekg commented 5 years ago

As for calling indels, I'm not sure how good bcftools is. But it does have the tool to generate the consensus.
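For readers who want to try the bcftools route discussed in this thread, here is a minimal sketch that assembles the commands from Python. The file names, the `prefix` naming scheme, and the helper names `consensus_commands` and `run_pipeline` are assumptions made for illustration, and `--ploidy 1` is an assumption suited to haploid mitochondrial data. The normalisation and filtering steps shown in the bcftools how-to are left out for brevity, so treat this as a starting point rather than a validated workflow.

```python
import subprocess


def consensus_commands(ref_fasta, bam, prefix):
    """Build the bcftools command lines for a simple consensus workflow.

    All output file names are hypothetical and derived from `prefix`.
    """
    raw_bcf = f"{prefix}.raw.bcf"
    calls = f"{prefix}.calls.vcf.gz"
    consensus = f"{prefix}.consensus.fa"
    return [
        # Pile up the reads against the reference (uncompressed BCF).
        ["bcftools", "mpileup", "-Ou", "-f", ref_fasta, "-o", raw_bcf, bam],
        # Call variants, including indels; haploid model is an assumption.
        ["bcftools", "call", "--ploidy", "1", "-mv", "-Oz", "-o", calls, raw_bcf],
        # Index the compressed calls so that consensus can read them.
        ["bcftools", "index", calls],
        # Apply the called variants to the reference sequence.
        ["bcftools", "consensus", "-f", ref_fasta, "-o", consensus, calls],
    ]


def run_pipeline(ref_fasta, bam, prefix):
    """Run each step in order, stopping at the first failure."""
    for cmd in consensus_commands(ref_fasta, bam, prefix):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("ref.fa", "aln.bam", "sample")` would require bcftools on the PATH. Because `bcftools consensus` applies variants to the reference, positions without read coverage keep the reference base unless they are masked separately, and indel calls around repeats are worth checking by eye, as noted in the comment above.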
