chadlaing / Panseq

Pan-genomic sequence analysis
http://lfz.corefacility.ca/panseq
GNU General Public License v3.0
43 stars 14 forks source link

SNP positions #22

Closed moorembioinfo closed 6 years ago

moorembioinfo commented 6 years ago

Hi Chad,

I was interested in extracting a full core genome alignment from panseq by taking the core genome fragments and the core_snps.txt output to generate a multiple alignment for each gene, with each allele.

Panseq was run to generate 500bp fragments (and all the core genome fragments are 500bp) but in the core_snps.txt there are snps reported beyond position 500?

Thanks in advance for any help with this!!

Best regards,

Matt

chadlaing commented 6 years ago

Hi Matt,

The positions reported are relative to the original contig, not the fragment. So if you had a contig 20000 bp long, even though the fragment may only be 500 bp, the position will be with respect to the original length of 20000.

I hope this helps, Chad

moorembioinfo commented 6 years ago

Hi Chad,

Thanks for always getting back to me! That does help thanks.

If I can quickly ask another question, is there a way from the output data then to know where the snps are relative to the core genome fragments in order to generate a core genome alignment with variant and invariant sites (for assessing recombination).

Or, for example, to pull matched fragments out of Panseq earlier on?

Best regards,

Matt

moorembioinfo commented 6 years ago

Hi Chad,

Just as a quick follow up too, I realise now that you mean the positions are relative to the original reference contig, that is helpful, thanks.

I was initially concerned because there's only a handful of results that are beyond position 500 so I though perhaps this was some error where I thought it was actually supposed to be referring to fragments.

Knowing now that it refers to the reference genome this is a little concerning that all SNPs would be in the first 520bp of the reference genome. Having a look at the core_snps.txt file it seems that I have 498 unique positions in which there are SNPs (the reference genome is all one contig so these are absolute positions rather than mistakenly deduplicating for the same position in various contigs). However in snp.phylip 4371 polymorphic sites are reported. Shouldn't unique positions against a complete reference genome in core_snps.txt = the number of polymorphic sites in snp.phylip?

Thanks again in advance for any help with this!!

Best regards,

Matt

chadlaing commented 6 years ago

Hi Matt,

Do you have a small example you could send me?

Thanks, Chad

moorembioinfo commented 6 years ago

Hi Chad,

I noticed that I've asked you the question of core genome fragment extraction previously and your advice was to run storeAlleles 1 This gave me the fragments I wanted in locus_alleles.txt which I've filtered for those that are core, if I'm understanding this output properly!

On the core_snps.txt issue I've attached the files core_snps.txt and snp.phylip output.

The reference genome is in one contig and there are 495 unique SNP positions whereas snp.phylip reports 3596. Shouldn't these be the same or am I misunderstanding?

Thanks!

Matt

core_snps.txt.gz snp.phylip.gz

chadlaing commented 6 years ago

Hi Matt,

Yes, the storeAlleles option should get you what you want. I'll take a look at the provided files and get back to you.

Thanks, Chad

chadlaing commented 6 years ago

Hi Matt,

There should be a SNP listed for every genome provided that sequence data exist. The output would include SNPs for all differences among the genomes, and not just with respect to a reference.

Chad