faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/
Other
80 stars 49 forks source link

Workflows: phasing.conf contig paths #264

Closed StephenBaca closed 2 years ago

StephenBaca commented 2 years ago

Hi Dr. Faircloth,

I had two questions about data phasing and SNP calling via Workflows.

  1. I noticed that the paths to your contigs in your example phasing.conf file were different from those in the mapping.conf. And I was bit confused by the wording in the notes: "....you can copy over the contigs section of your mapping file...". I just wanted to clarify, am I directing the workflows phasing script (via contig paths in phasing.conf) to a different set of contigs than those reference contigs used in mapping (mapping.conf)? Edit: Ok, pretty sure I understand. I can copy the "Contigs section" from mapping.conf to the phasing.conf file.

  2. I am doing some species and population level analyses on my group. I am hoping to get to the end result of a pulling one random SNP per UCE locus for analyses in STRUCTURE, etc., optimally selected with read coverage and data completeness thresholds. The output vcfs of phasing should be very useful for this. To restrict phasing and thereby output vcfs to UCE loci, would it make sense to match contigs to probes and explode the monolithic fasta (by sample) to produce reference UCE contigs for mapping reads? This rather than mapping reads to all assembled contigs and pulling UCEs downstream?

Thanks much, Stephen Baca

StephenBaca commented 2 years ago

Follow-up: I tried using the probe-matched and exploded fastas as reference contigs for mapping reads. Everything seemed to work fine, however I noticed now the output vcfs do not include a coverage column, as was present when phasing reads to all assembled contigs. Would this be expected or did I do something wrong?

Thanks again, Stephen

brantfaircloth commented 2 years ago

Hi Stephen,

With respect to 1. - the paths should be the same to the contigs. The examples given are just taken from the software tests that run and there is a reason they are different (but point to the same basic place).

With respect to 2., I would use the exploded UCE contigs as the reference to which you are mapping.

Finally, I am not sure why you are seeing a difference between the output files - the programs on the back end that get run are identical - the only difference being the reference contigs that are input. When I look at the VCF files that are produced (and expected) by the software tests from the phasing workflow, those appear to be what one would expect (and these are run against an exploded set of UCE contigs).

StephenBaca commented 2 years ago

Much appreciate the quick response and clarifications!

For 2: I will start again from my SPAdes assemblies and run through the whole thing again to make sure I didn't miss anything.

Thanks again!

StephenBaca commented 2 years ago

Hi Dr. Faircloth, Ran through the whole thing from matching contigs to exploding the monolith fasta. Still not seeing coverage information in the output vcfs. A bit confused as I didn't change anything I did before barring using the exploded fastas as my contigs. Any thoughts on where the problem might lie? Thanks much, Stephen

brantfaircloth commented 2 years ago

Hmmm. That's odd. You could try to run the phasing externally (e.g. not using the workflow) and seeing if you get the same result. Alternatively, maybe I'm confused about what you mean re: coverage information?

StephenBaca commented 2 years ago

Thanks! Yes, I am trying now to follow Dr. Harvey's seq_cap_pop workflow now.

In going back and looking at the output vcfs, I see that I was confused on coverage info. For some reason, when mapping reads to all assembled contigs (without matching to probes first), the ID string in the phased vcfs included "...cov_1.32...." (see below) which made it look like there was a coverage column that was not included in the output vcfs after matching the contigs to probes.

I apologize for the confusion. I'm in new territory here in trying to responsibly harvest SNP data for downstream analyses. For that matter, I appreciate your patience and help with clarifying issues.

Here is why I was confused:

vcf output when reads mapped to all (unmatched) contigs: SLE895.1.vcf

.....

contig=

contig=

contig=

contig=

contig=

contig=

vcf output when using probe matched contigs

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

brantfaircloth commented 2 years ago

You don’t need to worry about the values in the VCF header. the actual coverage will be lower in the file when you get to the SNP calls 👍.

and, no worries. the best way to find what you need is to read through the VCF specification, here: https://samtools.github.io/hts-specs/VCFv4.2.pdf