phyluce_align_get_only_loci_with_min_taxa not giving the correct number of taxa in output

camilocalderon commented 5 years ago

Hi, I would like to know if people have encountered the following issue:

I am trying to build a complete data matrix using the --percentage 1.0 \ flag with the phyluce_align_get_only_loci_with_min_taxa script. I have 42 taxa in my UCE dataset. Based on the example in Tutorial I of the Read the Docs documentation for PHYLUCE, the percentage is actually a proportion float value, so 1.0 would be the correct value for 100%, 0.95 would be correct for 95%, and so on. However, after reduction with taxon completeness percentage of 100% and then phasing, the output alignments do not contain 100% of my taxa. I see incorrect numbers of phased taxa also when using other percentage values with this procedure. Is there a bug?

Additionally, when working with percentages other than 100%, such as 75%, we find that not all alleles are present for each sample, presumably due to homozygotes. But designations for the single remaining sequence in such cases is either _0 or _1. Is the retention of one sequence for homozygotes random?

Thanks!

BenjaminBlanchard commented 4 years ago

I am experiencing this same issue! I have attached my summary file (this is the summary output for my internal and Gblocks trimmed dataset with 155 taxa) here: phyluce_align_get_align_summary_data.log. As clearly stated in this file, the Matrix 75% data matrix includes 2011 alignments. Yet, when I run phyluce_align_get_only_loci_with_min_taxa, it only copies 86 alignments. Similarly, the Matrix 65% data matrix includes 2227 alignments according to the summary file, yet only 1584 alignments are copied.

The functions I ran at this stage are:

phyluce_align_seqcap_align --fasta all-taxa-incomplete.fasta --output mafft-nexus-internal-trimmed --taxa 155 --aligner mafft --cores 14 --incomplete-matrix --output-format fasta --no-trim --log-path log

Then:

phyluce_align_get_gblocks_trimmed_alignments_from_untrimmed --alignments mafft-nexus-internal-trimmed --output mafft-nexus-internal-trimmed-gblocks --cores 14 --log log

Then the output summary stats function (which produced the file linked above) and the function to remove locus names from nexus lines, followed by:

phyluce_align_get_only_loci_with_min_taxa --alignments mat-nexus-internal-trimmed-gblocks-clean --taxa 155 --percent 0.75 --output mafft-nexus-internal-trimmed-gblocks-clean-75p --cores 14 --log-path log

Which copies 86 alignments when it should copy 2011, according to the summary data log file (and 1584 instead of 2227 for --percent 0.65, as an arbitrary additional example).

Any idea what might be going on here?

brantfaircloth commented 4 years ago

Getting 86 alignments when 2011 are expected is definitely wrong, although I don't know why. Getting something like 1584 when 2227 are expected makes a bit more sense - the reasons have to do with rounding and the fact that the alignment summary log values are approximate.

brantfaircloth commented 4 years ago

@camilocalderon - it is likely the phasing is affecting the numbers of taxa within a given alignment when you start w/ a 100% complete matrix. as for all alleles not being present for all samples, it may be that only one sequence is being retained when you have a heterozygote - you'll need to dig into the data to help verify (and/or fix whatever bug is causing this).

BenjaminBlanchard commented 4 years ago

One thing I noticed, I don't know if it matters - but in the summary log file, it has Taxa [min] as 5 and Taxa [max] as 123. I have 155 taxa. But if you take 0.75 of 123, you get 92.25. And 92.25/155 = 59.5%. When I set the fraction as 0.595, I get 2011 alignments copied...

brantfaircloth commented 4 years ago

@BenjaminBlanchard ok - that helps. so it looks like what is happening is that you never have an alignment with 100% of you taxa in there, so the code is using the maximum number of taxa detected in any alignment versus the maximum number you specify. I'll need to fix that, but for the moment it looks like you've got what you want.

BenjaminBlanchard commented 4 years ago

Ohhhh ok! Glad to have identified the issue - and yes, this is reasonable, as I know I have a set of samples in here from expected-to-be fairly degraded DNA (e.g. one sample with only ~30 alignments) - likely will drop them, but wanted to start with the full dataset. This also makes more sense now, as I was a bit surprised to have had as high values as I did, knowing I had ~20 particularly low-alignment-number samples. Thanks... I'm glad to finally have hit an issue that wasn't only user error haha.

brantfaircloth commented 4 years ago

Back in the day, we always assumed you would have at least some loci where you had 100% taxon occupancy, but over time that assumption has become less true. funny story - people used to reject papers if you didn't have a 100% matrix in them.

gehara commented 3 years ago

Hi all, I am having the same issue as @camilocalderon. Several samples are not duplicated after phasing and the "_0" or "_1" allele tag seems random. Is there a way to correct that? Any guess? This could be a problem down the line if these samples are not homozygous but heterozygous missing one allele.

faircloth-lab / phyluce

phyluce_align_get_only_loci_with_min_taxa not giving the correct number of taxa in output #149