phyluce_align_get_only_loci_with_min_taxa removes too many alignments

alexkrohn commented 3 years ago

I'm currently running 108 individuals of one putative species (estimated divergence among the individuals is ~2 million years) through Pyluce 1.7.1 using the 5k amniote UCE probes, default parameters, and only edge trimming. One exception to the defaults is that I used velvet assembly with a kmer of 35 instead of spades. The summary of the alignment says that I should have 157 alignments present in 75% of my individuals (see attached align_summary_edge_trim.txt). However, after trimming away loci names, and running phyluce_align_get_only_loci_with_min_taxa with --percent 0.75, only one alignment remains (see attached phyluce_align_get_only_loci_with_min_taxa.log)

Given the tutorials show the same number of alignments post-individual filtering as in the align_summary_edge_trim.txt file, I'm a bit confused as to why this is happening. Any ideas?

Background: running Phyluce 1.7.1 on an Ubuntu server running 16.04.6. We usually have to use a Docker shell to get 1.7.1 to work properly.

alexkrohn commented 3 years ago

Edit: I now see this this issue delves into similar issues. After using the --show-taxon-counts flag in the align_summary_data command, I see that there is little continuity of alignments across individuals. Alignments are maximized when only using 48 taxa (44% of the starting 108), and even then, it's only 100 loci from a starting 5k. (See below for truncated output.)

Any thoughts on why this might be? Problems with the assembly? I had no indication that there were too few reads, or poor quality reads from any individuals coming off the sequencer. Given that this is an intraspecific project (with two outgroup individuals from two different genera), I'm really surprised at the loci dropout.

.... 2021-05-11 17:06:37,985 - phyluce_align_get_align_summary_data - INFO - [Taxa] 100 alignments contain 48 taxa 2021-05-11 17:06:37,985 - phyluce_align_get_align_summary_data - INFO - [Taxa] 82 alignments contain 49 taxa 2021-05-11 17:06:37,985 - phyluce_align_get_align_summary_data - INFO - [Taxa] 82 alignments contain 50 taxa 2021-05-11 17:06:37,986 - phyluce_align_get_align_summary_data - INFO - [Taxa] 91 alignments contain 51 taxa 2021-05-11 17:06:37,986 - phyluce_align_get_align_summary_data - INFO - [Taxa] 87 alignments contain 52 taxa 2021-05-11 17:06:37,986 - phyluce_align_get_align_summary_data - INFO - [Taxa] 85 alignments contain 53 taxa 2021-05-11 17:06:37,986 - phyluce_align_get_align_summary_data - INFO - [Taxa] 89 alignments contain 54 taxa 2021-05-11 17:06:37,986 - phyluce_align_get_align_summary_data - INFO - [Taxa] 70 alignments contain 55 taxa 2021-05-11 17:06:37,986 - phyluce_align_get_align_summary_data - INFO - [Taxa] 63 alignments contain 56 taxa 2021-05-11 17:06:37,986 - phyluce_align_get_align_summary_data - INFO - [Taxa] 47 alignments contain 57 taxa 2021-05-11 17:06:37,986 - phyluce_align_get_align_summary_data - INFO - [Taxa] 42 alignments contain 58 taxa 2021-05-11 17:06:37,987 - phyluce_align_get_align_summary_data - INFO - [Taxa] 32 alignments contain 59 taxa 2021-05-11 17:06:37,987 - phyluce_align_get_align_summary_data - INFO - [Taxa] 32 alignments contain 60 taxa 2021-05-11 17:06:37,987 - phyluce_align_get_align_summary_data - INFO - [Taxa] 32 alignments contain 61 taxa 2021-05-11 17:06:37,987 - phyluce_align_get_align_summary_data - INFO - [Taxa] 23 alignments contain 62 taxa 2021-05-11 17:06:37,987 - phyluce_align_get_align_summary_data - INFO - [Taxa] 18 alignments contain 63 taxa 2021-05-11 17:06:37,987 - phyluce_align_get_align_summary_data - INFO - [Taxa] 21 alignments contain 64 taxa 2021-05-11 17:06:37,987 - phyluce_align_get_align_summary_data - INFO - [Taxa] 15 alignments contain 65 taxa 2021-05-11 17:06:37,987 - phyluce_align_get_align_summary_data - INFO - [Taxa] 9 alignments contain 66 taxa 2021-05-11 17:06:37,988 - phyluce_align_get_align_summary_data - INFO - [Taxa] 18 alignments contain 67 taxa 2021-05-11 17:06:37,988 - phyluce_align_get_align_summary_data - INFO - [Taxa] 9 alignments contain 68 taxa 2021-05-11 17:06:37,988 - phyluce_align_get_align_summary_data - INFO - [Taxa] 13 alignments contain 69 taxa 2021-05-11 17:06:37,988 - phyluce_align_get_align_summary_data - INFO - [Taxa] 6 alignments contain 70 taxa 2021-05-11 17:06:37,988 - phyluce_align_get_align_summary_data - INFO - [Taxa] 3 alignments contain 71 taxa 2021-05-11 17:06:37,988 - phyluce_align_get_align_summary_data - INFO - [Taxa] 7 alignments contain 72 taxa 2021-05-11 17:06:37,988 - phyluce_align_get_align_summary_data - INFO - [Taxa] 3 alignments contain 73 taxa 2021-05-11 17:06:37,988 - phyluce_align_get_align_summary_data - INFO - [Taxa] 5 alignments contain 75 taxa 2021-05-11 17:06:37,989 - phyluce_align_get_align_summary_data - INFO - [Taxa] 2 alignments contain 76 taxa 2021-05-11 17:06:37,989 - phyluce_align_get_align_summary_data - INFO - [Taxa] 3 alignments contain 77 taxa 2021-05-11 17:06:37,989 - phyluce_align_get_align_summary_data - INFO - [Taxa] 1 alignments contain 78 taxa 2021-05-11 17:06:37,989 - phyluce_align_get_align_summary_data - INFO - [Taxa] 1 alignments contain 83 taxa

brantfaircloth commented 3 years ago

It could be assembly, although the issue probably happened before that. Velvet can be somewhat of a sketchy assembler without read correction, so I'd be a little bit cautious. Still, that's unlikely to be the main issue.

I would go back and look at the output of match_contigs_to_probes. For the 5k bait set, you should be consistently getting about the same number of loci per individual within a species, that number should be high (~4400 for birds, for example), and those should be overlapping loci. If the number is really variable per individual or really low, more generally, (both of which could make your results look like the above), then the problem is most likely in the enrichment and/or sequencing.

For sequencing, and if enrichments worked well, you should be fine with 3-6 million reads per individual (1.5 to 3 million pairs/clusters per individual). if you have (approximately) that number of reads per individual (you can get by with less, but I'm being very general), then the problem may have occurred during the enrichment steps. If you don't have that number of reads per sample, then it might be good to look at what count of reads you do have and see how that count correlates w/ number of loci per sample.

These steps should help you figure out what's going on.

alexkrohn commented 3 years ago

Thanks for your help! This did help. It does look like we had variable success in either sequencing or enrichment, given the range in number of reads per individuals (see here and here). I didn't do the lab work on this project, but I'm still not sure why we got relatively few loci (max ~2,000) with 106 individuals of one species, and 2 outgroup individuals.

In general, do you suggest removing individuals with low reads from the beginning? It seems like since Phyluce keeps individuals separate until the alignment step, (given no time constraints with computing power) you wouldn't have to remove low quality individuals until just before alignment. What do you suggest?

brantfaircloth commented 3 years ago

Something sounds a little screwy here - seems like, overall, the number of reads is a little low (although it's sort of hard to tell with the boxplots). As for the max 2000 loci, that is also weird... but it sort of depends on what types of organisms these are (e.g. birds versus squamates versus mammals).

We usually remove individuals with VERY low read counts at the beginning (say, < 100,000 reads). Then, we assemble and see what the results look like for those that remain. Spades should give you slightly longer loci and potentially more loci than velvet, but, again, that is probably not the major issue here. Once we see what the counts of UCE loci per sample look like, we'll usually filter out additional poorly performing individuals/samples if those exist (I take the output of match_contigs_to_probes and stick in excel, then split on spaces, and you basically have your table of results for UCEs enriched per sample).

In terms of what could have gone sideways during labwork, that (again) depends on what organisms these are and potentially how the enrichments were performed (e.g. in pools of 8 libraries? in larger pools? etc.)? If the libraries were enriched commercially, I would ask them to send you the protocol followed, which may help clear things up.

alexkrohn commented 3 years ago

Thanks for your help. I'll have to ask around for what might have gone wrong in the lab work.

For now I'll try re-running the analysis with fewer individuals from the start, then removing again any individuals that have very few UCE matches. Thanks again!

faircloth-lab / phyluce

phyluce_align_get_only_loci_with_min_taxa removes too many alignments #225