faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Keep duplicates "contigs hitting multiple probes" #328

Open sbabbbie opened 4 months ago

sbabbbie commented 4 months ago

Hi there, I have a duplicates file from the --keep-duplicates flag. However, I'm confused about how to automate retrieving the contigs that map to multiple UCEs. Because I am working with very small genomes, many of my UCEs seem to be close enough together that the assembled contigs cover multiple UCEs. I would still like to include these loci in my downstream analysis rather than just dropping them, but I'm not sure of the most efficient way to do this. I see you have scripts for the opposite issue (phyluce_assembly_parse_duplicates_file.py retrieves contigs under "probes hitting multiple contigs" rather than "contigs hitting multiple probes", which is what I need). I've tried editing this script to look at contigs hitting multiple probes instead, but I just keep getting blank output files.

Would appreciate any advice!
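For reference, here is a minimal sketch of pulling the "contigs hitting multiple probes" entries out of a duplicates file with plain Python, without modifying the phyluce script. The file layout is an assumption, not something confirmed in this thread: the sketch expects INI-style headers of the form "[<taxon> - contigs hitting multiple probes]" followed by "<contig>:<locus>, <locus>" lines. The function name is hypothetical and not part of phyluce. Check a few lines of your own file before relying on it.

```python
from collections import defaultdict

# Section label quoted in this thread; the rest of the layout is ASSUMED.
SECTION_SUFFIX = "contigs hitting multiple probes"


def parse_contigs_hitting_multiple_probes(lines):
    """Return {taxon: {contig: [locus, ...]}} for the sections of interest.

    ASSUMED layout: "[<taxon> - <suffix>]" headers, each followed by
    "<contig>:<locus>, <locus>" lines. Other sections are skipped.
    """
    results = defaultdict(dict)
    taxon = None  # None while we are outside a section of interest
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            header = line[1:-1]
            if header.endswith(SECTION_SUFFIX):
                # Drop the suffix and the " - " separator to get the taxon name.
                taxon = header[: -len(SECTION_SUFFIX)].rstrip(" -")
            else:
                taxon = None
            continue
        if taxon is None or ":" not in line:
            continue
        # Everything before the first colon is treated as the contig name.
        contig, _, loci = line.partition(":")
        results[taxon][contig.strip()] = [
            name.strip() for name in loci.split(",") if name.strip()
        ]
    return dict(results)
```

To use it, pass an open file handle, e.g. `with open("duplicates.txt") as fh: hits = parse_contigs_hitting_multiple_probes(fh)`, where "duplicates.txt" stands in for whatever path was given to --keep-duplicates.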

brantfaircloth commented 4 months ago

Howdy,

What types of data are you inputting? If loci are proximate to one another in the assemblies you have, it might be worthwhile to consider following the "harvesting loci from genomes" approach (e.g. Tutorial 3) and reducing the distance sliced from the "core" of each UCE locus identified (within a given contig). Then, input those genome slices to the normal approach.

Just keep in mind that if the loci are VERY proximate to one another, you are not getting an independent-ish draw from the genome.
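To make the slicing idea concrete, here is a minimal sketch in plain Python of cutting a window around each UCE core on a single contig, plus a check for neighbouring loci whose windows overlap at a given flank size. It is not phyluce's implementation. The function names, the (locus, start, end) tuple layout, and the 0-based half-open coordinates are all assumptions made for illustration.

```python
def slice_loci(contig_seq, matches, flank):
    """Cut core +/- flank for each locus on one contig.

    matches: list of (locus, start, end) tuples giving the core match
    in 0-based, half-open coordinates (an ASSUMED convention).
    """
    slices = {}
    for locus, start, end in matches:
        # Clamp the window so it never runs off either end of the contig.
        left = max(0, start - flank)
        right = min(len(contig_seq), end + flank)
        slices[locus] = contig_seq[left:right]
    return slices


def overlapping_pairs(matches, flank):
    """Return neighbouring locus pairs whose flanked windows overlap."""
    ordered = sorted(matches, key=lambda m: m[1])
    pairs = []
    for (a, _, a_end), (b, b_start, _) in zip(ordered, ordered[1:]):
        if a_end + flank > b_start - flank:
            pairs.append((a, b))
    return pairs
```

Reducing `flank` until `overlapping_pairs` returns an empty list gives non-overlapping slices for that contig. Loci that only separate at very small flank sizes are the "VERY proximate" ones that the caveat about independence applies to.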

sbabbbie commented 4 months ago

Thank you, that's a very useful suggestion! I am working with contigs assembled in SPAdes from raw next-gen sequencing data, trying to identify which UCEs I have represented. Luckily I have many UCEs from all over the genome and they are not ALL very proximate to one another, but there are definitely some that are close enough together that they're getting assembled into one contig and then hitting multiple probes. It messes up my analysis to have them all dropped, since I end up with an underestimate of locus representation across taxa. I will try the harvesting loci from genomes approach and see if that solves my issue!

brantfaircloth commented 4 months ago

Another option would be to switch to guided assembly of your contigs based on the probe sequence (e.g. as in aTRAM or itero), but that might not work so well if your reads are divergent from your baits/loci.