faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Memory issue - phyluce_probe_remove_duplicate_hits_from_probes_using_lastz #252

Open karlyhiggins opened 3 years ago

karlyhiggins commented 3 years ago

I am trying to create a probe set for Symbiodinium but am stuck on the duplicate-removal step because of a memory issue. I was able to complete the run using a selection of Sgou+3.temp.probes-TO-SELF-PROBES.lastz at 3,851 loci, but the majority were removed as duplicates, which did not leave many loci for probes. I am now trying to expand the run using Sgou+2.temp.probes-TO-SELF-PROBES.lastz at 11,650 loci, where the majority will likely also be removed as duplicates, but our HPC does not offer enough memory to remove the duplicates from this 300 GB file. Do you have any suggestions for working around this?

phyluce_probe_remove_duplicate_hits_from_probes_using_lastz --fasta /home/khiggins/medusozoa/Sgou+2.temp.probes --lastz /home/khiggins/medusozoa/Sgou+2.temp.probes-TO-SELF-PROBES.lastz --probe-prefix=uce-

brantfaircloth commented 3 years ago

I don't have great solutions. One thing you might try is to split the probe file into pieces (e.g. your 300 GB file into five 60 GB files), run lastz on the split pieces against each other, and remove duplicates from those. Then align the duplicate-removed baits to one another with lastz and go through another round of duplicate removal. That will at least reduce the file sizes.
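
For the splitting step, a minimal sketch in Python might look like the following. It is not part of phyluce: the function name `split_fasta` and the `.partN` output naming are hypothetical, and it assumes the probe file is ordinary FASTA in which every record starts with a `>` header line. It streams the file line by line, so it should not need to hold the whole probe set in memory.

```python
from pathlib import Path


def split_fasta(fasta_path, n_pieces, out_dir):
    """Deal FASTA records round-robin into n_pieces files (hypothetical helper).

    Assumes each record begins with a '>' header line. Records are never
    split across files, and the input is read one line at a time.
    """
    fasta_path = Path(fasta_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Output names such as "Sgou+2.temp.probes.part1" are an assumed convention.
    out_paths = [out_dir / f"{fasta_path.name}.part{i + 1}" for i in range(n_pieces)]
    handles = [path.open("w") for path in out_paths]
    try:
        record_index = -1
        with fasta_path.open() as infile:
            for line in infile:
                if line.startswith(">"):
                    # A new header starts the next record.
                    record_index += 1
                if record_index >= 0:
                    # Sequence lines follow their header into the same piece.
                    handles[record_index % n_pieces].write(line)
    finally:
        for handle in handles:
            handle.close()
    return out_paths
```

Calling `split_fasta("Sgou+2.temp.probes", 5, "split_probes")` would write five pieces into a `split_probes` directory (a hypothetical name). Each piece would then go through the lastz and duplicate-removal steps described above. Round-robin assignment is only one choice here. It keeps the pieces close to the same size without counting the records first.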