faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

remove_duplicate_hits_from_probes stuck for days #242

Open sihellem opened 2 years ago

sihellem commented 2 years ago

Hi Brant,

First, thanks for this amazing resource and the accompanying tutorials; they are really helpful.

I am now stuck at the duplicate-removal step, 'phyluce_probe_remove_duplicate_hits_from_probes_using_lastz'. After 1-2 days, the job has still produced neither output nor logs.

This is probably because the file resulting from 'phyluce_probe_easy_lastz --identity 50 --coverage 50' is enormous (~88 GB).

Is there anything I can do besides wait?

To give a bit more context: I have been designing probes for our current project using 4 insect genomes (phylogenetically very distant), and I am finding a really huge number of loci (though maybe this is normal):

Loci shared by Base + 0 taxa: 1,884,627.0
Loci shared by Base + 1 taxa: 1,884,627.0
Loci shared by Base + 2 taxa: 424,868.0
Loci shared by Base + 3 taxa: 175,535.0

I opted for the 'Base + 3 taxa' set, used 180 bp of buffering, and set --tiling-density 2.

Thanks in advance for any input!

brantfaircloth commented 2 years ago

Hello. I think the problem is very likely the size of the file and the number of baits created. This number of loci seems pretty high, so it would probably be best to reduce it in some way before designing baits to capture them. Typically, I design baits targeting 1,000-5,000 loci.
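One simple way to get from ~175,000 loci down to that range is to subsample the locus list before bait design. The Python sketch below is only an illustration under stated assumptions: it is not part of phyluce, the function name `subsample_loci` is hypothetical, the input is assumed to be a text file with one locus per line (for example a BED-like file), and uniform random sampling is just one possible reduction strategy. Filtering on biological criteria may well be preferable.

```python
import random


def subsample_loci(lines, target=5000, seed=1):
    """Return at most `target` locus lines, chosen at random, in original order.

    `lines` is any iterable of strings; each non-empty, non-comment line is
    assumed (hypothetically) to describe exactly one locus.
    """
    # Drop blank lines and '#' comment/header lines.
    loci = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    if len(loci) <= target:
        # Nothing to reduce; return every locus unchanged.
        return loci
    # A fixed seed makes the subsample reproducible between runs.
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(loci)), target))
    # Preserve the original (e.g. coordinate-sorted) order of the kept loci.
    return [ln for i, ln in enumerate(loci) if i in keep]


def subsample_file(in_path, out_path, target=5000, seed=1):
    """Read loci from `in_path` and write the subsample to `out_path`."""
    with open(in_path) as infile:
        subset = subsample_loci(infile, target=target, seed=seed)
    with open(out_path, "w") as outfile:
        outfile.writelines(subset)
    return len(subset)
```

Calling `subsample_file` with the path to the locus file and an output path (both chosen by the user) writes the reduced set, which could then be used in place of the full 'Base + 3 taxa' set for the downstream design steps.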