dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

step 3 slow #561

Closed ospfsg closed 3 months ago

ospfsg commented 5 months ago

Hi

I am running ipyrad v0.9.95 with 180 paired-end GBS samples, denovo assembly. In step 3, after "join unmerged pairs", the next stage, "clustering and mapping", is still at 0% after 19 hours. I am using 50 cores and almost 300 GB of RAM is in use (can go up to 512).

Samples were preprocessed with fastp to remove reads shorter than 80 bp, overrepresented sequences, and poly-G tails, to filter out quality below 20, and to remove adapters.
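For reference, that preprocessing corresponds roughly to a fastp invocation like the one below. This is a sketch, not the exact command used in this thread; the input/output filenames are placeholders, and the flag names are taken from fastp's documented options:

```
# Placeholder filenames; adjust per sample.
fastp \
  --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
  --out1 clean_R1.fastq.gz --out2 clean_R2.fastq.gz \
  --length_required 80 \
  --qualified_quality_phred 20 \
  --trim_poly_g \
  --overrepresentation_analysis \
  --detect_adapter_for_pe
```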

I have already set filter_adapters to the stricter setting (2) and phred_Qscore_offset to 33.

I started with this command: ipyrad -p params-lim_20240627.txt -s 1234567 -c 50 --MPI -t 100

Any suggestions on how to make this faster? osp

isaacovercast commented 5 months ago

Hello, thank you for sending such a detailed account of the issue, it's helpful. I will bet that the problem is '-t 100'. The -t argument specifies the number of threads per core for clustering; the default is 2. With -t 100 I bet it's just choking on spawning too many threads. Try leaving -t at the default.
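As a back-of-the-envelope check on the oversubscription, assuming (per the explanation above) that the thread count scales as cores times -t, the numbers from the original command work out like this. The value 50 is the -c from this thread, 100 is the -t, and 2 is the stated default:

```python
# Rough thread-count arithmetic under the assumption that total
# clustering threads scale as (cores) x (-t threads per core).
cores = 50

total_with_t100 = cores * 100   # as run: -c 50 -t 100
total_with_default = cores * 2  # with the default -t 2

print(total_with_t100)     # 5000 threads contending for 50 CPUs
print(total_with_default)  # 100 threads
```

Thousands of threads contending for 50 physical cores would spend much of their time on scheduling and contention rather than clustering, which is consistent with the stalled progress bar.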

ospfsg commented 5 months ago

Hi, I removed the -t but it is stuck at the same place... 17 hours at 0% in step 3, at "clustering and mapping" after "join unmerged pairs"...

I will keep this running for a while but I am going to try a subset in another server...

thank you osp

isaacovercast commented 5 months ago

How many raw reads per sample after step 1? How long are the reads? 150bp for R1/R2 or longer? There are many factors that can influence the time for step 3 including # of raw reads per sample, length of reads (particularly for paired-end data), genome size, size selection window, frequency of enzyme recognition sites. If you can give me more ideas about all these things I can help a bit, but at the end of the day sometimes things just take more time.
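To answer the read-count and read-length questions, one can inspect the demultiplexed FASTQ files directly. The helpers below are hypothetical, written for this sketch rather than part of ipyrad (ipyrad's own per-sample stats from step 1 should report read counts as well); they rely only on the FASTQ convention of four lines per record:

```python
import gzip

def count_fastq_reads(path):
    """Count reads in a FASTQ file (gzipped or plain): 4 lines per read."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for _ in fh) // 4

def max_read_length(path):
    """Return the longest sequence line (the 2nd line of each record)."""
    opener = gzip.open if path.endswith(".gz") else open
    longest = 0
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence lines
                longest = max(longest, len(line.rstrip("\n")))
    return longest
```

Running these over each sample's R1 and R2 files gives the per-sample raw read counts and read lengths asked about above.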

isaacovercast commented 3 months ago

Did you ever get step 3 to complete, or to run a bit faster? If you would like more ideas on performance as a function of the format of your data, I'm happy to help if you would like to re-open this issue.