baoxingsong / dCNS

conserved non-coding sequence
MIT License

huge resource consumption #28

Open naturalstay opened 1 year ago

naturalstay commented 1 year ago

Hi, Dr. Song. Both of my genomes are around 350 Mb. I ran this command:

ls | awk '{print("dCNS cut1Gap -ra masked_CMJ_k20_57_cds.fa -qa masked_Fhi_k20_33.fa -i "$1" -r reference -o "$1".5")}' > command1

It generated 42,398 subcommands. I wrote a Python process pool (36 cores) to run them in parallel, but many of the subcommands consume a lot of memory and run for more than 15 minutes. Is this normal? Is it consistent with the software's algorithm? I would like to know your resource consumption for reference. Looking forward to your reply.

baoxingsong commented 1 year ago

Thanks. Yes, that is normal. dCNS is very sensitive, and thus computationally costly. The computational resource cost could be reduced by further optimizing the code. However, I do not have time to do that at this moment.

You could start with more processes initially, then iteratively rerun the failed jobs with fewer processes, so that each remaining job has more memory available.
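
A minimal sketch of that retry strategy, not part of dCNS itself. It assumes each line of the command file is a complete shell command and that a failed job (for example, one killed for running out of memory) exits with a non-zero status; whether dCNS always does so is an assumption. The function names and the worker schedule `(36, 12, 4, 1)` are hypothetical and should be tuned to the available RAM.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(command):
    """Run one shell command line and return (command, succeeded)."""
    # Assumption: a non-zero exit status means the job failed.
    result = subprocess.run(command, shell=True)
    return command, result.returncode == 0


def run_with_fewer_workers(commands, worker_counts=(36, 12, 4, 1)):
    """Run commands in rounds, retrying failures with fewer workers.

    worker_counts is a hypothetical schedule of pool sizes, one per round.
    Returns the commands that still failed after the last round.
    """
    pending = list(commands)
    for workers in worker_counts:
        if not pending:
            break
        # Threads are enough here: the heavy work happens in the
        # child processes, not in the Python interpreter.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(run_one, pending))
        # Keep only the failed commands for the next, smaller round.
        pending = [cmd for cmd, ok in results if not ok]
    return pending


# Example usage with the command file from the original post:
# with open("command1") as handle:
#     commands = [line.strip() for line in handle if line.strip()]
# still_failing = run_with_fewer_workers(commands)
# print(len(still_failing), "commands still failing")
```

Fewer concurrent jobs leave more memory for each one, which is why the later rounds may succeed where the first did not. It may also be worth deleting any partial output files from failed jobs before retrying.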