esolares / HapSolo

Reduction of Althaps and Duplicate Contigs for Improved Hi-C Scaffolding
GNU General Public License v2.0

Alternative to Slurm-arrays #2

Closed · sivico26 closed this issue 4 years ago

sivico26 commented 4 years ago

Hi, I was thinking of commenting on issue #1, but I think a new issue would be more appropriate for future reference.

First of all, thanks for developing this program; the promising results in the paper made me give it a try. That said, I think the Slurm-array approach to generating the Busco and Blat inputs (this probably translates to other queue systems as well, but I can only speak for Slurm) can be problematic for assemblies with a high number of scaffolds.

In my case I had roughly 13,000 sequences and the job limit on my cluster was 4,600, which meant I could split the job into 3 arrays. It also meant queuing 3 times for the same job. I underestimated this until I realized the queuing took longer than the actual computation; this of course depends on how heavily your server/cluster is used by the community. I also had to constantly monitor each array's progress to pick the best moment to launch the next one and minimize the queued time.

There are other IT issues too. On my Slurm system the per-user job limit is that same 4,600, which meant that if I launched an array of that size I could not launch any other jobs. This eases as the array's jobs get processed, but it also meant I could not submit the Busco array and the Blat array at the same time.

I eventually asked my system administrator why the limit was 4,600. He told me that the Slurm scheduler has to consider every queued job to decide when to allocate resources, and it recalculates constantly as new jobs come in, statuses change, or resources become available. The scheduler slows down in proportion to the number of queued jobs and can even collapse if there are too many; that is why there is a per-user job limit and a global job limit (around 15,000 on my cluster).

Overall, I ended up worrying about a lot of IT details just to speed up my results, instead of thinking about my research.

So I wanted to propose an alternative that I applied myself. I adapted the Busco and Blat sbatch scripts to use GNU Parallel. The main benefits that I got:

So I propose GNU Parallel as an alternative to Slurm arrays, if arrays remain the default. I would be happy to contribute my parallel scripts if that helps; a minimal sketch of the idea is below. It is worth mentioning that there are alternatives to parallel, though.
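
As a rough illustration of what I mean (not the exact script I adapted; the file layout and BLAT options here are assumptions), the Slurm array can be replaced by a single GNU Parallel call over pre-split contig FASTAs:

```bash
#!/bin/bash
# Minimal sketch: run BLAT once per contig with GNU Parallel instead of a
# Slurm array. Assumes the assembly was already split into one FASTA per
# contig under contigs/ (paths and BLAT options are illustrative).
export ASM=assembly.fasta            # full assembly used as the BLAT database
mkdir -p psl
ls contigs/*.fasta | \
  parallel --jobs "${SLURM_CPUS_PER_TASK:-8}" \
    'blat "$ASM" {} psl/{/.}.psl -noHead'
```

This keeps everything inside a single allocation, so only one job waits in the queue regardless of how many contigs there are.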

I saw the minimap2 alternative to Blat is in the works (by the way, how is that going? Is it ready?), but parallel would still make sense for the Busco part. However, I wonder if that could be avoided as well: couldn't we run Busco once on the whole assembly and parse the output to generate a score for every scaffold? It would be equivalent to running it one by one, and faster. I wrote a parser not long ago that updates the Busco score of a subassembly based on the score of the full assembly; I think it would work for, or could be extended to, individual scaffolds. Something similar could probably be done for the Quast part.
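
To sketch the kind of parsing I have in mind (illustrative only; it assumes a genome-mode full_table.tsv in which column 2 is the status and column 3 is the contig name, which may differ between BUSCO versions):

```bash
# Count complete (and duplicated) BUSCOs per scaffold from one
# whole-assembly BUSCO run; check your BUSCO version's column layout first.
grep -v '^#' full_table.tsv | \
  awk -F'\t' '$2 == "Complete" || $2 == "Duplicated" {n[$3]++}
              END {for (s in n) print s "\t" n[s]}' | \
  sort -k2,2nr > busco_hits_per_scaffold.tsv
```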

Well, this got longer than I expected, but I hope you find some of it useful. I really would like to know your thoughts on this.

esolares commented 4 years ago

Hi,

My apologies for the late reply. I did not realize I had to manually turn on notifications in GitHub...

Thank you for your thoughtful comments and recommendations. Yes, GNU Parallel can be used; I tried it initially but found SLURM ran much faster for my use case. Obviously not all of us have the same computational resources available, so I will implement your suggestion. Yes, the preprocessing can take a while, BLAT being the longest step; the BUSCO runs are fairly short, though.

Early on I did try what you recommend and ran BUSCO only once on the entire genome, but found it was reporting duplicate BUSCOs in contigs that did not contain BUSCO hits to begin with. I resolved this by running BUSCO on each individual contig, as doing so gives us more information than running BUSCO on the whole assembly.
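
For reference, one way to produce the per-contig FASTAs that the per-contig runs rely on (a sketch only; it assumes samtools is available and is not the repository's own preprocessing script):

```bash
# Split an assembly into one FASTA per sequence using samtools faidx.
mkdir -p contigs
samtools faidx assembly.fasta
cut -f1 assembly.fasta.fai | while read -r SEQ; do
  samtools faidx assembly.fasta "$SEQ" > "contigs/${SEQ}.fasta"
done
```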

With respect to using minimap2, it actually works pretty well. I did have to tweak the alignment settings, though. I will post the alignment settings that worked best for me on the minimap2 branch later today. The minimap2 results were on par with those from BLAT, albeit slightly different. I will probably update the paper to include those results and implement minimap2 as an option in the main branch.
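
As a generic starting point (not the tuned settings mentioned above, which will go on the minimap2 branch), an assembly self-alignment with a stock minimap2 preset looks something like this:

```bash
# Illustrative only -- not the tuned HapSolo settings. All-vs-all
# self-alignment of the assembly with a standard assembly preset.
minimap2 -x asm20 -t 16 assembly.fasta assembly.fasta > self_align.paf
```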

Please let me know if you have any more issues or suggestions. Now with notifications on, I should be able to reply much faster.

Thank you,

Edwin

sivico26 commented 4 years ago

Hi! Thanks for the reply.

Good to know that you were already familiar with parallel. If Slurm arrays are faster, I totally get it, although I do not think that can be the case once you have to divide the job into several arrays and account for the queued time.

I did not know about that issue with Busco, I will take a look to see if it happens with my data as well.

Regarding minimap2, I will give it a try. Do you have any comments on the differences between the three options you suggest for minimap2 alignment?

As a suggestion, it would be useful to add a link to the preprint in the README, maybe a "how to cite" section.

I do have problems running HapSolo that I suspect are abnormal, but I will post them in another issue.

Regards

esolares commented 4 years ago

Hi,

Yes, I was able to run it with SLURM fairly quickly on XSEDE PSC Bridges and SDSC Comet, as well as on my own cluster. I ended up building my own cluster to run and test my own jobs. I can see now that in many cases it may be better to use GNU Parallel. I will add the GNU Parallel scripts that worked for me there. There is also parallel BLAT (pblat).

I believe the middle one worked best, but that was on mosquito. I think the minimap2 parameters need further tweaking, but it's a good starting place. You could just use the 2nd set of alignment parameters and go with that. I used those for the VGP talk I gave a few weeks ago on mosquito, and the results were comparably good.

Yes, please post any issues you may be having. The more bugs that are caught the better.

Thank you again for your suggestions.

Edwin

sivico26 commented 4 years ago

Thank you. Depending on how we manage issue #3, I will implement your suggestion about minimap2.

Sivico

faguil commented 3 years ago

Hi guys,

I want to give HapSolo a try and compare the results with Purge_Haplotigs on my genome assembly. I am having problems running HapSolo (sbatch_blat.sh and sbatch_busco.sh), but I saw this post and am wondering if sivico26 could share the scripts for running BLAT and BUSCO with GNU Parallel.

Best,

Felipe

esolares commented 3 years ago

Hi. I have a script that you can use for running BLAT with GNU Parallel, since each alignment only takes a single core. For BUSCO, I recommend running a for loop. I also have that script available somewhere; I can try to get it to you by tomorrow. If I find it earlier I will reply to you here.
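
In the meantime, a for loop over per-contig FASTAs would look roughly like this (a sketch only, assuming a BUSCO v3-style run_BUSCO.py and contig FASTAs already split under contigs/; the lineage path is a placeholder):

```bash
# Per-contig BUSCO runs in a simple loop; adjust lineage and flags as needed.
LINEAGE=/path/to/lineage_odb        # placeholder lineage dataset directory
for CTG in contigs/*.fasta; do
  NAME=$(basename "$CTG" .fasta)
  run_BUSCO.py -i "$CTG" -o "busco_${NAME}" -l "$LINEAGE" -m genome -c 1
done
```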

faguil commented 3 years ago

HapSolo sounds promising and I really want to give it a try. Thanks for your reply and your willingness to share your scripts with me. If it makes it easier, my email is faguilera@udec.cl

Best,

Felipe

esolares commented 3 years ago

Hi,

I have created 3 new scripts for running blat and busco using GNU Parallel. Please let me know if you get any errors. The scripts are located in the scripts folder and are:

bash_gnuparallelblat.sh: runs BLAT using GNU Parallel

bash_gnuparallelbusco.sh: runs Quast and BUSCO in GNU Parallel, using bash_quastbusco.sh as a helper script

Thank you,

Edwin