ComparativeGenomicsToolkit / hal

Hierarchical Alignment Format

hal2mafMP.py speed issue #218

Open pskvins opened 3 years ago

pskvins commented 3 years ago

Hello,

I'm trying to convert two HAL files (the 241-way mammalian alignment update (V2) and the 363-way avian alignment) from https://cglgenomics.ucsc.edu/data/cactus/ to MAF files with hal2mafMP.py, but the program is too slow even with --numProc 92.

I used the command:

srun -c 48 -t 40-0 hal2mafMP.py /big/sukhwan/363-avian-2020.hal.1 /big/sukhwan/363_avian/53birds.maf --numProc 96 --splitBySequence --smallSize 10000000000 --targetGenomes Gallus_gallus,Coturnix_japonica,Meleagris_gallopavo,Tyto_alba,Buceros_rhinoceros,Anas_platyrhynchos,Apaloderma_vittatum,Calypte_anna,Cuculus_canorus,Charadrius_vociferus,Fulmarus_glacialis,Tauraco_erythrolophus,Opisthocomus_hoazin,Phoenicopterus_ruber,Columba_livia,Leptosomus_discolor,Merops_nubicus,Pelecanus_crispus,Phalacrocorax_carbo,Phaethon_lepturus,Pterocles_gutturalis,Nipponia_nippon,Egretta_garzetta,Pygoscelis_adeliae,Aptenodytes_forsteri,Cariama_cristata,Mesitornis_unicolor,Eurypyga_helias,Balearica_regulorum,Chlamydotis_macqueenii,Falco_cherrug,Falco_peregrinus,Aquila_chrysaetos,Haliaeetus_albicilla,Haliaeetus_leucocephalus,Corvus_brachyrhynchos,Corvus_cornix,Acanthisitta_chloris,Ficedula_albicollis,Serinus_canaria,Zonotrichia_albicollis,Geospiza_fortis,Taeniopygia_guttata,Pseudopodoces_humilis,Gavia_stellata,Antrostomus_carolinensis,Melopsittacus_undulatus,Colius_striatus,Picoides_pubescens,Struthio_camelus,Tinamus_guttatus

I added --splitBySequence --smallSize 10000000000 to avoid the --numProc problem described in https://github.com/ComparativeGenomicsToolkit/hal/issues/195.

The run currently produces about 1.2 GB of output every 10 minutes. Assuming the resulting MAF file will be about 6 TB, I estimate that it would take about 36 days to finish if I run:

srun -c 48 -t 40-0 hal2mafMP.py /big/sukhwan/363-avian-2020.hal.1 /big/sukhwan/363_avian/53birds.maf --numProc 96 --splitBySequence --smallSize 10000000000
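For reference, the estimate can be reproduced with the short sketch below. It assumes the write rate stays constant for the whole run, which may not hold in practice. The helper name `estimated_days` is made up for illustration and is not part of HAL. Both the decimal and binary readings of "6 TB" are shown, since the post does not say which one was meant.

```python
def estimated_days(total_gb: float, gb_per_interval: float, interval_min: float) -> float:
    """Days needed to write total_gb at a steady rate (hypothetical helper)."""
    intervals = total_gb / gb_per_interval      # number of 10-minute intervals
    return intervals * interval_min / (60 * 24) # minutes -> days


# Figures from the post: ~6 TB expected output, ~1.2 GB written every 10 minutes.
days_decimal = estimated_days(6 * 1000, 1.2, 10)  # assuming 1 TB = 1000 GB
days_binary = estimated_days(6 * 1024, 1.2, 10)   # assuming 1 TB = 1024 GB

print(f"{days_decimal:.1f} days (decimal TB), {days_binary:.1f} days (binary TB)")
```

Either reading gives roughly five weeks, in line with the figure of about 36 days and right at the edge of the 40-day limit requested with -t 40-0.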

Do you know of any way to speed this up? I would appreciate any help.