HuMingLab / SnapHiC

SnapHiC: Single Nucleus Analysis Pipeline for Hi-C Data
GNU General Public License v3.0

Questions about resource consumption #1

Closed skelviper closed 3 years ago

skelviper commented 3 years ago

Hi! Great job! This is a really useful tool for scHi-C data, and I would like to use it in my research. Here are my questions:

  1. The RWR step seems to use too much memory. Is there any way to optimise it? For example, I only have 2 nodes with 96 threads and about 400 GB of memory, and it takes forever to process thousands of cells.

  2. When I set

    parallelism="threaded" number_of_processors=10

     snapHiC still takes all 96 of my threads, and I can hardly do anything else on the machine. Any idea why?

  3. Since this pipeline takes so long, would it be possible to release a snakemake version of step one instead of the existing mpirun version? Snakemake has really nice features, and you don't need to rerun everything after an error.

  4. This is just a suggestion. I'm using dip-c + hickit + hg19 for my pipeline, and my pairs file looks like this:

    ## pairs format v1.0
    #sorted: chr1-chr2-pos1-pos2
    #shape: upper triangle
    #chromosome: 1 249250621
    #chromosome: 2 243199373
    #chromosome: 3 198022430
    #chromosome: 4 191154276
    #chromosome: 5 180915260
    #chromosome: 6 171115067
    #chromosome: 7 159138663
    #chromosome: 8 146364022
    #chromosome: 9 141213431
    #chromosome: 10 135534747
    #chromosome: 11 135006516
    #chromosome: 12 133851895
    #chromosome: 13 115169878
    #chromosome: 14 107349540
    #chromosome: 15 102531392
    #chromosome: 16 90354753
    #chromosome: 17 81195210
    #chromosome: 18 78077248
    #chromosome: 19 59128983
    #chromosome: 20 63025520
    #chromosome: 21 48129895
    #chromosome: 22 51304566
    #chromosome: X 155270560
    #columns: readID chr1 pos1 chr2 pos2 strand1 strand2 phase0 phase1
    . 1 724716 1 224200208 + + . .
    . 1 725830 1 224203159 + + . .
    . 1 726868 1 1491588 + + . .
    . 1 758626 1 794330 + + . .
    . 1 818997 1 1487155 + + . .
    . 1 843085 1 849002 + + . .
    . 1 847442 1 193311696 + + . .

Maybe consider adding support for the hickit pipeline (removing lines that start with # and supporting chromosome names like 1 instead of chr1). But that's okay, it's not difficult to convert them.
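For anyone else with hickit output, the conversion described above can be sketched roughly as follows. This is only a minimal illustration, not part of snapHiC: the function names `add_chr_prefix` and `convert_hickit_pairs` are made up for this example, and it assumes the chromosome names sit in the second and fourth columns (as in the `columns:` line above) and that tab-separated output is acceptable.

```python
def add_chr_prefix(name):
    """Return the chromosome name with a 'chr' prefix, adding it only if missing."""
    return name if name.startswith("chr") else "chr" + name


def convert_hickit_pairs(lines, chrom_cols=(1, 3)):
    """Drop '#' header lines and prefix the chromosome columns with 'chr'.

    chrom_cols holds the 0-based indices of the chromosome columns; (1, 3)
    is an assumption based on the 'readID chr1 pos1 chr2 pos2 ...' layout.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the pairs header
        fields = line.split()
        for i in chrom_cols:
            fields[i] = add_chr_prefix(fields[i])
        yield "\t".join(fields)


# Small in-memory example taken from the header and first record above.
example = [
    "## pairs format v1.0",
    "#chromosome: 1 249250621",
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2 phase0 phase1",
    ".\t1\t724716\t1\t224200208\t+\t+\t.\t.",
]
converted = list(convert_hickit_pairs(example))
print(converted[0])
```

For a real file, the same generator can be fed an open file handle and the yielded lines written to a new file, one per line.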

Thanks!

armenabnousi commented 3 years ago

Thank you for your feedback!

  1. We are working on optimizing the random walk step, or possibly approximating it, but unfortunately we don't have a clear timeline for when this will be done.
  2. Setting "number_of_processors" to 10 makes snapHiC distribute the data among 10 processes to run the random walk. Each process uses numpy to compute the RWR. Depending on your installation of numpy, it might automatically try to use all cores on your system. Please see this QA. You should be able to prevent this behaviour with: export OPENBLAS_MAIN_FREE=1 or export OPENBLAS_NUM_THREADS=1 (note: no spaces around the = sign).
  3. We will consider releasing a snakemake version, but the current version already supports the feature you mention in the RWR step. If your job terminates for some reason (due to an error, or because you stop it) and you restart the run, previously computed random walk values will not be computed again; the random walk step will continue where it left off.
  4. Thank you for this suggestion. This is very helpful; we will add support for different input file formats in future releases.
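As a rough illustration of point 2, the same thread limit can also be applied from inside Python, provided it happens before numpy is first imported. This is a sketch rather than anything snapHiC does itself: the helper name `limit_blas_threads` is hypothetical, and the `OMP_NUM_THREADS` and `MKL_NUM_THREADS` variables are included only on the assumption that numpy may be linked against an OpenMP-based or MKL backend instead of OpenBLAS.

```python
import os


def limit_blas_threads(n=1):
    """Cap the threads used by common BLAS backends for this process.

    Has to be called before numpy is imported for the first time, because
    the backend reads these variables when it is loaded.
    """
    for var in ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n)


limit_blas_threads(1)
# import numpy as np   # import numpy only after the limit is in place

print(os.environ["OPENBLAS_NUM_THREADS"])
```

With the limit set to 1 and number_of_processors=10, the total number of busy threads should stay close to 10, assuming numpy's BLAS backend was the source of the extra threads.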