DiltheyLab / HLA-LA

Fast HLA type inference from whole-genome data
GNU General Public License v3.0

Indexing memory usage and other things #3

Closed serge2016 closed 7 years ago

serge2016 commented 7 years ago

Hello! I am running the indexing process inside a Docker container on a server with 30 GB RAM + 8 GB swap and 8 CPUs. Within the first 15 minutes of indexing, memory usage grew to nearly all of the RAM (29.3 of 29.5 GB) plus 4 GB of swap.

  1. I think it would be better to mention this somewhere in the prerequisites.
  2. Also, the process used only 1 of the 8 CPUs. Is it possible to use more?
  3. As I understand it, the indexing process generates 2 files (9.2 GB):
    -rw-r--r--  1 root  502 5491768320 Jun 19 17:41 serializedGRAPH
    -rw-r--r--  1 root  502 4197127944 Jun 19 17:41 serializedGRAPH_preGapPathIndex

    Does the program need all the files from the source PRG_MHC_GRCh38_withIMGT.tar.gz, or only the 2 new files?

The whole process took 2 hours and 20 minutes.

  4. I made a second run and got slightly different files:
    -rw-r--r--  1 root  502 5491768268 Jun 20 07:17 serializedGRAPH
    -rw-r--r--  1 root  502 4197127912 Jun 20 07:17 serializedGRAPH_preGapPathIndex

    How can I tell that indexing completed correctly?

AlexanderDilthey commented 7 years ago

Hi Serge,

  1. Good suggestion, will mention RAM usage.
  2. No, indexing is single-threaded -- but you only have to do it once.
  3. All files are needed.
  4. I don't think this process is necessarily deterministic in terms of file size. The test run on the NA12878 CRAM will tell you whether everything worked.
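
To put the size difference between the two runs in perspective, here is a minimal Python sketch that uses only the byte counts quoted in the `ls -l` listings above. The names `run1`, `run2` and `size_diffs` are illustrative and not part of HLA-LA. It only does the arithmetic and does not validate the index in any way.

```python
# Byte counts copied from the two `ls -l` listings in this thread.
run1 = {
    "serializedGRAPH": 5491768320,
    "serializedGRAPH_preGapPathIndex": 4197127944,
}
run2 = {
    "serializedGRAPH": 5491768268,
    "serializedGRAPH_preGapPathIndex": 4197127912,
}


def size_diffs(a, b):
    """Return {file name: (absolute byte difference, difference relative to a)}."""
    result = {}
    for name in a:
        delta = abs(a[name] - b[name])
        result[name] = (delta, delta / a[name])
    return result


diffs = size_diffs(run1, run2)
for name, (delta, rel) in diffs.items():
    print(f"{name}: {delta} bytes ({rel:.2e} of the run 1 size)")
```

The two runs differ by 52 and 32 bytes respectively, which is tiny relative to files of several gigabytes. Matching or nearly matching sizes do not prove the index is correct, though. As noted above, the test run on the NA12878 CRAM is the actual check.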