haoyueshuai opened this issue 4 years ago
Since the new version of the pipeline uses memory that the current memory profiler cannot attribute, I will be updating the Docker image so we can test the pipeline with other memory profilers, such as filprofiler and guppy/heapy, to see whether they can identify the memory usage more clearly.
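For reference, a minimal sketch of taking a heap snapshot with heapy (assuming the guppy3 package; the toy allocation is illustrative, not pipeline code):

```python
# Hedged sketch of a heapy heap snapshot (guppy3); not pipeline code.
from guppy import hpy

h = hpy()
h.setrelheap()                                      # measure relative to this point
payload = [list(range(1000)) for _ in range(1000)]  # toy allocation standing in for real work
print(h.heap())                                     # per-type breakdown of objects allocated since setrelheap()
```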
@tfabiha are you saying the memory profile and bottleneck you identified using the Python script is not what causes the huge memory usage of the pipeline, and that you think some other lines of code outside the Python script but inside the pipeline are causing the trouble? Everything in the Python script should be "identifiable" by the current memory profiler, right? I'm trying to understand what exactly the problem is that we are dealing with.
@gaow @tfabiha is talking about this jump here from 127.8 MiB to 148.9 MiB (line 114 -> 115): only 2.5 MiB is captured in the Increment column, and where the other 18.6 MiB goes is not clear:
111 # Calculate the LD matrix based on unrelated individuals
112 127.8 MiB 0.0 MiB print(f'{time.strftime("%H:%M:%S", t)}: Calculating LD matrix ...')
113
114 127.8 MiB 0.0 MiB unr_iid=list() # create a list to store iid in the unrelated file
115 148.9 MiB 2.5 MiB for i in unr.IID.items():
116 148.9 MiB 0.3 MiB unr_iid.append(str(i[1]))
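One thing worth trying here (a hedged sketch, assuming `unr` is the pandas DataFrame read from the unrelated-samples file): build the list with a vectorized conversion instead of looping over `Series.items()`, which allocates a Python tuple per row.

```python
import pandas as pd

# Toy stand-in for the unrelated-sample table; the real `unr` comes from pd.read_csv.
unr = pd.DataFrame({"IID": [1000123, 1000456, 1000789]})

# Same result as the loop at profiler lines 114-116, without creating a tuple per row.
unr_iid = unr.IID.astype(str).tolist()
```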
@tfabiha good catch! I never noticed that. Looking forward to hearing more about it! Thanks @haoyueshuai for clarifying.
Thank you @tfabiha for your hard work even during midterm weeks!
Hi @gaow, for now @tfabiha has reduced the total memory use of that function (extract_region) from 4.1 MiB to 1.1 MiB for the MWE, and the line that used the most memory now only uses 0.2 MiB:
107 115.8 MiB 0.2 MiB iid_unr = rg_fam.index.intersection(pd.Index(unr.IID)) #order based on rg_fam
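For context, a toy sketch of the index-intersection idea behind profiler line 107 (the data below is made up, just to show the shape of the operation):

```python
import pandas as pd

# Made-up fam table and unrelated-sample table.
rg_fam = pd.DataFrame({"sex": [1, 2, 1]},
                      index=pd.Index(["ID_1", "ID_2", "ID_3"], name="iid"))
unr = pd.DataFrame({"IID": ["ID_2", "ID_3", "ID_9"]})

# Keep only unrelated samples present in the genotype fam index,
# ordered according to rg_fam.index (the "#order based on rg_fam" comment above).
iid_unr = rg_fam.index.intersection(pd.Index(unr.IID))
```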
However, when I tested it with the real data on the cluster, it still did not work.
126 96.9 MiB 96.9 MiB @profile
127 def main():
128 # Load the file of summary statistics and standardize it.
129 97.5 MiB 0.6 MiB sumstats = read_sumstat('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/phenotypes_BMI.fastGWA.snp_stats.gz', '.' if False else None)
130
131 # Load phenotype file
132 98.0 MiB 0.5 MiB pheno = pd.read_csv('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/phenotypes.txt', header=0, delim_whitespace=True, quotechar='"')
133 # Load unrelated sample file
134 128.4 MiB 30.4 MiB unr = pd.read_csv('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/unrelated_samples.txt', header=0, delim_whitespace=True, quotechar='"')
135
136 # Load genotype file for the region of interest
137 128.4 MiB 0.0 MiB geno_inventory = dict([x.strip().split() for x in open('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/genotype_inventory.txt').readlines() if x.strip()])
138 128.4 MiB 0.0 MiB chrom = "21"
139 128.4 MiB 0.0 MiB if chrom.startswith('chr'):
140 chrom = chrom[3:]
141 128.4 MiB 0.0 MiB if chrom not in geno_inventory:
142 geno_file = geno_inventory['0']
143 else:
144 128.4 MiB 0.0 MiB geno_file = geno_inventory[chrom]
145 128.4 MiB 0.0 MiB import os
146 128.4 MiB 0.0 MiB if not os.path.isfile(geno_file):
147 # relative path
148 128.4 MiB 0.0 MiB if not os.path.isfile('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/' + geno_file):
149 raise ValueError(f"Cannot find genotype file {geno_file}")
150 else:
151 128.4 MiB 0.0 MiB geno_file = '/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/' + geno_file
152 128.4 MiB 0.0 MiB if geno_file.endswith('.bed'):
153 plink = True
154 from pandas_plink import read_plink
155 geno = read_plink(geno_file)
156 128.4 MiB 0.0 MiB elif geno_file.endswith('.bgen'):
157 128.4 MiB 0.0 MiB plink = False
158 128.6 MiB 0.2 MiB from pybgen import PyBGEN
159 129.1 MiB 0.5 MiB bgen = PyBGEN(geno_file)
160 129.1 MiB 0.0 MiB sample_file = geno_file.replace('.bgen', '.sample')
161 129.1 MiB 0.0 MiB if not os.path.isfile(sample_file):
162 129.1 MiB 0.0 MiB if not os.path.isfile('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/imputed_genotypes.sample'):
163 raise ValueError(f"Cannot find the matching sample file ``{sample_file}`` for ``{geno_file}``.\nYou can specify path to sample file for all BGEN files using ``--bgen-sample-path``.")
164 else:
165 129.1 MiB 0.0 MiB sample_file = '/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/imputed_genotypes.sample'
166 129.1 MiB 0.0 MiB bgen_fam = pd.read_csv(sample_file, header=0, delim_whitespace=True, quotechar='"',skiprows=1)
167 129.1 MiB 0.0 MiB bgen_fam.columns = ['fid','iid','missing','sex']
168 129.1 MiB 0.0 MiB geno = [bgen,bgen_fam]
169 else:
170 raise ValueError('Please provide the genotype files in PLINK binary format or BGEN format')
171
172 138.5 MiB 9.4 MiB rg_info = extract_region((int(chrom), 48096251, 48114083), sumstats, geno, pheno, unr, plink)
173 138.7 MiB 0.2 MiB rg_info['stats'].to_csv('/home/hs863/MWE/region_extract/21_48096251_48114083/phenotypes_BMI.fastGWA.snp_stats_21_48096251_48114083.sumstats.gz', sep = "\t", header = True, index = True)
174 138.8 MiB 0.1 MiB rg_info['geno'].to_csv('/home/hs863/MWE/region_extract/21_48096251_48114083/phenotypes_BMI.fastGWA.snp_stats_21_48096251_48114083.genotype.gz', sep = "\t", header = True, index = True)
175 138.8 MiB 0.0 MiB rg_info['pld'].to_csv('/home/hs863/MWE/region_extract/21_48096251_48114083/phenotypes_BMI.fastGWA.snp_stats_21_48096251_48114083.population_ld.gz', sep = "\t", header = True, index = True)
176 138.8 MiB 0.0 MiB rg_info['sld'].to_csv('/home/hs863/MWE/region_extract/21_48096251_48114083/phenotypes_BMI.fastGWA.snp_stats_21_48096251_48114083.sample_ld.gz', sep = "\t", header = True, index = True)
Line # Mem usage Increment Line Contents
================================================
24 64.7 MiB 64.7 MiB @profile
25 def read_in():
26 # Load the file of summary statistics and standardize it.
27 930.8 MiB 866.1 MiB sumstats = read_sumstat('/gpfs/gibbs/pi/dewan/data/UKBiobank/results/FastGWA_results/results_imputed_data/T2D_091120/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_030720_T2D.fastGWA.snp_stats.gz', '/home/hs863/project/UKBB_GWAS_DEV/data/fastGWA_template.yml' if False else None)
28
29 # Load phenotype file
30 971.5 MiB 40.7 MiB pheno = pd.read_csv('/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_030720', header=0, delim_whitespace=True, quotechar='"')
31 # Load unrelated sample file
32 972.0 MiB 0.5 MiB unr = pd.read_csv('/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620', header=0, delim_whitespace=True, quotechar='"')
33 # Load genotype file for the region of interest
34 972.4 MiB 0.4 MiB geno_inventory = dict([x.strip().split() for x in open('/gpfs/gibbs/pi/dewan/data/UKBiobank/results/UKBB_bgenfilepath.txt').readlines() if x.strip()])
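Side note on the 866.1 MiB increment at line 27: if `read_sumstat` ultimately goes through `pandas.read_csv`, restricting the read to just the needed columns with explicit dtypes often lowers the peak considerably. A hedged sketch with made-up column names (the real fastGWA output may differ):

```python
import pandas as pd

# Illustrative column names and dtypes; the real fastGWA columns may differ.
usecols = ["CHR", "POS", "SNP", "A1", "A2", "BETA", "SE", "P"]
dtypes = {"CHR": "category", "POS": "int32", "SNP": "string",
          "A1": "category", "A2": "category",
          "BETA": "float32", "SE": "float32", "P": "float32"}

sumstats = pd.read_csv("snp_stats.gz", delim_whitespace=True,
                       usecols=usecols, dtype=dtypes)
```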
Thank you @tfabiha !
I guess the next step is still to figure out which step causes the crash on the cluster. @haoyueshuai could you take the Python script directly for a region that crashed, reserve an interactive computing node on the cluster with the largest memory you can possibly obtain, and then run the profiler on that real-data region, so we can hopefully complete it and see the actual memory usage? This will help us see why it happens. Sometimes an alternative solution has even greater space complexity than the original solution, so what helps on the MWE might not help, or might even be worse, on real data ...
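If we stay with memory_profiler for that run, here is a small sketch of how the line report could be sent to a file on disk rather than job stdout (note this still requires the profiled function to finish, so it will not help if the job is killed mid-run):

```python
from memory_profiler import profile

# Send the line-by-line report to a log file instead of stdout.
fp = open("extract_region_mem.log", "w+")

@profile(stream=fp)
def run_region():
    # Placeholder workload standing in for the real extract_region(...) call.
    return sum([0.0] * 5_000_000)

if __name__ == "__main__":
    run_region()
```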
@gaow Thank you! Considering your idea, we can try to find a region that did not crash and ran successfully last time, and run it again with the memory profiler, since we cannot get the memory profile if the program crashes in the middle of the run. I am also not sure whether I can get more than 60G of memory on an interactive node; we will try and see. Besides, I can try running the Python script in a Jupyter notebook to see which line makes it crash. @tfabiha just let me know that she is going to start working on the cluster soon; I will help her get set up ASAP so both of us can work on it. Hopefully, with all of this, we can get some clue about what the real problem is.
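For the notebook experiment, a cheap complementary check (a sketch, assuming psutil is available) is to print the process RSS at a few checkpoints to localize the blow-up without waiting for a full line-by-line profile:

```python
import os
import psutil

def rss_mib():
    """Resident set size of the current process, in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

print(f"before load: {rss_mib():.1f} MiB")
data = [0.0] * 2_000_000                # placeholder for e.g. the read_sumstat(...) call
print(f"after load:  {rss_mib():.1f} MiB")
```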
Hi @tfabiha, here is the discussion of the memory issue in our region extraction pipeline.
[ ] Fabiha: To reproduce both versions, see whether comparing them gives any new ideas, and try to reduce the memory use further.
[ ] Fabiha: To figure out how memory profiling works and how Python manages memory, to see whether the memory reported accumulates across all the steps or is partly released between steps (see the sketch after this list). We should also keep in mind that relative measures, such as the percentage of reduction, are more informative than an absolute change in memory.
[x] Haoyue: Since Yale's cluster is under maintenance, I will try to figure out whether we can work on Columbia's cluster and also give you some of the real data from there; hopefully you can get some idea of the real data and see where more memory might be needed when the data scales up.
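Regarding whether memory is released between steps (the sketch mentioned in the second item above): the RSS that memory_profiler reports does not necessarily drop when Python objects are freed, because the interpreter and the C allocator may keep the freed pages for reuse rather than return them to the OS. A small illustration, assuming psutil is available:

```python
import gc
import os
import psutil

def rss_mib():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

print(f"start:        {rss_mib():.1f} MiB")
blob = [list(range(100)) for _ in range(200_000)]
print(f"allocated:    {rss_mib():.1f} MiB")
del blob
gc.collect()
# RSS may not return to the start value even though the objects are gone.
print(f"after del/gc: {rss_mib():.1f} MiB")
```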