haoyueshuai opened this issue 4 years ago
Since the new version of the pipeline uses memory that the current memory profiler cannot attribute, I will be updating the Docker image so we can test the pipeline with other memory profilers, such as filprofiler and guppy/heapy, to see whether they can identify the memory usage more clearly.
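For reference, a minimal sketch of taking a heap snapshot with heapy (assuming the guppy3 package; the toy allocation is illustrative, not pipeline code):

```python
# Hedged sketch of a heapy heap snapshot (guppy3); not pipeline code.
from guppy import hpy

h = hpy()
h.setrelheap()                                      # measure relative to this point
payload = [list(range(1000)) for _ in range(1000)]  # toy allocation standing in for real work
print(h.heap())                                     # per-type breakdown of objects allocated since setrelheap()
```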
@tfabiha are you saying the memory profile and bottleneck you identified using the Python script is not what causes the huge memory usage of the pipeline, and that you think some other lines of code outside the Python script but inside the pipeline are causing the trouble? Everything in the Python script should be "identifiable" by the current memory profiler, right? I'm trying to understand what exactly the problem is that we are dealing with.
@gaow @tfabiha is talking about this jump here from 127.8 MiB to 148.9 MiB (line 114 -> 115): only 2.5 MiB is captured in the Increment column, and where the other 18.6 MiB goes is not clear:
111 # Calculate the LD matrix based on unrelated individuals
112 127.8 MiB 0.0 MiB print(f'{time.strftime("%H:%M:%S", t)}: Calculating LD matrix ...')
113
114 127.8 MiB 0.0 MiB unr_iid=list() # create a list to store iid in the unrelated file
115 148.9 MiB 2.5 MiB for i in unr.IID.items():
116 148.9 MiB 0.3 MiB unr_iid.append(str(i[1]))
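One thing worth trying here (a hedged sketch, assuming `unr` is the pandas DataFrame read from the unrelated-samples file): build the list with a vectorized conversion instead of looping over `Series.items()`, which allocates a Python tuple per row.

```python
import pandas as pd

# Toy stand-in for the unrelated-sample table; the real `unr` comes from pd.read_csv.
unr = pd.DataFrame({"IID": [1000123, 1000456, 1000789]})

# Same result as the loop at profiler lines 114-116, without creating a tuple per row.
unr_iid = unr.IID.astype(str).tolist()
```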
@tfabiha good catch! I never noticed that. Looking forward to hearing more about it! Thanks @haoyueshuai for clarifying.
Thank you @tfabiha for your hard work even during midterm weeks!
Hi @gaow, for now @tfabiha has reduced the total memory use of that function (extract_region) from 4.1 MiB to 1.1 MiB for the MWE, and the line that used the most memory now only uses 0.2 MiB:
107 115.8 MiB 0.2 MiB iid_unr = rg_fam.index.intersection(pd.Index(unr.IID)) #order based on rg_fam
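For context, a toy sketch of the index-intersection idea behind profiler line 107 (the data below is made up, just to show the shape of the operation):

```python
import pandas as pd

# Made-up fam table and unrelated-sample table.
rg_fam = pd.DataFrame({"sex": [1, 2, 1]},
                      index=pd.Index(["ID_1", "ID_2", "ID_3"], name="iid"))
unr = pd.DataFrame({"IID": ["ID_2", "ID_3", "ID_9"]})

# Keep only unrelated samples present in the genotype fam index,
# ordered according to rg_fam.index (the "#order based on rg_fam" comment above).
iid_unr = rg_fam.index.intersection(pd.Index(unr.IID))
```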
However, when I tested it with the real data on the cluster, it still did not work.
126 96.9 MiB 96.9 MiB @profile
127 def main():
128 # Load the file of summary statistics and standardize it.
129 97.5 MiB 0.6 MiB sumstats = read_sumstat('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/phenotypes_BMI.fastGWA.snp_stats.gz', '.' if False else None)
130
131 # Load phenotype file
132 98.0 MiB 0.5 MiB pheno = pd.read_csv('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/phenotypes.txt', header=0, delim_whitespace=True, quotechar='"')
133 # Load unrelated sample file
134 128.4 MiB 30.4 MiB unr = pd.read_csv('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/unrelated_samples.txt', header=0, delim_whitespace=True, quotechar='"')
135
136 # Load genotype file for the region of interest
137 128.4 MiB 0.0 MiB geno_inventory = dict([x.strip().split() for x in open('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/genotype_inventory.txt').readlines() if x.strip()])
138 128.4 MiB 0.0 MiB chrom = "21"
139 128.4 MiB 0.0 MiB if chrom.startswith('chr'):
140 chrom = chrom[3:]
141 128.4 MiB 0.0 MiB if chrom not in geno_inventory:
142 geno_file = geno_inventory['0']
143 else:
144 128.4 MiB 0.0 MiB geno_file = geno_inventory[chrom]
145 128.4 MiB 0.0 MiB import os
146 128.4 MiB 0.0 MiB if not os.path.isfile(geno_file):
147 # relative path
148 128.4 MiB 0.0 MiB if not os.path.isfile('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/' + geno_file):
149 raise ValueError(f"Cannot find genotype file {geno_file}")
150 else:
151 128.4 MiB 0.0 MiB geno_file = '/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/' + geno_file
152 128.4 MiB 0.0 MiB if geno_file.endswith('.bed'):
153 plink = True
154 from pandas_plink import read_plink
155 geno = read_plink(geno_file)
156 128.4 MiB 0.0 MiB elif geno_file.endswith('.bgen'):
157 128.4 MiB 0.0 MiB plink = False
158 128.6 MiB 0.2 MiB from pybgen import PyBGEN
159 129.1 MiB 0.5 MiB bgen = PyBGEN(geno_file)
160 129.1 MiB 0.0 MiB sample_file = geno_file.replace('.bgen', '.sample')
161 129.1 MiB 0.0 MiB if not os.path.isfile(sample_file):
162 129.1 MiB 0.0 MiB if not os.path.isfile('/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/imputed_genotypes.sample'):
163 raise ValueError(f"Cannot find the matching sample file ``{sample_file}`` for ``{geno_file}``.\nYou can specify path to sample file for all BGEN files using ``--bgen-sample-path``.")
164 else:
165 129.1 MiB 0.0 MiB sample_file = '/gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/imputed_genotypes.sample'
166 129.1 MiB 0.0 MiB bgen_fam = pd.read_csv(sample_file, header=0, delim_whitespace=True, quotechar='"',skiprows=1)
167 129.1 MiB 0.0 MiB bgen_fam.columns = ['fid','iid','missing','sex']
168 129.1 MiB 0.0 MiB geno = [bgen,bgen_fam]
169 else:
170 raise ValueError('Please provide the genotype files in PLINK binary format or BGEN format')
171
172 138.5 MiB 9.4 MiB rg_info = extract_region((int(chrom), 48096251, 48114083), sumstats, geno, pheno, unr, plink)
173 138.7 MiB 0.2 MiB rg_info['stats'].to_csv('/home/hs863/MWE/region_extract/21_48096251_48114083/phenotypes_BMI.fastGWA.snp_stats_21_48096251_48114083.sumstats.gz', sep = "\t", header = True, index = True)
174 138.8 MiB 0.1 MiB rg_info['geno'].to_csv('/home/hs863/MWE/region_extract/21_48096251_48114083/phenotypes_BMI.fastGWA.snp_stats_21_48096251_48114083.genotype.gz', sep = "\t", header = True, index = True)
175 138.8 MiB 0.0 MiB rg_info['pld'].to_csv('/home/hs863/MWE/region_extract/21_48096251_48114083/phenotypes_BMI.fastGWA.snp_stats_21_48096251_48114083.population_ld.gz', sep = "\t", header = True, index = True)
176 138.8 MiB 0.0 MiB rg_info['sld'].to_csv('/home/hs863/MWE/region_extract/21_48096251_48114083/phenotypes_BMI.fastGWA.snp_stats_21_48096251_48114083.sample_ld.gz', sep = "\t", header = True, index = True)
Line # Mem usage Increment Line Contents
================================================
24 64.7 MiB 64.7 MiB @profile
25 def read_in():
26 # Load the file of summary statistics and standardize it.
27 930.8 MiB 866.1 MiB sumstats = read_sumstat('/gpfs/gibbs/pi/dewan/data/UKBiobank/results/FastGWA_results/results_imputed_data/T2D_091120/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_030720_T2D.fastGWA.snp_stats.gz', '/home/hs863/project/UKBB_GWAS_DEV/data/fastGWA_template.yml' if False else None)
28
29 # Load phenotype file
30 971.5 MiB 40.7 MiB pheno = pd.read_csv('/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_030720', header=0, delim_whitespace=True, quotechar='"')
31 # Load unrelated sample file
32 972.0 MiB 0.5 MiB unr = pd.read_csv('/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620', header=0, delim_whitespace=True, quotechar='"')
33 # Load genotype file for the region of interest
34 972.4 MiB 0.4 MiB geno_inventory = dict([x.strip().split() for x in open('/gpfs/gibbs/pi/dewan/data/UKBiobank/results/UKBB_bgenfilepath.txt').readlines() if x.strip()])
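Side note on the 866.1 MiB increment at line 27: if `read_sumstat` ultimately goes through `pandas.read_csv`, restricting the read to just the needed columns with explicit dtypes often lowers the peak considerably. A hedged sketch with made-up column names (the real fastGWA output may differ):

```python
import pandas as pd

# Illustrative column names and dtypes; the real fastGWA columns may differ.
usecols = ["CHR", "POS", "SNP", "A1", "A2", "BETA", "SE", "P"]
dtypes = {"CHR": "category", "POS": "int32", "SNP": "string",
          "A1": "category", "A2": "category",
          "BETA": "float32", "SE": "float32", "P": "float32"}

sumstats = pd.read_csv("snp_stats.gz", delim_whitespace=True,
                       usecols=usecols, dtype=dtypes)
```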
Thank you @tfabiha !
I guess the next step is still to figure out which step causes the crash on the cluster. @haoyueshuai could you take the Python script directly for a region that crashed, reserve an interactive computing node on the cluster with the largest memory you can possibly obtain, and then run the profiler on that real-data region, so we can hopefully complete it and see the actual memory usage? This will help us see why it happens. Sometimes an alternative solution has even greater space complexity than the original solution, so what helps on the MWE might not help, or might even be worse, on real data ...
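If we stay with memory_profiler for that run, here is a small sketch of how the line report could be sent to a file on disk rather than job stdout (note this still requires the profiled function to finish, so it will not help if the job is killed mid-run):

```python
from memory_profiler import profile

# Send the line-by-line report to a log file instead of stdout.
fp = open("extract_region_mem.log", "w+")

@profile(stream=fp)
def run_region():
    # Placeholder workload standing in for the real extract_region(...) call.
    return sum([0.0] * 5_000_000)

if __name__ == "__main__":
    run_region()
```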
@gaow Thank you! Considering your idea, we can try to find a region that did not crash and ran successfully last time, and run it again with the memory profiler, since we cannot get the memory profile if the program crashes in the middle of the run. I am also not sure whether I can get more than 60G of memory on an interactive node; we will try and see. Besides, I can try running the Python script in a Jupyter notebook to see which line makes it crash. @tfabiha just let me know that she is going to start working on the cluster soon; I will help her get set up ASAP so both of us can work on it. Hopefully, with all of this, we can get some clue about what the real problem is.
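For the notebook experiment, a cheap complementary check (a sketch, assuming psutil is available) is to print the process RSS at a few checkpoints to localize the blow-up without waiting for a full line-by-line profile:

```python
import os
import psutil

def rss_mib():
    """Resident set size of the current process, in MiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

print(f"before load: {rss_mib():.1f} MiB")
data = [0.0] * 2_000_000                # placeholder for e.g. the read_sumstat(...) call
print(f"after load:  {rss_mib():.1f} MiB")
```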
Hi @tfabiha, here is the discussion of the memory issue in our region extraction pipeline.
[ ] Fabiha: To reproduce both versions, see whether comparing them gives any new ideas, and try to reduce the memory use further.
[ ] Fabiha: To figure out how memory profiling works and how Python manages memory, to see whether the memory reported accumulates across all the steps or is partly released between steps (see the sketch after this list). We should also keep in mind that relative measures, such as the percentage of reduction, are more informative than an absolute change in memory.
[x] Haoyue: Since Yale's cluster is under maintenance, I will try to figure out whether we can work on Columbia's cluster and also give you some of the real data from there; hopefully you can get some idea of the real data and see where more memory might be needed when the data scales up.
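Regarding whether memory is released between steps (the sketch mentioned in the second item above): the RSS that memory_profiler reports does not necessarily drop when Python objects are freed, because the interpreter and the C allocator may keep the freed pages for reuse rather than return them to the OS. A small illustration, assuming psutil is available:

```python
import gc
import os
import psutil

def rss_mib():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

print(f"start:        {rss_mib():.1f} MiB")
blob = [list(range(100)) for _ in range(200_000)]
print(f"allocated:    {rss_mib():.1f} MiB")
del blob
gc.collect()
# RSS may not return to the start value even though the objects are gone.
print(f"after del/gc: {rss_mib():.1f} MiB")
```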