Lucy-Forrest-Lab / HDXer

HDXer is a package to compute Hydrogen-Deuterium exchange data from biomolecular simulations, compare to experiment, and perform ensemble refinement to fit a structrual ensemble to the experimental data
BSD 3-Clause "New" or "Revised" License
17 stars 6 forks source link

Error in re-weighting step for multichain proteins #7

Closed gilbertovgx closed 1 year ago

gilbertovgx commented 1 year ago

Hello, I'm having issues in the reweighting step because my protein have two chains. At first I ran the notebook '03_reweigthin' and realized I got 'nan' for all segments in the second chain in the _final_segment_fractions.dat file. I realized that my Hbonds and Contacts .tmp files are with '0' and '1', like "Hbonds_chain_0res.tmp' and Hbonds_chain_1res*.tmp" , and its only reading the '0' index one as in lines 433-444 in HDXer/HDXer/reweighting.py script:

I tried to run like this specifying tmp prefix for both chains:

reweight_object = MaxEnt(do_reweight=True, do_params=False, stepfactor=0.00001) reweight_object.run(gamma=basegamma, data_folders=folders, kint_file=rates, exp_file=expt, times=times, restart_interval=100, out_prefix=f'reweighting_gamma1x10^{exponent}', hbonds_prefix=('Hbonds_chain_0res','Hbonds_chain_1res'), contacts_prefix=['Contacts_chain_0res','Contacts_chain_1res'])

But I get this error:

ValueError Traceback (most recent call last) Cell In[64], line 28 25 basegamma = 10**exponent 27 reweight_object = MaxEnt(do_reweight=True, do_params=False, stepfactor=0.00001) ---> 28 reweight_object.run(gamma=basegamma, data_folders=folders, kint_file=rates, exp_file=expt, times=times,\ 29 restart_interval=100, out_prefix=f'reweighting_gamma1x10^{exponent}',\ 30 hbonds_prefix=('Hbonds_chain_0res','Hbonds_chain_1res'),\ 31 contacts_prefix=('Contacts_chain_0res','Contacts_chain_1res')) 34 print(f'Reweighting for gamma = 1x10^{exponent} completed') 36 # Help text describing options and how to call the reweighting functions 37 # is available in the docstrings of the MaxEnt class, e.g.: 38 #help(MaxEnt) 39 #help(MaxEnt.run)

File ~/HDXer/HDXer/reweighting.py:970, in MaxEnt.run(self, gamma, resultsobj, analysisobj, restart, **run_params) 968 else: 969 try: --> 970 self.setup_no_runobj(self.runparams['data_folders'], 971 self.runparams['kint_file'], 972 self.runparams['exp_file'], 973 self.runparams['times']) 974 except KeyError: 975 raise HDX_Error("Missing parameters to set up a reweighting run.\n" 976 "Please ensure a restart or calc_hdx object is provided," 977 "or provide the following arguments to the run() call: " 978 "data_folders, kint_file, exp_file, times")

File ~/HDXer/HDXer/reweighting.py:96, in MaxEnt.setup_no_runobj(self, folderlist, kint_file, expt_file_path, times) 78 self.runvalues = {} 80 maxentvalues = { 'contacts' : None, 81 'hbonds' : None, 82 'minuskt' : None, (...) 94 'curriter' : None, 95 'is_converged' : None } ---> 96 _contacts, _hbonds, _sorted_resids = read_contacts_hbonds(folderlist, 97 self.runparams['contacts_prefix'], 98 self.runparams['hbonds_prefix']) 99 if self.runparams['do_subsample']: 100 _contacts, _hbonds = subsample_contacts_hbonds(_contacts, _hbonds, 101 self.runparams['sub_start'], 102 self.runparams['sub_end'], 103 self.runparams['sub_interval'])

File ~/HDXer/HDXer/reweighting_functions.py:110, in read_contacts_hbonds(folderlist, contacts_prefix, hbonds_prefix) 108 for r, f in list(zip(resids, filters)): 109 new_resids.append(np.array(r)[f]) --> 110 new_resids = np.stack(new_resids) 111 if not np.diff(new_resids, axis=0).sum(): # If sum of differences between filtered resids == 0 112 pass

File <__array_function__ internals>:200, in stack(*args, **kwargs)

File ~/.conda/envs/HDXER_ENV/lib/python3.8/site-packages/numpy/core/shape_base.py:464, in stack(arrays, axis, out, dtype, casting) 462 shapes = {arr.shape for arr in arrays} 463 if len(shapes) != 1: --> 464 raise ValueError('all input arrays must have the same shape') 466 result_ndim = arrays[0].ndim + 1 467 axis = normalize_axis_index(axis, result_ndim)

ValueError: all input arrays must have the same shape

How can I specify more than one chain when running the MaxEnt.

Your help will be really helpful. Thanks,

fabsugar commented 1 year ago

Hi, is it your protein homodimeric or heterodimeric? Namely, are your chains identical or different? The reweighting code was originally designed to analyze protein chains independently. For a homomeric complex you can reweight a trajectory using each chain individually. This assumes that residue 1 in chain 1 = residue 1 in chain 2 etc. This is however not correct for a heteromeric complex. Namely, if the sequence of the chains are different, the reweighting code does not support the specification of multiple chains yet (we are planning to add this functionality). Actually, the reweighting code does not read the chain index from the "exp_file" and it assumes it is an experimental time point, so please remove the chain index in the "exp_file" if you are doing reweighting. If you are working with a heterodimeric complex, the only options for now are: 1) Analyze each protein chain independently. 2) Assume the protein is a single chain; i.e. if the last residue in chain 1 is n, change your residue numbering so that residue 1 from chain 2 is residue n+1 from chain 1. Change also the numbering in "exp_file" and "seg_file" accordingly. Hope it helps, Fabrizio

gilbertovgx commented 1 year ago

Hi, Thanks for your response. My protein is in fact an heterodimeric complex, so I will go with "option 2" then. Thanks for also clarifying that the reweighting code will assume the chain index as time point in the "exp_file", this is very important.