Add PDI sampling support

RainierBarrett commented 3 years ago

This PR adds support for sampling from a molecular weight distribution given any two of [PDI, M_n, M_w], with some input validation. We analytically solved for the value of sigma in a gaussian distribution centered at M_n by integrating the continuous form of M_w. Right now, this does not exactly match the numerical examples used. Not sure why that is, but it is very close.

chrisjonesBSU commented 3 years ago

Couple of things:

Per the discussion today lets try using an analytical approach.
Can you move all of the PDI code into its own funciton that we can call within the init function? I think this will keep things nice and clean, and also, I imagine there are other places (e.g. planckton) we can use this PDI sampling code, so keeping it portable would be useful.

chrisjonesBSU commented 3 years ago

So far it looks like its working well. I did get some errors when passing a value for pdi and M_n while leaving M_w = None

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-4563c3631d4d> in <module>
      1 test_system = System(molecule="PEEK", para_weight=0.60,
      2                     density=1, n_compounds=20, sample_pdi=True, M_n=10., pdi=2,
----> 3                     forcefield=None)

<ipython-input-15-e52777e0d53c> in __init__(self, molecule, para_weight, density, n_compounds, polymer_lengths, forcefield, remove_hydrogens, epsilon, sample_pdi, pdi, M_n, M_w)
    215                     self.pdi = M_w / M_n
    216             mass_mean = M_n
--> 217             mass_var = M_n * (M_w - M_n) # from solving integrated gaussians
    218             # now sample from our distribution
    219             print(f'sampling {n_compounds} molecular weights...')

TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'

I think it was due to mass_var calling for M_w when it was equal to None. When M_w was solved for, the value was assigned to self.M_w, so M_w as passed was still equal to None

I was able to fix it by making sure self.M_w, self.M_n and self.pdi are always defined and updating everything afterwards to use the self. attributes.

I think it will be helpful to make these values as System class attributes. When using signac in the uli-flow repo, we can dump a bunch of the system information for each job into that job's .json file. We'll include the values of M_n, M_w and pdi in that info dump.

Here is what I changed:

        if sample_pdi:
            if isinstance(n_compounds, list):
                raise TypeError('n_compounds should be an integer when sample_pdi is True.')
            pdi_arg_sum = sum([x is not None for x in [pdi, M_n, M_w]])
            assert pdi_arg_sum >= 2, 'At least two of [pdi, M_n, M_w] must be given.'
            if pdi_arg_sum == 3:
                #special case, make sure that pdi = M_w / M_n
                assert pdi - (M_w/M_n) < self.epsilon, 'PDI value does not match M_n and M_w values.'
            else:
                # need to recover one of M_w or M_n or pdi
                if M_n is None:
                    self.M_n = M_w / pdi
                    self.M_w = M_w
                    self.pdi = pdi
                if M_w is None:
                    self.M_w = pdi * M_n
                    self.M_n = M_n
                    self.pdi = pdi
                if pdi is None:
                    self.pdi = M_w / M_n
                    self.M_w = M_w
                    self.M_n = M_n
            mass_mean = self.M_n
            mass_var = self.M_n * (self.M_w - self.M_n) # from solving integrated gaussians
            # now sample from our distribution
            print(f'sampling {n_compounds} molecular weights...')
            # TODO: make sure we don't sample any negative weights
            samples = np.round(np.random.normal(loc=mass_mean, 
                                                scale=np.sqrt(mass_var), 
                                                size=n_compounds)
                              ).astype(int)
            # get all unique lengths in increasing order
            self.polymer_lengths = sorted(list(set(samples)))
            # get count of each length
            self.n_compounds = [list(samples).count(x) for x in self.polymer_lengths]
            print(f'polymer_lengths: {self.polymer_lengths}, n_compounds: {self.n_compounds}')

That seems to work with specifying any combination of M_w, M_n, and pdi.

Other than that, everything looks good except for getting negative polymer lengths. We'll wait to merge in these changes since negative polymer lengths does break the system initialization.

chrisjonesBSU commented 3 years ago

Couple of things:

I'm still getting the same errors related to my previous comment. Parts of the code are calling for self.M_n, but its never actually being set as a class attribute when its passed in by the user. I offered a fix/suggestion in that comment.

<ipython-input-12-774b9e97a45b> in _recover_mass_dist(self, distribution)
    272         elif distribution.lower() == 'weibull':
    273             # get the shape parameter
--> 274             a = scipy.optimize.root(self._weibull_k_expression, args=(self.Mn, self.Mw), x0=1.)
    275             recovered_k = a['x']
    276             # get the scale parameter

AttributeError: 'System' object has no attribute 'Mn'

n_compounds is going to be entered as a list because of how the uli-flow repo is set up. Instead of throwing an error when sample_pdi = True and n_compounds is a list, lets just pull out the value from the list. The important thing is that we only have a single number for n_compounds when sample_pdi = True.

Before:

 if isinstance(n_compounds, list):
                raise TypeError('n_compounds should be an integer when sample_pdi is True.')

Instead, lets do something like this:

if isinstance(n_compounds int):
    self.n_compounds = n_compounds
elif isinstance(n_compounds, list) and len(n_compounds) == 1:
    self.n_compounds = n_compounds[0]
elif isinstance(n_compounds, list) and len(n_compounds) != 1:
    raise TypeError('n_compounds should be of length 1 when sample_pdi is True.')

Do we need the .csv and image files merged in? Maybe modify the PR to only include the changes made to the notebook. Also, eventually these changes will have to be added to simulate.py which is what will ultimately be used in this package.

RainierBarrett commented 3 years ago

Sorry, guess I mistakenly grabbed the .svg files. I guess in practice we don't need to have the .csv files committed and available, but then it might make more sense to just move those cells to a scratchwork book or something, so anyone can run the main notebook without issue. What do you think? In the meantime I'll get these other points sorted.

chrisjonesBSU commented 3 years ago

Sorry, guess I mistakenly grabbed the .svg files. I guess in practice we don't need to have the .csv files committed and available, but then it might make more sense to just move those cells to a scratchwork book or something, so anyone can run the main notebook without issue. What do you think? In the meantime I'll get these other points sorted.

Yeah, I think moving a lot of the pdi sampling work that generating the plots and csv files to a separate notebook would be a good idea. Having all of that will be helpful if we decide to go into detail about it when writing a paper or if we have to present about it. Also, I wonder if this PDI sampling code could eventually be implemented into mBuild...maybe as a plugin.

RainierBarrett commented 3 years ago

Okay, those commits should address your points of concern. However, before I merge this in, now that I'm running this I'm getting an error from Foyer saying there's multiple atom types being given. Sampling of the molar masses seems to have worked, but the next step is dying. Can you replicate this or is something in my install going haywire?

polymer_lengths: [19, 30, 31, 45, 62, 63, 64, 65, 68, 82], n_compounds: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

---------------------------------------------------------------------------
FoyerError                                Traceback (most recent call last)
<ipython-input-65-7ad801da21e3> in <module>
      1 test_system = System(molecule="PEEK", para_weight=0.60,
      2                     density=1, n_compounds=10, sample_pdi=True, M_n=50., M_w=55.,
----> 3                     forcefield='gaff')

<ipython-input-64-5b9435b53feb> in __init__(self, molecule, para_weight, density, n_compounds, polymer_lengths, forcefield, remove_hydrogens, epsilon, sample_pdi, pdi, M_n, M_w)
    237         self.system_mb = self._pack() # mBuild object before applying FF
    238         if self.forcefield:
--> 239             self.system_pmd = self._type_system() # parmed object after applying FF
    240 
    241 

<ipython-input-64-5b9435b53feb> in _type_system(self)
    273             forcefield = foyer.Forcefield(name='oplsaa')
    274 
--> 275         typed_system = forcefield.apply(self.system_mb)
    276         if self.remove_hydrogens: # not sure how to do this with Parmed yet
    277             removed_hydrogen_count = 0 # subtract from self.mass

~/foyer/foyer/forcefield.py in apply(self, structure, references_file, use_residue_map, assert_bond_params, assert_angle_params, assert_dihedral_params, assert_improper_params, combining_rule, verbose, *args, **kwargs)
    571                 structure = structure.to_parmed(**kwargs)
    572 
--> 573         typemap = self.run_atomtyping(structure, use_residue_map=use_residue_map, **kwargs)
    574 
    575         self._apply_typemap(structure, typemap)

~/foyer/foyer/forcefield.py in run_atomtyping(self, structure, use_residue_map, **kwargs)
    609                     if structure.residues[res_id].name not in residue_map.keys():
    610                         tmp_res = _structure_from_residue(res, structure)
--> 611                         typemap = find_atomtypes(tmp_res, forcefield=self)
    612                         residue_map[res.name] = typemap
    613 

~/foyer/foyer/atomtyper.py in find_atomtypes(structure, forcefield, max_iter)
     63 
     64     _iterate_rules(rules, structure, typemap, max_iter=max_iter)
---> 65     _resolve_atomtypes(structure, typemap)
     66 
     67     return typemap

~/foyer/foyer/atomtyper.py in _resolve_atomtypes(structure, typemap)
    131         elif len(atomtype) > 1:
    132             raise FoyerError("Found multiple types for atom {} ({}): {}.".format(
--> 133                 atom_id, atoms[atom_id].atomic_number, atomtype))
    134         else:
    135             raise FoyerError("Found no types for atom {} ({}).".format(

FoyerError: Found multiple types for atom 3631 (6): ['cv', 'ce'].

chrisjonesBSU commented 3 years ago

There are still a few bugs. I think I mentioned them in an earlier comment, but we need to make sure the class attributes for Mn, Mw, and pdi are always being set. This is because the code as written calls the class attributes in the _recover_mass_dist function.

For example:

a = scipy.optimize.root(self._weibull_k_expression, args=(self.Mn, self.Mw), x0=1.)

This calls for self.Mn; however, self.Mn is only being defined when its being solved for:

            if pdi_arg_sum == 3:
                #special case, make sure that pdi = Mw / Mn
                assert pdi - (Mw/Mn) < self.epsilon, 'PDI value does not match Mn and Mw values.'
            else:
                # need to recover one of Mw or Mn or pdi
                if Mn is None:
                    self.Mn = Mw / pdi
                if Mw is None:
                    self.Mw = pdi * Mn
                if pdi is None:
                    self.pdi = Mw / Mn # from here

So, if I pass in values for Mn and pdi then the script solves for and defines self.Mw, but never defines the variables self.Mn and self.pdi

Similarly, if I were to pass in values for all 3 of Mw, Mn and pdi, then none of the self.Mn self.Mw and self.pdi are defined.

Here is what needs to be added (you'll notice my changes under each of the If statements):

if pdi_arg_sum == 3:
    #special case, make sure that pdi = Mw / Mn
    assert pdi - (Mw/Mn) < self.epsilon, 'PDI value does not match Mn and Mw values.'
    self.pdi = pdi
    self.Mn = Mn
    self.Mw = Mw
else:
    # need to recover one of Mw or Mn or pdi
    if Mn is None:
        self.Mn = Mw / pdi
        self.Mw = Mw
        self.pdi = pdi
    if Mw is None:
        self.Mw = pdi * Mn
        self.Mn = Mn
        self.pdi = pdi
    if pdi is None:
        self.pdi = Mw / Mn # from here
        self.Mn = Mn
        self.Mw = Mw

Also, once this is working we'll have to add these changes to simulate.py since that is what is actually being used. The notebook is kind of a scratch pad and a temporary way of testing any changes until we get proper unit tests, but simulate.py is what is being used when submitting jobs on fry.

chrisjonesBSU commented 3 years ago

I'm going to close this, see PR #25.

There were just a few conflicts because of recent PRs made..for the sake of speeding this up I created my own branch, copied over the simulate.py file from this PR and fixed the conflicts. Sorry for being a bit janky but just wanted to move things along...also, we need some unit tests as soon as possible!

cmelab / polybinder

Add PDI sampling support #14