IAlibay opened this issue 2 years ago
What kinds of inputs do we want to support? I was thinking we would want to support a CSV file, some sort of dictionary that has all the bits we need, and maybe a networkX graph:

- **csv**: Legacy format; people will probably want to use their old CSVs to check whether there are any regressions in our changes (rough sketch below).
- **dictionary**: A Python dictionary with all the bits we need to build the networkx graph. This could be useful for users that don't want to write out a CSV or want to use the API directly.
- **networkX graph**: Could be useful for users that really know what they are doing but want to use our analysis tooling.
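For concreteness, here is a rough sketch of what the first two input routes could look like. The CSV layout follows the legacy arsenic convention (an experimental block followed by a calculated block), though the exact headers may differ; the dictionary structure and the `from_dict`/`from_networkx` names are illustrative assumptions, not a settled API.

```python
# Hypothetical sketch -- constructor names are assumptions, not an implemented API.

# Legacy-style CSV (layout per the old arsenic README; exact headers may differ):
#   # Experimental block
#   # Ligand, expt_DDG, expt_dDDG
#   ligand_A, -8.93, 0.10
#   ligand_B, -9.11, 0.10
#   # Calculated block
#   # Ligand1, Ligand2, calc_DDG, calc_dDDG(MBAR), calc_dDDG(additional)
#   ligand_A, ligand_B, 0.36, 0.11, 0.0

# A dictionary carrying the same information could look something like:
network_data = {
    "experimental": {
        "ligand_A": {"DG": -8.93, "dDG": 0.10},   # kcal/mol
        "ligand_B": {"DG": -9.11, "dDG": 0.10},
    },
    "calculated": [
        {"from": "ligand_A", "to": "ligand_B", "DDG": 0.36, "dDDG": 0.11},
    ],
}

# fe_map = FEMap.from_csv("my_data.csv")        # legacy CSV route
# fe_map = FEMap.from_dict(network_data)        # hypothetical dict route
# fe_map = FEMap.from_networkx(existing_graph)  # hypothetical graph route
```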
I think all three would be great, but as a priority CSV + network object, since that's how we'll interact with it short term.
I also think the dictionary will take some iteration/time to figure out the best structure, and it may be best to wait until we get the rest of the API figured out.
RE: using a networkx directed graph, the docs say:

> DiGraphs hold directed edges. Self loops are allowed but multiple (parallel) edges are not.

We have talked about parallel edges being allowed; will we need to use a different networkx object to store our network?
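For reference (this is plain networkx, not the proposed arsenic API), the parallel-edge limitation is easy to see, and `nx.MultiDiGraph` lifts it:

```python
import networkx as nx

# nx.DiGraph collapses parallel edges: adding A->B twice just updates the one edge.
g = nx.DiGraph()
g.add_edge("ligand_A", "ligand_B", DDG=0.36)
g.add_edge("ligand_A", "ligand_B", DDG=0.41)  # overwrites the previous edge data
assert g.number_of_edges() == 1

# nx.MultiDiGraph keeps both measurements, each under its own edge key.
mg = nx.MultiDiGraph()
mg.add_edge("ligand_A", "ligand_B", DDG=0.36)
mg.add_edge("ligand_A", "ligand_B", DDG=0.41)
assert mg.number_of_edges() == 2
```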
Will we always have a 1:1 mapping of simulated data to experimental data? Or will we have a case where one set is a super- or subset of the other? Like, could there be a case where we have experimental data for ligand A and ligand C, but we also simulate ligand B, which is part of the A->B->C network but we don't have experimental data for it?
> Like, could there be a case where we have experimental data for ligand A and ligand C but we also simulate ligand B, which is part of the A->B->C network but we don't have experimental data for it?
Yes 100% we should expect that we have fewer experimental points than computed points. That's pretty much the standard use case - i.e. you have affinity data for a few ligands and then you do FEPs based on these initial ligands to attempt to find a better binder before the next round of synthesis.
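A minimal sketch of that situation with plain networkx (attribute names and the ligand_C value are purely illustrative): only A and C carry experimental values, while B only appears through calculated edges.

```python
import networkx as nx

g = nx.DiGraph()
# Experimental data only exists for A and C (values in kcal/mol, illustrative)
g.add_node("ligand_A", exp_DG=-8.93, exp_dDG=0.10)
g.add_node("ligand_C", exp_DG=-9.50, exp_dDG=0.10)
# ligand_B is only present through calculated edges
g.add_edge("ligand_A", "ligand_B", calc_DDG=0.36, calc_dDDG=0.11)
g.add_edge("ligand_B", "ligand_C", calc_DDG=-0.52, calc_dDDG=0.12)

# Analysis code then has to tolerate nodes with no experimental annotation:
ligands_with_expt = [n for n, d in g.nodes(data=True) if "exp_DG" in d]
assert "ligand_B" not in ligands_with_expt
```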
This is what I am thinking for the FEMap API. I included a bit for plotting and stats, with the important part being that someone could use just part of arsenic and then use another tool of their choice. For example, they may just want to use our stats methods, so being able to pull out a pandas data frame that they can use in another tool would be good for interoperability.
```python
import arsenic
# Three 'main' module name spaces
import arsenic.network
import arsenic.stats
import arsenic.viz
# Main python entry point for making the FE Network
fe_map = arsenic.network.FEMap()
# Python API
# Add some experimental data
# Do we need a "expt" namespace on these values?
fe_map.add_node("ligand_A", DeltaG=-8.93*unit.kilocalories_per_mole, dDeltaG=0.10*unit.kilocalories_per_mole)
fe_map.add_node("ligand_B", DeltaG=-9.11*unit.kilocalories_per_mole, dDeltaG=0.10*unit.kilocalories_per_mole)
# Add some simulated data
# Dropping the units to save some space here
fe_map.add_edge("ligand_A", "ligand_B", calc_DDG=0.36, calc_dDDG_MBAR=0.11, calc_dDDG_additional=0.0)
# Loops work, not sure what data we want to store
fe_map.add_edge("ligand_A", "ligand_A")
# network is directional, so we can do A->B, B->A checks
fe_map.add_edge("ligand_B", "ligand_A", calc_DDG=0.36, calc_dDDG_MBAR=0.11, calc_dDDG_additional=0.0)
# For absolute free energy, the environment is needed for bookkeeping
# Under the hood we will name the node ligand_H_Complex, and ligand_H_Solvent
fe_map.add_node("ligand_H", DDG=-8.83, dDDG=0.10, environment="Complex")
fe_map.add_node("ligand_H", DDG=-8.83, dDDG=0.10, environment="Solvent")
# We can also create a network from a CSV file
fe_map = arsenic.network.FEMap.from_csv("my_data.csv")
# Calculate statistics
fe_stats = arsenic.stats.analyze_map(fe_map)
fe_stats.calculate("all") # ['RMSE', 'MUE', 'R2', 'rho','RAE','KTAU']
my_data_frame = fe_stats.to_pandas() # returns pandas data frame
# Plotting
plot = arsenic.viz.make_plot(backend="matplotlib")
# Add plotting options
options = {'method_name': 'softwarename',
           'target_name': 'made up protein',
           'color': 'hotpink',
           'guidelines': False}
# Make plot
plot.draw(fe_map, **options)
```
@richardjgowers let me know what you think!
I'm not sure if we should just use edge/node or do something like `add_experiment` and `add_calculation`.
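Purely as a strawman (the `add_experiment`/`add_calculation` names come straight from the comment above, and `FEMapStrawman` is invented here for illustration), the more explicit variant might look like this, with a `nx.MultiDiGraph` doing the bookkeeping underneath:

```python
import networkx as nx

class FEMapStrawman:
    """Hypothetical: explicit verbs instead of generic add_node/add_edge."""

    def __init__(self):
        self._graph = nx.MultiDiGraph()

    def add_experiment(self, ligand, DG, dDG):
        # experimental absolute values live on nodes
        self._graph.add_node(ligand, exp_DG=DG, exp_dDG=dDG)

    def add_calculation(self, ligand_a, ligand_b, DDG, dDDG):
        # calculated relative values live on (possibly parallel) edges
        self._graph.add_edge(ligand_a, ligand_b, calc_DDG=DDG, calc_dDDG=dDDG)

fe_map = FEMapStrawman()
fe_map.add_experiment("ligand_A", DG=-8.93, dDG=0.10)               # kcal/mol
fe_map.add_calculation("ligand_A", "ligand_B", DDG=0.36, dDDG=0.11)
```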
So you're using string labels, but I'm assuming any hashable works, i.e. we could be dropping in gufe systems, but also that's not required.
I'm guessing that under the hood there's a `nx.MultiDiGraph` actually doing the data management.
I'm tempted to say that maybe the measurements (calculated or exptl) should be objects themselves (somewhere in this repo there are ExperimentalResult and RelativeResult objects). These could handle a few quirky cases like:
So something like `g.add_edge(systemA, systemB, RelativeResult(...))`.
I would say you could be clever and use `__float__` to make these fancy objects quack like floats, but you'll still always want the uncertainty, so it's probably better to make them quack like a `pint.Measurement`.
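For illustration, `pint` already bundles a value with an uncertainty via `pint.Measurement` (created with `Quantity.plus_minus`, which needs the `uncertainties` package installed) and propagates errors through arithmetic:

```python
import pint

ureg = pint.UnitRegistry()

# A value with an uncertainty; requires the `uncertainties` package.
ddg_ab = (0.36 * ureg.kilocalorie / ureg.mol).plus_minus(0.11)
ddg_bc = (-0.52 * ureg.kilocalorie / ureg.mol).plus_minus(0.12)

# Errors propagate automatically when combining measurements.
ddg_ac = ddg_ab + ddg_bc
print(ddg_ac)                         # value with combined uncertainty, in kcal/mol
print(ddg_ac.value, ddg_ac.error)     # the nominal value and its uncertainty
```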
Outsider (TAC) comment: (is `pint.Measurement` doing the same?); having the opportunity to do automatic error propagation is nice.

Okay, minor changes: using a class for the results, and using `deltaG`, `deltadeltaG`, and `variance` instead of the old convention here: https://github.com/OpenFreeEnergy/arsenic#terminology
```python
from openff.units import unit
import arsenic
# Three 'main' module name spaces
import arsenic.network
import arsenic.stats
import arsenic.viz
# Main python entry point for making the FE Network
fe_map = arsenic.network.FEMap()
# Python API
# Add some experimental data
# Do we need a "expt" namespace on these values?
fe_map.add_node("ligand_A", ExperimentalResult(deltaG=-8.93*unit.kilocalories_per_mole,
variance=0.10*unit.kilocalories_per_mole))
fe_map.add_node("ligand_B", ExperimentalResult(deltaG=-9.11*unit.kilocalories_per_mole,
variance=0.10*unit.kilocalories_per_mole))
# Add some simulated data
# Dropping the units to save some space here
fe_map.add_edge("ligand_A", "ligand_B", RelativeResult(deltadeltaG=0.36*unit.kilocalories_per_mole,
variance=0.11*unit.kilocalories_per_mole))
# We can also create a network from a CSV file
fe_map = arsenic.network.FEMap.from_csv("my_data.csv")
# We can save the network to disk as a GraphML XML
fe_map.save("my_network")
# Calculate statistics and maximum likelihood estimate
fe_stats = arsenic.stats.analyze_map(fe_map)
fe_stats.calculate("all") # ['RMSE', 'MUE', 'R2', 'rho','RAE','KTAU']
# returns pandas data frame
my_data_frame = fe_stats.to_pandas()
# Plotting
plot = arsenic.viz.make_plot(backend="matplotlib")
# Add plotting options
options = {'method_name': 'softwarename',
           'target_name': 'made up protein',
           'color': 'hotpink',
           'guidelines': False}
# Make plot
plot.draw(fe_map, **options)
```
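To make the sketch above self-contained, here is one possible shape for the two result classes; this is purely hypothetical, and the real objects in the repo may look different.

```python
from dataclasses import dataclass
from openff.units import unit

@dataclass
class ExperimentalResult:
    """Hypothetical: an absolute experimental measurement attached to a node."""
    deltaG: unit.Quantity
    variance: unit.Quantity

@dataclass
class RelativeResult:
    """Hypothetical: a calculated relative free energy attached to an edge."""
    deltadeltaG: unit.Quantity
    variance: unit.Quantity

# Example construction, mirroring the values used above:
expt_A = ExperimentalResult(deltaG=-8.93 * unit.kilocalorie / unit.mole,
                            variance=0.10 * unit.kilocalorie / unit.mole)
calc_AB = RelativeResult(deltadeltaG=0.36 * unit.kilocalorie / unit.mole,
                         variance=0.11 * unit.kilocalorie / unit.mole)
```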
I wonder if we also want a way to hold the information from different experiments (simulations) in the same FEMap for comparison. As in, I run the same set of transformations using one set of parameters and I want to compare against the same set of transformations run with a different set of parameters (or maybe even different engines). For this it might be desirable to see the results of both experiments in the same plot.
This may be out of the scope of this issue, but I am now faced with this situation where I want to see results from two different experiments in the same plot, to check which dots/transformations are really different between both. Does that make any sense?
@ijpulidos yeah I think allowing multiple edges (measurements) between the same nodes (chemical states) is definitely a good idea. It does imply there's also some way to tag edges with an annotation to differentiate/group them later on.
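A quick sketch of that idea with a plain `nx.MultiDiGraph` (the `source` annotation key is just an illustration): each parallel edge carries a tag identifying which run it came from, so results can be grouped or filtered later.

```python
import networkx as nx

g = nx.MultiDiGraph()
# Two measurements of the same transformation, tagged by where they came from.
g.add_edge("ligand_A", "ligand_B", DDG=0.36, dDDG=0.11, source="protocol_1")
g.add_edge("ligand_A", "ligand_B", DDG=0.42, dDDG=0.10, source="protocol_2")

# Later, pull out only the edges belonging to one run for a per-protocol plot.
protocol_1_edges = [
    (u, v, d) for u, v, d in g.edges(data=True) if d["source"] == "protocol_1"
]
```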
re: multiple calculation types on one graph, one of the questions I would ask here is what advantage do we provide by having a single graph rather than a graph per experiment? The main reason for mixing calculated and experimental data on the same graph is so that we can potentially normalise one to the other, however with two separate sets of calculations I'm not sure what advantage you get. Playing devil's advocate, would it not be more suitable to have separate graphs and then let users combine data for further analysis?
@IAlibay While it's mostly just a convenience for bookkeeping and wrangling data, I guess another case where this might come in handy is when comparing different replicates from simulations using the same methods or protocol. This can be used to check some kind of convergence between them. I have been using the `_master_plot` function to compare different replicates of simulations this way. I guess having them in the FEMap is just going to make it even more straightforward.
The current plan is to create a new API that will allow both connected and disconnected nodes (i.e. relative and absolute results) to be passed directly to arsenic.
To be discussed further over the next few weeks - @mikemhenry is currently in charge of the first pass of this API.