Rework FEMap API - Githubissues

IAlibay commented 2 years ago

Current plan is to create an new API that will allow for both connected and disconnected nodes (i.e. relative and absolute results) to be passed in directly to arsenic.

To be discussed further over the next few weeks - @mikemhenry is currently in charge of the first pass of this API.

mikemhenry commented 2 years ago

What kinds of inputs do we want to support? I was thinking that we would want to support csv some sort of dictionary that has all the bits, and maybe a networkX graph?

csv Legacy format, people probably would want to use their old csv to check to see if there are any regressions in our changes.
dictionary Python dictionary with all the bits we need to build the networkx graph. This could be useful for users that don't want to write out a csv/want to use the API directly
networkX graph Could be useful for users that really know what they are doing but want to use our analysis tooling

IAlibay commented 2 years ago

I think all three would be great, but as a priority CSV + network object since that's how we'll interact with it short term

mikemhenry commented 2 years ago

I also think the dictionary will take some iteration/time to figure out the best structure and may be best to wait after we get the rest of the API figure out.

mikemhenry commented 2 years ago

RE: Using networkx directed graph

DiGraphs hold directed edges. Self loops are allowed but multiple (parallel) edges are not. We have talked about parallel edges being allowed, will we need to use a different networkx object to store our network?

mikemhenry commented 2 years ago

Will we always have a 1-1 mapping of simulated data + experimental data? Or will we have a case where one set is a super or subset of the other? Like could there be a case where we have experimental data for ligand A and ligand C but we also simulate ligand B which is part of the A->B->C network but we don't have experimental data for it?

IAlibay commented 2 years ago

Like could there be a case where we have experimental data for ligand A and ligand C but we also simulate ligand B which is part of the A->B->C network but we don't have experimental data for it?

Yes 100% we should expect that we have fewer experimental points than computed points. That's pretty much the standard use case - i.e. you have affinity data for a few ligands and then you do FEPs based on these initial ligands to attempt to find a better binder before the next round of synthesis.

mikemhenry commented 2 years ago

This is what I am thinking for the FEMap API. I included a bit for plotting a stats, with the important part being that someone could just use part of arsenic and then use another tool of their choice. For example, they may just want to use our stat methods so being able to pull out a pandas data frame that they can use in another tool would be good for interoperability.

import arsenic
# Three 'main' module name spaces 
import arsenic.network
import arsenic.stats
import arsenic.viz 

# Main python entry point for making the FE Network
fe_map = arsenic.network.FEMap()

# Python API
# Add some experimental data
# Do we need a "expt" namespace on these values?
fe_map.add_node("ligand_A", DeltaG=-8.93*unit.kilocalories_per_mole, dDeltaG=0.10*unit.kilocalories_per_mole)
fe_map.add_node("ligand_B", DeltaG=-9.11*unit.kilocalories_per_mole, dDeltaG=0.10*unit.kilocalories_per_mole)

# Add some simulated data
# Dropping the units to save some space here
fe_map.add_edge("ligand_A", "ligand_B", calc_DDG=0.36, calc_dDDG_MBAR=0.11, calc_dDDG_additional=0.0)

# Loops work, not sure what data we want to store
fe_map.add_edge("ligand_A", "ligand_A")

# network is directional, so we can do A->B, B->A checks
fe_map.add_edge("ligand_B", "ligand_A", calc_DDG=0.36, calc_dDDG_MBAR=0.11, calc_dDDG_additional=0.0)

# For absolute free energy, the environment is needed for bookkeeping 
# Under the hood we will name the node ligand_H_Complex, and ligand_H_Solvent
fe_map.add_node("ligand_H", DDG=-8.83, dDDG=0.10, environment="Complex")
fe_map.add_node("ligand_H", DDG=-8.83, dDDG=0.10, environment="Solvent")

# We can also create a network from a CSV file
fe_map = arsenic.network.FEMap.from_csv("my_data.csv")

# Calculate statistics 
fe_stats = arsenic.stats.analyze_map(fe_map)
fe_stats.calculate("all") # ['RMSE', 'MUE', 'R2', 'rho','RAE','KTAU']
my_data_frame = fe_stats.to_pandas() # returns pandas data frame

# Plotting
plot = arsenc.viz.make_plot(backend="matplotlib")
# Add plotting options
options = {'method_name': 'softwarename',
 'target_name': 'made up protein',
 'color': 'hotpink',
 'guidelines': False}

# Make plot
plot.draw(plot, **options)

@richardjgowers let me know what you think!

I'm not sure if we should just use edge/node or do something like add_experiment and add_calculation.

richardjgowers commented 2 years ago

So you're using string labels, but I'm assuming any hashable works, i.e. we could be dropping in gufe systems, but also that's not required.

I'm guessing that under the hood there's a nx.multidigraph actually doing the data management.

I'm tempted to say that maybe the measurements (calculated or exptl) should be an object themselves (somewhere in this repo there's ExperimentalResults and RelativeResult objects). These could handle a few quirky cases like:

asymmetric errors (see also experimental measurement limits for non actives)
storing the individual legs of a calculation

So something like g.add_edge(systemA, systemB, RelativeResult(...)).

I would say you could be clever and use __float__ to make these fancy objects quack like floats, but you'll still always want the uncertainty, so probably better to make them quack like a pint.Measurement

orbeckst commented 2 years ago

Outsider (TAC) comment:

networkx graph as data structure looks very sensible (and should be easy to convert to a dict if really needed)
I'd look into quantities that can carry uncertainties (see e.g. uncertainties – perhaps pint.Measurement is doing the same?); having the opportunity to do automatic error propagation is nice
From our experience with https://github.com/Becksteinlab/multibind , I would pay attention to cycles in the graphs, as they are valuable both for error checking and for strong constraints on the data.

mikemhenry commented 2 years ago

Okay minor changes, using a class for the results and using deltaG deltadeltaG and variance instead of using the old convention here https://github.com/OpenFreeEnergy/arsenic#terminology

from opennff.units import unit
import arsenic
# Three 'main' module name spaces 
import arsenic.network
import arsenic.stats
import arsenic.viz 

# Main python entry point for making the FE Network
fe_map = arsenic.network.FEMap()

# Python API
# Add some experimental data
# Do we need a "expt" namespace on these values?

fe_map.add_node("ligand_A", ExperimentalResult(deltaG=-8.93*unit.kilocalories_per_mole, 
                                               variance=0.10*unit.kilocalories_per_mole))
fe_map.add_node("ligand_B", ExperimentalResult(deltaG=-9.11*unit.kilocalories_per_mole, 
                                               variance=0.10*unit.kilocalories_per_mole))

# Add some simulated data
# Dropping the units to save some space here
fe_map.add_edge("ligand_A", "ligand_B", RelativeResult(deltadeltaG=0.36*unit.kilocalories_per_mole, 
                                                       variance=0.11*unit.kilocalories_per_mole))

# We can also create a network from a CSV file
fe_map = arsenic.network.FEMap.from_csv("my_data.csv")

# We can save the network to disk as a GraphML XML
fe_map.save("my_network")

# Calculate statistics and maximum likelihood estimate 
fe_stats = arsenic.stats.analyze_map(fe_map)
fe_stats.calculate("all") # ['RMSE', 'MUE', 'R2', 'rho','RAE','KTAU']

# returns pandas data frame
my_data_frame = fe_stats.to_pandas() 

# Plotting
plot = arsenc.viz.make_plot(backend="matplotlib")
# Add plotting options
options = {'method_name': 'softwarename',
 'target_name': 'made up protein',
 'color': 'hotpink',
 'guidelines': False}

# Make plot
plot.draw(plot, **options)

ijpulidos commented 2 years ago

I wonder if we also want a way to have the information of different experiments (simulations) in the same FEMap for comparison. As in, I run the same set of transformations using a set of parameters and I want to get a comparison with the same set of transformations using a different set of parameters (or maybe even different engines). For this it might be desired that we want to see the results of both experiments in the same plot.

This may be out of the scope of this issue, but I am now faced with this situation where I want to see results from two different experiments in the same plot, to check which dots/transformations are really different between both. Does that make any sense?

richardjgowers commented 2 years ago

@ijpulidos yeah I think allowing multiple edges (measurements) between the same nodes (chemical states) is definitely a good idea. It does imply there's also some way to tag edges with an annotation to differentiate/group them later on.

IAlibay commented 2 years ago

re: multiple calculation types on one graph, one of the questions I would ask here is what advantage do we provide by having a single graph rather than a graph per experiment? The main reason for mixing calculated and experimental data on the same graph is so that we can potentially normalise one to the other, however with two separate sets of calculations I'm not sure what advantage you get. Playing devil's advocate, would it not be more suitable to have separate graphs and then let users combine data for further analysis?

ijpulidos commented 11 months ago

@IAlibay While it's mostly just a convenience for bookkeeping and wrangling data, I guess another case this might come handy is when comparing different replicates from simulations using the same methods or protocol. And this can be applied to check some kind of convergence between them. I have been using the _master_plot function to compare different replicates of simulations this way. I guess having them in the FEMap is just going to make it even more straight forward.

OpenFreeEnergy / cinnabar

Rework FEMap API #55