Closed anton-bushuiev closed 1 year ago
Taking another look at this. It looks like the switch from distance matrices of the shape N x N to 1 x N x N is breaking PyG-style batching. Do you have any arguments against changing this back?
No, I agree that N x N is the right option. However, I did not change this. It is caused by this line, which was there before.
Kudos, SonarCloud Quality Gate passed!
I've made some changes to the CI/CD in #244 which I expect will resolve the test failures here
It looks like we're running into an issue when it comes to collating the distance matrices into a batch.
I suppose we have three options:
1. Flattening (n x n -> n^2 x 1)
2. 1 x n x n, which then becomes 1 x max(n) x max(n), where max(n) is the largest number of nodes in the batch and smaller proteins are padded appropriately
Do you have any strong opinions?
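To make the padding option concrete, here is a minimal sketch of how matrices of different sizes could be stacked into one 1 x max(n) x max(n)-style batch. It uses plain NumPy rather than PyG, and the helper name `pad_and_stack` and the `pad_value` parameter are assumptions made for illustration, not part of Graphein or PyG.

```python
import numpy as np


def pad_and_stack(dist_mats, pad_value=0.0):
    """Stack square matrices of different sizes into one (B, max_n, max_n) array.

    Hypothetical helper: smaller matrices are placed in the top-left corner
    and the remaining entries are filled with pad_value.
    """
    max_n = max(m.shape[0] for m in dist_mats)
    batch = np.full((len(dist_mats), max_n, max_n), pad_value, dtype=float)
    for i, m in enumerate(dist_mats):
        n = m.shape[0]
        batch[i, :n, :n] = m
    return batch


# Two toy "proteins" with 2 and 3 nodes
small = np.ones((2, 2))
large = np.ones((3, 3))
padded = pad_and_stack([small, large])
```

The cost of this layout is that every matrix in the batch takes max(n)^2 entries, and downstream code has to know which entries are padding.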
That's a good question. I'll try it and let you know.
Hi, @a-r-j!
From my perspective, (1) is much better than (2) but (3) is the best. I would not serialize distance matrices for several reasons:
What do you think about it?
Very good points @anton-bushuiev.
I am not sure about not supporting it at all. While distance matrices are easy to compute, we can still run into scenarios where there are matrix features we may want to include (e.g. Hbond map). Thus, I propose:
columns arg if they are to be included
I agree with the first two points but I am not sure about the sparse format. When working with protein graphs, distance matrices are always dense because physical laws allow zeros only on the diagonal. That's why I think simple reshaping would be the best:
# Flatten before writing
data.dist_mat = data.dist_mat.reshape(data.num_nodes * data.num_nodes)
# Restore after reading
data.dist_mat = data.dist_mat.reshape((data.num_nodes, data.num_nodes))
Am I missing scenarios where they may be sparse? What is an Hbond map (could you please send a link to its usage in Graphein)?
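As a rough illustration of how the flatten-and-restore idea could extend to a batch, the sketch below concatenates flattened matrices and recovers each one from the per-graph node counts. It uses NumPy only; the helper name `split_flat_dist_mats` is hypothetical, and it assumes the node count of every graph in the batch is known.

```python
import numpy as np


def split_flat_dist_mats(flat, num_nodes_per_graph):
    """Recover per-graph n x n matrices from one concatenated flat vector.

    Hypothetical helper: assumes `flat` holds the row-major flattened
    matrices back to back, in the same order as `num_nodes_per_graph`.
    """
    sizes = [n * n for n in num_nodes_per_graph]
    offsets = np.cumsum(sizes)[:-1]  # split points between graphs
    chunks = np.split(flat, offsets)
    return [c.reshape(n, n) for c, n in zip(chunks, num_nodes_per_graph)]


a = np.arange(4, dtype=float).reshape(2, 2)
b = np.arange(9, dtype=float).reshape(3, 3)

# Flatten before batching, restore afterwards
flat = np.concatenate([a.reshape(-1), b.reshape(-1)])
restored = split_flat_dist_mats(flat, [2, 3])
```

Unlike padding, this stores exactly n^2 values per graph, at the price of needing the node counts to undo the flattening.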
Hi @anton-bushuiev,
Your commit for "Improve convert_nx_to_pyg" added some breaking changes. Edges do not necessarily have a kind unless I'm not understanding something?
# Split edge index by edge kind
kind_strs = np.array(list(map(lambda x: "_".join(x), data["kind"])))
for kind in set(kind_strs):
    key = f"edge_index_{kind}"
    if key in self.columns:
        mask = kind_strs == kind
        data[key] = edge_index[:, mask]
if "kind" not in self.columns:
    del data["kind"]
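One possible way to tolerate edges without a kind is to skip the split when that attribute is absent. The sketch below is a standalone NumPy illustration of the idea, not Graphein's actual code: the function name `split_edge_index_by_kind`, its arguments, and the example kind names are all assumptions.

```python
import numpy as np


def split_edge_index_by_kind(edge_index, kinds, columns):
    """Split a (2, E) edge index into per-kind sub-indices.

    Hypothetical helper. `kinds` holds one iterable of strings per edge,
    or is None when the edges carry no kind, in which case nothing is split.
    Only keys listed in `columns` are returned.
    """
    out = {}
    if kinds is None:
        return out
    kind_strs = np.array(["_".join(k) for k in kinds])
    for kind in set(kind_strs.tolist()):
        key = f"edge_index_{kind}"
        if key in columns:
            out[key] = edge_index[:, kind_strs == kind]
    return out


edge_index = np.array([[0, 1, 2], [1, 2, 0]])
kinds = [("peptide_bond",), ("hbond",), ("peptide_bond",)]  # illustrative kinds

with_kinds = split_edge_index_by_kind(edge_index, kinds, ["edge_index_peptide_bond"])
without_kinds = split_edge_index_by_kind(edge_index, None, ["edge_index_peptide_bond"])
```

Returning an empty mapping when no kind is present keeps the conversion usable for graphs built without edge kinds.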
Reference Issues/PRs
No Reference Issues/PRs
What does this implement/fix? Explain your changes
This PR fixes the bugs related to the processing of PyG data.
graphein.ml.conversion.convert_nx_to_pyg
data.coords tensor with an extra dimension because coords is also a “graph-level feature”.
torch.Tensors. Makes further usage much easier and the resulting data object much more PyG-like.
graphein.ml.visualisation.plot_pyg_data
plotly_protein_structure_graph. Currently, it lacks the positional value for node_size_feature and the order of the following arguments is completely broken.
coords processing to the change number 2 in convert_nx_to_pyg.
What testing did you do to verify the changes in this PR?
Pull Request Checklist
./CHANGELOG.md file (if applicable)
./graphein/tests/* directories (if applicable)
./notebooks/ (if applicable)
python -m py.test tests/ and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
black . and isort .