GPCR-ModSim / qfepweb

Development of the web interface to QligFEP.
GNU Lesser General Public License v2.1
2 stars 1 forks source link

Random failure when uploading a SDF. #13

Closed xbello closed 3 years ago

xbello commented 3 years ago

When a SDF is uploaded, the system sometimes runs smoothly, and sometimes hangs/crashes.

My main suspect is that the SDF file "pointer" is sometimes GC'd when is passed to RDKit, and in the next round you get an empty reference (use a more real file than an io stream?)

Another possibility is that the file isn't getting rewinded properly sometimes, so RDKit receives a valid file pointer, but at the EOF (assert that file is at 0 before passing?)

JenkeScheen commented 3 years ago

Could this have to do with the fact that in some cases RDKit, when it fails to read a molecule just returns None instead of throwing an exception? This is behaviour that they've set deliberately (I'm sure there's a good reason).

xbello commented 3 years ago

It that is true, we have to shield against Nones and... retry? fail (not user-friendly)?

Silencing errors is a mistake, IMO. Even more if they happen randomly.

JenkeScheen commented 3 years ago

I'll run some tests, could you point me to an example molecule that is failing (sometimes)?

xbello commented 3 years ago

https://github.com/GPCR-ModSim/qfepweb/blob/main/networkgen/test_files/CDK2_ligands.sdf

JenkeScheen commented 3 years ago

I'm having some trouble reproducing your bug, so perhaps RDKit isn't the problem here? image Could you point me to where you use RDKit to read in the molecules?

xbello commented 3 years ago

Usually, here:

https://github.com/GPCR-ModSim/qfepweb/blob/6452969f41ca4f194b9c7d69e6761bcff5407cbc/networkgen/mapgen.py#L39

You'll need to create some object with the field "in_sdf", and put a file there. I'm using "io" from the python stdlib.

The file is also read here: https://github.com/GPCR-ModSim/qfepweb/blob/6452969f41ca4f194b9c7d69e6761bcff5407cbc/networkgen/validators.py#L5 to perform the first validation.

So the current processing involves:

To reiterate: we are not reading from a disk file, but from a memory file.

JenkeScheen commented 3 years ago

It's not immediately clear to me where this could go wrong, as you say it might be more of a memory issue. AFAIK RDKit shouldn't have any trouble with loading these molecules.

You could potentially build in two more checks in the validator:

These are just the RDKit checks I can think of, if it is indeed an issue with file passing I unfortunately don't have many ideas.

JenkeScheen commented 3 years ago

FWIW, doing simple checks like that with RDKit shouldn't add any considerable computing cost to the workflow: image

xbello commented 3 years ago

It's not immediately clear to me where this could go wrong, as you say it might be more of a memory issue. AFAIK RDKit shouldn't have any trouble with loading these molecules.

You could potentially build in two more checks in the validator:

Sure we can. And "one only" should be gone, as we plan to support multiple file upload.

JenkeScheen commented 3 years ago

just looking at this now, it might be a good idea to add docstrings to mapgen.py as at the moment it is completely unclear what each function does, or why they exist.

JenkeScheen commented 3 years ago

my guess is that @shevsea will have the best intuition for this - I'd suggest a simple line to describe the main function idea should be enough, just as in https://github.com/GPCR-ModSim/qfepweb/blob/main/networkgen/mapgen.py#L64.

JenkeScheen commented 3 years ago

posting discussion (+ my thoughts) with @xbello here for posterity.

There seem to be two root causes for this issue, being: 1) in cases where only two ligands are in a group (i.e. either in a charge group, or only two ligands were uploaded to begin with) the cyclisation algorithm fails because an outer_node is unable to connect to another. 2) in some cases a cycle is created leaving one remaining outer_node that is unable to connect to any other outer_nodes.

Both cases cause the while loop in make_map() to hang. I think ideally we'd refactor this code to just use a for loop (the need for a while here isn't really obvious to me): it should be okay to just loop over outer_nodes, connect them and then call it a day (potentially double-check if everything is in a cycle).

for 1): it would save us a lot of work (and some user compute time) to just catch cases where there are only two ligands and disregard the whole network generation algorithm (since it's obvious what the graph will look like).

for 2): I don't quite understand why in this stage outer_nodes are forced to connect to outer_nodes only. Presumably any outer_node should be cyclised to whichever other node it has the highest similarity (or some other metric) to? I'd be curious to find out the rationale for this @shevsea. Regardless, if this is intended then there should be functionality to revert to connecting to an inner_node in cases where no outer_nodes are available.

xbello commented 3 years ago

To clarify case 1), two outer_nodes cannot never get out of outer_nodes, because the condiction to be an outer_node is to have 1 edge, that by definition is the number of edges two nodes can have. So the two outer_nodes get connected, but are still outer nodes.

shevsea commented 3 years ago

Hi both,

Sorry for my long absence, I'm on vacation until Wednesday.

Regarding 1) I find @JenkeScheen suggestion of neglecting the algorithm and just connecting both ligands ideal.

Concerning 2) initially the algorithm was written so in the cyclisation process all the nodes are considered, however I believe this lead me to crowded/high density graphs (since you'll need to add more edges to compensate the removal of the restraint), I also remember this first implementation had poorer performance MUE wise.

JenkeScheen commented 3 years ago

okay, should be easy enough to implement. I'll try to find the time to PR something

JenkeScheen commented 3 years ago

1) is pretty easy, 2) I'm not sure yet. @xbello, is there any way for me to get your observed behaviour of the while loop 'sometimes' hanging? The multi-sdf https://github.com/GPCR-ModSim/qfepweb/blob/main/networkgen/test_files/CDK2_ligands.sdf seems to work consistently. If you see no way of getting it consistently, does it always hang at the last outer_node (i.e. len(self.outer_nodes(H)) == 1), or do you think there can be multiple during the hang?

xbello commented 3 years ago

The only consistent hang I got was with two ligands.

The 2) is what you said. At least we have a suspect (alone outer_node) we could solve, and keep going until the it happens again. We could exit the while on that condition.

Another option is to keep track on new edges created on each while loop: if no new edges, exit immediately because we are entering an infinite loop. This also works with the 2-nodes event. Makes sense?

JenkeScheen commented 3 years ago

yes, that sounds reasonable. Will implement your last point. I think I'll keep in the 'stop if only two ligands' for now because it improves readability.

hteran commented 3 years ago

This sounds the most reasonable solution to me, good job all!

On 10 Aug 2021, at 11:48, JenkeScheen @.**@.>> wrote:

yes, that sounds reasonable. Will implement your last point. I think I'll keep in the 'stop if only two ligands' for now because it improves readability.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/GPCR-ModSim/qfepweb/issues/13#issuecomment-895889319, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABU6HHQXURGHXFB3AIPK34TT4DYWRANCNFSM5AGIABCQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email.

När du har kontakt med oss på Uppsala universitet med e-post så innebär det att vi behandlar dina personuppgifter. För att läsa mer om hur vi gör det kan du läsa här: http://www.uu.se/om-uu/dataskydd-personuppgifter/

E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/en/about-uu/data-protection-policy

xbello commented 3 years ago

17 Closes this. Will reopen if we hit it again.