covidgraph / motherlode

Pipeline for running all dataloader scripts for covidgraph in a controlled manner.
https://covidgraph.org
MIT License

CORD-19 Dataset antiviral compounds #26

Open motey opened 4 years ago

motey commented 4 years ago

Version 5 of the CORD-19 dataset included a list of anti-viral candidate compounds.

Actually, I just realized that this data was attached by the team working on a Python tool around the CORD-19 dataset (https://github.com/josephsdavid/cord-19-tools). At the moment there is no documentation of where they got this data.

The attached readme says the following:

CAS COVID-19 Anti-Viral Candidate Compounds Readme

* The dataset includes 49,437 anti-viral candidate compounds (known and similar) created from the CAS REGISTRY of chemical substances;
* The dataset is being provided in SDfile format, which includes the complete Molfile representation and other information such as cas.rn, cas.index.name, molecular.formula, molecular.weight, melting.point.experimental, and other property data. 
* Dataset entities represent known anti-viral drugs and related chemical substances that are structurally similar to a known anti-viral.

We have one huge file containing a long list of compounds described in MOL format. One entry looks like this:

Cobicistat
C40H53N7O5S2
1004316-88-4 Copyright (C) 2020 ACS
 54 58  0  0  1  0  0  0  0  0999 V2000
48827.327914512.9573    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
48827.3279 9675.3049    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
44637.796116931.7835    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
53016.859716931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
40448.266814512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386614512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737416931.7835    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
40448.2668 9675.3049    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
61395.913516931.7835    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
36258.737421769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208114512.9573    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
61395.913521769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
65585.445314512.9573    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
32069.208124188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
27879.676316931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
69774.977116931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
23690.146914512.9598    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
27879.676321769.4359    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
73964.508914512.9573    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
69774.977121769.4359    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
19500.617616931.7860    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
23690.1469 9675.3073    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
78154.035816931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
 2843.498713391.2040    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  875.850117810.6207    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000 9477.4613    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.8597 7256.4786    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.8597 2418.8262    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.3866 9675.3049    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.3866    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
61395.9135 7256.4811    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
61395.9135 2418.8262    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386624188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386629025.9146    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.859721769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.859731444.7408    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
48827.327924188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
48827.327929025.9146    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208129025.9170    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
27879.676331444.7433    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737431444.7433    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
27879.676336282.3957    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737436282.3932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208138701.2219    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
15311.088314512.9598    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
10891.671516480.6083    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
14805.4142 9701.8075    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
 7654.650912885.5324    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
10073.4772 8696.0031    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
82343.562714512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
86762.979416480.6083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
82849.2367 9701.8075    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
90000.000012885.5324    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
87581.1738 8696.0031    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  6  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  2 27  1  0  0  0  0
  3  5  1  0  0  0  0
  4  6  1  0  0  0  0
  5  7  1  0  0  0  0
  5  8  2  3  0  0  0
  6  9  1  0  0  0  0
  7 10  1  1  0  0  0
  7 11  1  0  0  0  0
  9 12  1  6  0  0  0
  9 13  1  0  0  0  0
 10 14  1  0  0  0  0
 11 15  1  0  0  0  0
 12 33  1  0  0  0  0
 13 16  1  0  0  0  0
 14 39  1  0  0  0  0
 15 17  1  0  0  0  0
 15 18  2  3  0  0  0
 16 19  1  0  0  0  0
 16 20  2  3  0  0  0
 17 21  1  0  0  0  0
 17 22  1  0  0  0  0
 19 23  1  0  0  0  0
 21 45  1  0  0  0  0
 23 50  1  0  0  0  0
 24 25  1  0  0  0  0
 24 26  1  0  0  0  0
 24 48  1  0  0  0  0
 27 28  2  0  0  0  0
 27 29  1  0  0  0  0
 28 30  1  0  0  0  0
 29 31  2  0  0  0  0
 30 32  2  0  0  0  0
 31 32  1  0  0  0  0
 33 34  2  0  0  0  0
 33 35  1  0  0  0  0
 34 36  1  0  0  0  0
 35 37  2  0  0  0  0
 36 38  2  0  0  0  0
 37 38  1  0  0  0  0
 39 40  1  0  0  0  0
 39 41  1  0  0  0  0
 40 42  1  0  0  0  0
 41 43  1  0  0  0  0
 42 44  1  0  0  0  0
 43 44  1  0  0  0  0
 45 46  1  0  0  0  0
 45 47  2  0  0  0  0
 46 48  2  0  0  0  0
 47 49  1  0  0  0  0
 48 49  1  0  0  0  0
 50 51  2  0  0  0  0
 50 52  1  0  0  0  0
 51 53  1  0  0  0  0
 52 54  1  0  0  0  0
 53 54  2  0  0  0  0
M  END
> <cas.rn>
1004316-88-4

> <cas.index.name>
2,7,10,12-Tetraazatridecanoic acid, 12-methyl-13-[2-(1-methylethyl)-4-thiazolyl]-9-[2-(4-morpholinyl)ethyl]-8,11-dioxo-3,6-bis(phenylmethyl)-, 5-thiazolylmethyl ester, (3R,6R,9S)-

> <molecular.formula>
C40H53N7O5S2

> <molecular.weight>
776.02

> <boiling.point.predicted>
974.5±65.0 °C    Press: 760 Torr

> <density.predicted>
1.228±0.06 g/cm3    Temp: 20 °C; Press: 760 Torr

> <pka.predicted>
11.86±0.46    Most Acidic Temp: 25 °C
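For anyone who wants to read such an entry without a chemistry library, here is a minimal sketch in plain Python. It assumes the standard V2000 fixed-width layout (which is why the x and y coordinates above appear to run together) and the `> <field>` data items shown in the entry. The function name `parse_sd_record` and the tiny demo record are made up for illustration and are not part of the dataset or of any existing loader.

```python
import re

FIELD_HEADER = re.compile(r"^>\s*<([^>]+)>")  # matches e.g. "> <cas.rn>"


def parse_sd_record(record):
    """Split one SDfile record into name, atoms, bonds and data fields."""
    lines = record.splitlines()
    # Line 4 is the counts line: atom and bond counts are 3-character columns.
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # x, y, z are three 10-character columns, so slice instead of split().
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((line[31:34].strip(), x, y, z))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # first atom index, second atom index, bond type (3 characters each)
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    fields, current = {}, None
    for line in lines[4 + n_atoms + n_bonds:]:
        header = FIELD_HEADER.match(line)
        if header:
            current = header.group(1)
            fields[current] = []
        elif not line.strip():
            current = None  # a blank line ends the current data item
        elif current is not None:
            fields[current].append(line)

    return {
        "name": lines[0].strip(),
        "atoms": atoms,
        "bonds": bonds,
        "fields": {key: "\n".join(value) for key, value in fields.items()},
    }


# Usage with a made-up two-atom record (values are illustrative only).
def _atom(x, y, symbol):
    return f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3} 0  0"


demo = "\n".join([
    "Demo", "C2H6", "made-up record",
    f"{2:3d}{1:3d}  0  0  0  0  0  0  0  0999 V2000",
    _atom(48827.3279, 14512.9573, "C"),
    _atom(0.0, 9477.4613, "C"),
    f"{1:3d}{2:3d}{1:3d}  0  0  0  0",
    "M  END",
    "> <cas.rn>", "0000-00-0", "",
    "> <molecular.formula>", "C2H6", "",
])
parsed = parse_sd_record(demo)
print(parsed["atoms"], parsed["fields"])
```

Slicing by column rather than splitting on whitespace matters here, because two adjacent 10-character coordinates leave no space between them.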

Is this data interesting for our scope? @mpreusse: "YES"

How can we connect this data to our graph? As discussed with @mpreusse, we could search for the molecular formula ("C40H53N7O5S2" in the example above) in :PatentAbstract{text} and connect the matching nodes. Other ideas are welcome.
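A rough sketch of that text-matching idea, done outside the database in plain Python. The function names and the dictionary shapes (compound id to formula, abstract id to text) are assumptions for illustration; the real abstracts would come from the graph. The pattern only accepts the formula as a standalone token, so a short formula does not match inside a longer one. Note that different compounds can share one formula (isomers), so the result is a list of candidate links rather than confirmed ones.

```python
import re


def formula_pattern(formula):
    """Regex that matches the formula only when it is not part of a longer token."""
    return re.compile(r"(?<![A-Za-z0-9])" + re.escape(formula) + r"(?![A-Za-z0-9])")


def link_compounds_to_abstracts(formulas, abstracts):
    """Return sorted (compound_id, abstract_id) pairs where the formula occurs in the text.

    formulas:  dict of compound id -> molecular formula (hypothetical shape)
    abstracts: dict of abstract id -> abstract text (hypothetical shape)
    """
    links = []
    for compound_id, formula in formulas.items():
        pattern = formula_pattern(formula)
        for abstract_id, text in abstracts.items():
            if pattern.search(text):
                links.append((compound_id, abstract_id))
    return sorted(links)


# Usage with made-up abstract texts.
formulas = {"1004316-88-4": "C40H53N7O5S2", "demo-compound": "C2H6O"}
abstracts = {
    "p1": "A composition comprising C40H53N7O5S2 and a carrier.",
    "p2": "Mixtures of C2H6O2 with water.",
}
links = link_compounds_to_abstracts(formulas, abstracts)
print(links)
```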

dkrizic commented 4 years ago

I have worked with compounds in Neo4j. In one project we created a molecular substructure search as a plugin for Neo4j that supported MOL V2000 and SMILES/SMARTS. Is this what you are asking about? I already mentioned this to @mpreusse.

motey commented 4 years ago

Sounds good. Is this plugin open source?

dkrizic commented 4 years ago

No, the plugin is not open source. There are multiple versions and iterations. The first one was based on the EPAM Indigo library (https://lifescience.opensource.epam.com/indigo/), which worked fine, but we always had clashes with the dependent libraries and class loader issues in Neo4j. We had to switch to another software vendor for a chemical library.

But... I suggest that we implement the following:

It would help me to understand what we need "in a chemical way".

motey commented 4 years ago

We do need/want a substructure search function for the application/end users, correct? Or do we just want to fingerprint the molecules once and connect them? If the latter, what about using the Indigo library in a pre-processing step (in Python, for example, with https://pypi.org/project/epam.indigo/)? From my (naive) view that looks a lot less complex, less error-prone, and maybe faster, as it would process data locally.
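To make the "fingerprint once and connect" option concrete, here is a minimal sketch of the connecting step only. It assumes each molecule's fingerprint is already available as a set of on-bit indices (in practice produced by a chemistry library such as Indigo; that step is not shown). The toy fingerprints, the function names, and the 0.5 threshold are made up for illustration.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def similar_pairs(fingerprints, threshold):
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    pairs = []
    for id_a, id_b in combinations(sorted(fingerprints), 2):
        score = tanimoto(fingerprints[id_a], fingerprints[id_b])
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Usage with toy fingerprints; real ones would have many more bits.
fingerprints = {
    "a": {1, 2, 3, 4},
    "b": {2, 3, 4, 5},
    "c": {10, 11},
}
edges = similar_pairs(fingerprints, threshold=0.5)
print(edges)
```

Each resulting tuple could become one similarity relationship between two compound nodes. The pairwise loop is quadratic, so for roughly 49,000 compounds some pre-filtering would likely be needed.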

seangrant82 commented 4 years ago

Would this be helpful for linking what is in CORD-19 with this dataset: https://www.cas.org/covid-19-antiviral-compounds-dataset?utm_source=hootsuite&utm_medium=linkedin&utm_term=&utm_content=9a9f1234-6bd2-4673-9436-bb49800209ca&utm_campaign=COVID-19

bramble50 commented 4 years ago

For processing molecules I would recommend RDKit (https://github.com/rdkit/rdkit) rather than Indigo; it's much better supported and has a wider variety of functionality.

I suspect it will be enough just to connect the molecules to the existing data as compound nodes, ideally by compound->publication_id. You can always add additional features like similarity/substructure searching afterwards.

In terms of adding/searching for molecules, I would stay away from using non-unique keys such as the chemical formula (unless just as metadata on the node). Even the CAS numbers used in the above example are not unique or persistent. If there is no unique identifier from the database they come from, or you want to add multiple chemical sources in future, then the best way to do this is to calculate InChIs or InChIKeys from the MOL files/SDFs above using RDKit (ideally with a molecular standardisation process as well).

Other datasets I would look at for mapping include ChEMBL, PubChem (although it's a bit noisy) and SureChEMBL (ChEMBL for patents). I can create extra issues for these if needed.
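A minimal sketch of the InChIKey idea, assuming RDKit is installed with InChI support. `Chem.MolFromMolBlock` and `Chem.MolToInchiKey` are RDKit calls; the helper names, the record shape (source id plus MOL block), and the grouping step are made up for illustration. The standardisation step mentioned above is left out here.

```python
from collections import defaultdict


def inchikey_from_molblock(molblock):
    """Return the InChIKey for a MOL block, or None if RDKit cannot parse it."""
    # Imported lazily so the grouping helper below works without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromMolBlock(molblock)
    return Chem.MolToInchiKey(mol) if mol is not None else None


def group_by_key(records, key_fn=inchikey_from_molblock):
    """Group source ids by a structure-derived key.

    records: iterable of (source_id, molblock) pairs (hypothetical shape).
    Records for which no key can be computed are skipped.
    """
    groups = defaultdict(list)
    for source_id, molblock in records:
        key = key_fn(molblock)
        if key is not None:
            groups[key].append(source_id)
    return dict(groups)
```

With this shape, entries for the same structure coming from different sources (CAS, ChEMBL, PubChem) would end up under one key and could be merged into a single compound node.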

motey commented 4 years ago

Maybe interesting as well: https://github.com/rdkit/neo4j-rdkit (I did not read it properly, just stumbled upon it and skimmed. Reminder: have a closer look.) @sarmbruster was involved in that too.

sarmbruster commented 4 years ago

There's a lightning talk recording by the author of the plugin, see https://neo4j.com/online-summit/session/rdkit-neo4j-integration. I just acted as a mentor.