Shen-Lab / DeepAffinity

Protein-compound affinity prediction through unified RNN-CNN
GNU General Public License v3.0

DeepAffinity ID and Public ID Mapping Table #7

Open hyojin0912 opened 3 years ago

hyojin0912 commented 3 years ago

Thanks for your great package.

I would like a mapping table from DeepAffinity IDs to public IDs, one for proteins and one for drugs, so that I can interpret DeepAffinity output.

e.g. For Protein (map to UniProt AC): KV37 | P1234
e.g. For Drug (map to PubChem CID): KV37 | 12345

Could you also describe again how you generated the PubChem binary fingerprint for each compound? Input: PubChem CID or SMILES. Output: a 0/1 vector of length 881.

Thanks in advance

Shen-Lab commented 3 years ago

Thank you for your interest in our package. For the mapping between internal DeepAffinity IDs and external public IDs, see the data description (https://github.com/Shen-Lab/DeepAffinity/blob/master/data/README.md): for each compound-protein pair, we give the DeepAffinity Protein ID, UniProt Protein ID, DeepAffinity Compound ID, PubChem CID, and the affinity measurement of the pair. The mapping is not currently provided in the format you asked for, but it can be derived easily from the provided files.
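
A minimal sketch of that conversion, assuming the pair data is a tab-separated file with one row per compound-protein pair. The column names and sample rows below are hypothetical placeholders, not the actual file layout; check data/README.md for the real one.

```python
import csv
import io

# Hypothetical layout: tab-separated, one row per pair. The column names and
# values are made up for illustration only.
SAMPLE = (
    "da_protein_id\tuniprot_id\tda_compound_id\tpubchem_cid\taffinity\n"
    "PROT1\tP1234\tCOMP1\t12345\t7.1\n"
    "PROT1\tP1234\tCOMP2\t67890\t5.3\n"
    "PROT2\tP5678\tCOMP1\t12345\t6.0\n"
)


def build_mappings(handle):
    """Collapse per-pair rows into two ID lookup tables."""
    protein_map = {}   # DeepAffinity protein ID -> UniProt ID
    compound_map = {}  # DeepAffinity compound ID -> PubChem CID
    for row in csv.DictReader(handle, delimiter="\t"):
        protein_map[row["da_protein_id"]] = row["uniprot_id"]
        compound_map[row["da_compound_id"]] = row["pubchem_cid"]
    return protein_map, compound_map


protein_map, compound_map = build_mappings(io.StringIO(SAMPLE))
print(protein_map)
print(compound_map)
```

Each ID appears in many pairs, so the dictionaries simply keep one entry per distinct ID.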

For the 881-bit PubChem fingerprint, I believe that we retrieved it for each compound from PubChem using the compound's CID. @AstroSign Could you please provide some more details? Was it done through some PubChem API, say PubChemPy?

AstroSign commented 3 years ago

To get the compound fingerprints, we downloaded the data in SDF format by CID from https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi. PubChem also provides other APIs, which are listed here: https://pubchemdocs.ncbi.nlm.nih.gov/downloads. In the SDF file, the fingerprint is stored in Base64-encoded form under the tag "PUBCHEM_CACTVS_SUBSKEYS". To understand the meaning of each bit and how to decode it, you may refer to their documentation (https://pubchemdocs.ncbi.nlm.nih.gov/data-specification).
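
As a rough sketch of reading that tag, the function below pulls one data item out of SDF text using only the standard library. It assumes the usual SDF layout (a `> <TAG>` header line, the value on the following lines, a blank line, and `$$$$` closing each record). The function name and the toy record are illustrative, and the Base64 value in it is a placeholder, not a real fingerprint.

```python
def extract_sdf_tag(sdf_text, tag="PUBCHEM_CACTVS_SUBSKEYS"):
    """Return the value of `tag` for each SDF record (None when absent)."""
    values, current, reading = [], None, False
    for line in sdf_text.splitlines():
        stripped = line.strip()
        if stripped == "$$$$":            # end of one compound record
            values.append(current)
            current, reading = None, False
        elif line.startswith(">") and f"<{tag}>" in line:
            current, reading = "", True   # value starts on the next line
        elif reading:
            if stripped:
                current += stripped       # values may span several lines
            else:
                reading = False           # a blank line ends the data item
    return values


# Toy record: the molblock is omitted and the fingerprint is a placeholder.
SAMPLE_SDF = """\
12345
  toy record, molblock omitted

M  END
> <PUBCHEM_COMPOUND_CID>
12345

> <PUBCHEM_CACTVS_SUBSKEYS>
AAADcQ==

$$$$
"""

print(extract_sdf_tag(SAMPLE_SDF))
```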

Thanks.


hyojin0912 commented 3 years ago

Thank you for your comments. I got the PubChem fingerprint from the SDF file. But when I convert the Base64-encoded string to a binary bit vector, its length is 920, not 881. Could you describe how to select the exact fingerprint bits?

For example, Lorlatinib's (PubChem CID=71731823) fingerprint -> Base64 Encoded Fingerprint = AAADceB7sQAAAAAAAAAAAAAAAAAAAWAAAAA8QAAAAAAAAAAB8AAAHwAYAAAADBzhng4/tpNIFAC6Bzd3dASyjCk14CAY2CE/TNiO5vLE9duXvSjkzhPY6a+62KOOgAAAAAAQAAAAAAAAACAAAAAAAAAAAA== Binary Bit Vector = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 1 0 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

AstroSign commented 3 years ago

As I mentioned earlier, you may refer to the last two paragraphs of https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt. As described there, "The fingerprint is, therefore, 111 bytes in length (888 bits), which includes padding of seven bits at the end to complete the last byte. A four-byte prefix, containing the bit length of the fingerprint (881 bits)". So the 920 bits you decoded are the 32-bit prefix, the 881 fingerprint bits, and 7 padding bits: drop the first 32 bits and the last 7 to get the 881-bit vector.
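
A minimal sketch of that decoding, following the quoted specification text: read the four-byte prefix as the bit length, skip it, and keep only that many bits of the payload. The function name is illustrative. The bit order within each byte (most significant bit first) is an assumption here and should be checked against the specification. The demo input is synthetic, built to have the same 4 + 111 byte layout.

```python
import base64


def decode_pubchem_fingerprint(b64_text):
    """Decode a Base64 PubChem fingerprint into a list of 0/1 ints."""
    raw = base64.b64decode(b64_text)
    n_bits = int.from_bytes(raw[:4], "big")  # four-byte prefix: bit length
    payload = raw[4:]                        # fingerprint bytes plus padding
    if n_bits > len(payload) * 8:
        raise ValueError("prefix claims more bits than the payload holds")
    bits = []
    for byte in payload:
        # Assumption: the most significant bit of each byte comes first.
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    return bits[:n_bits]                     # drops the trailing padding bits


# Synthetic example: 881-bit prefix, 111 payload bytes, first and last
# fingerprint bits set.
payload = bytearray(111)
payload[0] = 0x80     # bit 0
payload[110] = 0x80   # bit 880 (the remaining 7 bits of this byte are padding)
encoded = base64.b64encode((881).to_bytes(4, "big") + bytes(payload)).decode()

bits = decode_pubchem_fingerprint(encoded)
print(len(bits), bits[0], bits[880])
```

The same function can be applied to the Base64 string taken from the SDF tag. Reading the length from the prefix rather than hard-coding 881 keeps the slice correct if the stored length ever differs.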

Thanks.
