darkreactions / chemdescriptor

Generic molecular descriptor generator package
MIT License

First column of output CSV file is 'id' with ascending integer values for each instance instead of 'Compound' with SMILES listed #2

Closed. notanotterpun closed this issue 5 years ago.

notanotterpun commented 5 years ago

I see in the examples (and the code) that the first column of the output CSV file should contain the input SMILES, but when I run the program I just get id numbers. I see no way to map them back to the input SMILES, because the output contains fewer instances than were in the input SMILES file. Attached is a screenshot of my output.

[Attached screenshot: Screen Shot 2019-08-06 at 11 48 14 AM]

vshekar commented 5 years ago

@notanotterpun Do you mind sharing the inputs you are using for your output? Also, which version of chemdescriptor are you using? You can get the version by running pip show chemdescriptor.
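For anyone who would rather check the version from inside Python than from the shell, here is a minimal sketch using the standard library's importlib.metadata (Python 3.8+). The helper name installed_version is made up for this example and is not part of chemdescriptor.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing.

    `installed_version` is a hypothetical helper written for this example.
    """
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if chemdescriptor is installed, otherwise None.
print(installed_version("chemdescriptor"))
```

This reads the same package metadata that pip show reports, so the two should agree when run in the same environment.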

notanotterpun commented 5 years ago

@vshekar Thanks for the quick reply! I am using version 0.0.3. I've attached my input SMILES file. Otherwise, I followed the process outlined in the "In Code" section.

Analogues_Hull_SMILES1.txt

vshekar commented 5 years ago

Hi @notanotterpun, I've updated chemdescriptor to version 0.0.5. The issue is that, before generating descriptors, we convert the SMILES to a low-energy conformer SDF representation. Apparently you need a separate license for leconformer to work, so if you don't have that license you get the output you posted.

The leconformer step is now skipped by default. If you want to use it (and have the correct license), add the lec=True keyword arg in the generator function. The output should be correct now.

Also, many of the SMILES in the file you provided are incorrect, so you may want to check that. Let me know if it works and I'll close this issue.

notanotterpun commented 5 years ago

@vshekar I updated to version 0.0.5 and set lec=True in the generate function (I don't know if I have the correct license, but it didn't complain about a license when I ran it), and I am still getting the same output as before. The SMILES in the file I'm using are generated by an artificial neural network. It has a basic SMILES checker, but not a robust one, hence all the weird SMILES.

I then ran it again with this attached list of SMILES that I know are valid, still with lec=True, and this time it raised the Exception 'No columns to parse from file' and output an empty csv file.

test_smiles.txt

notanotterpun commented 5 years ago

@vshekar Update: I ran it again with the new version with the valid test_smiles file, the only difference being that I set lec=False this time. It populated the csv file this time, but still with 'id' instead of 'Compound' in the first column.

vshekar commented 5 years ago

@notanotterpun Unfortunately, I'm not able to reproduce your issue; descriptors are generated correctly for your SMILES file on my end. Please take a look at the attached zip, where you'll find the output for test_smiles.txt and the script I used to generate it. The only other thing I can think of is that we may be working with significantly different versions of cxcalc (I'm running version 19.12.0). Ideally, when you run the script I've provided, you should not see any messages on the console if the program runs correctly.

cag_test_code.zip

notanotterpun commented 5 years ago

@vshekar I ran your script with my smiles file and got the same output as you. So it appears that the issue is arising when I don't use the defaults. I need a different set of properties calculated than what are included in the default dicts, and it seems that overwriting the dictionaries is where the problem arises. That being said, I don't know why that is the case.

vshekar commented 5 years ago

I'm curious as well. Which properties are you trying to calculate? That would help me debug.

notanotterpun commented 5 years ago

No problem. Here they are:

exactmass atomcount molpol tpol asa aromaticatomcount aromaticbondcount aromaticringcount asymmetricatomcount balabanindex carboaromaticringcount carboringcount chiralcentercount cyclomaticnumber dreidingenergy fusedaliphaticringcount fusedaromaticringcount fusedringcount hararyindex hyperwienerindex maximalprojectionarea maximalprojectionradius minimalprojectionarea minimalprojectionradius msa plattindex ringatomcount ringbondcount ringcount psa randicindex szegedindex topanal wienerindex wienerpolarity doublebondstereoisomercount stereoisomercount tautomercount tetrahedralstereoisomercount enumerationcount markushenumerationcount logp logd pka hbda refractivity resonantcount chainbondcount largestringsize rotatablebondcount acceptorcount donorcount

vshekar commented 5 years ago

I think I've fixed it. The issue was that I assumed the number of columns generated by cxcalc is the same as the number of descriptors you are trying to calculate, but apparently that's not true. You have 52 descriptors here and cxcalc generates ~98 columns.

I've updated chemdescriptor to 0.0.6; please check to see if it works. I've attached my copy of the code, in which descriptor_list.json contains the descriptors you listed above.

Also, this code is part of a DARPA-funded project. Do you mind sharing which org you are part of and what you are planning to do with this? Our team would be happy to hear more or help.

cag_test_code2.zip
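The mismatch described above (one requested descriptor can produce several output columns) can be illustrated with a small standard-library sketch. Everything in it is an assumption made for illustration: the sample text only imitates a tab-separated table with a leading id column, the column names and values are made up, and this is not the code chemdescriptor itself uses. The point is only that column labels are safer read from the header row than inferred from the length of the descriptor list.

```python
import csv
import io

# Made-up, tab-separated sample in the style of a descriptor table:
# an 'id' column followed by result columns. One requested descriptor
# ("pka") expands into three columns here, so the column count no longer
# matches the number of requested descriptors.
SAMPLE_OUTPUT = (
    "id\tExact mass\tapKa1\tapKa2\tbpKa1\n"
    "1\t92.06\t\t\t\n"
    "2\t60.02\t4.54\t\t\n"
)

requested = ["exactmass", "pka"]  # two descriptors requested


def read_table(text):
    """Parse tab-separated text into (header, list of row dicts)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = [dict(zip(header, row)) for row in reader]
    return header, rows


header, rows = read_table(SAMPLE_OUTPUT)
result_columns = header[1:]  # everything after the 'id' column

print(len(requested), "descriptors requested")
print(len(result_columns), "result columns:", result_columns)
```

Keying each row by the header names, as read_table does, keeps the parse correct no matter how many columns a single descriptor expands into.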

notanotterpun commented 5 years ago

Now it's working as long as the SMILES are good, but with the files that contain weird SMILES it still makes the id column. Out of curiosity, what do you use to verify SMILES?

I am a graduate student and I am using this for work I'm doing as part of an internship at Pacific Northwest National Lab. The project is for targeted molecule generation by a deep learning neural network based on chemical properties. Clearly we're not where we would like to be yet on the SMILES discriminator end :)

vshekar commented 5 years ago

With regard to SMILES verification, unfortunately that is not within the scope of this package. But you can use rdkit to filter out incorrect SMILES, as described here, before giving them to chemdescriptor.

m = Chem.MolFromSmiles('Cc1ccccc1') sets m to a Mol object on success, or to None if the SMILES is incorrect.
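As a sketch of that filtering step, the helper below splits a list of SMILES strings into those that parse and those that do not. The function name split_valid_smiles and its parse argument are made up for this example. By default it uses RDKit's Chem.MolFromSmiles, imported lazily so that the function can also be exercised with a stand-in parser on machines where RDKit is not installed.

```python
def split_valid_smiles(smiles_list, parse=None):
    """Split SMILES strings into (valid, invalid) lists.

    `parse` is any callable that returns None for a string it cannot parse.
    If omitted, RDKit's Chem.MolFromSmiles is used (requires RDKit).
    This helper is hypothetical and not part of chemdescriptor or RDKit.
    """
    if parse is None:
        from rdkit import Chem  # imported lazily; needs RDKit installed
        parse = Chem.MolFromSmiles

    valid, invalid = [], []
    for smiles in smiles_list:
        smiles = smiles.strip()
        if not smiles:
            continue  # skip blank lines
        if parse(smiles) is not None:
            valid.append(smiles)
        else:
            invalid.append(smiles)
    return valid, invalid


# Example usage with RDKit installed (file name taken from this thread):
#
#     with open("test_smiles.txt") as handle:
#         valid, invalid = split_valid_smiles(handle)
#
# Only the `valid` list would then be passed on to chemdescriptor.
```

Keeping the rejected strings in a separate list makes it easy to see how many generated SMILES were dropped before descriptor calculation.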

Our funding agency will be glad to know that other groups are using this as well. I was also an intern at PNNL a couple of summers ago, at the Richland campus. Good luck with your research!