Number of molecular descriptors obtained with PaDEL differs from the number of molecules in the molecule.smi file

sayalaruano commented 3 years ago

Hello professor, I’m doing EDA and calculating molecular descriptors for the beta-lactamase dataset. I replaced duplicated values with their mean, as you suggested, and kept only the molecules that bind to beta-lactamase AmpC, which leaves a dataset of 62050 molecules. I then followed the instructions from the video description to calculate molecular descriptors with padelpy, but I obtained descriptors for only 5534 molecules, although my molecule.smi file has 62050 molecules. Do you know whether PaDEL has a limit on the number of molecules it can calculate descriptors for, or could this error come from something in my code? This GitHub repo contains my notebook and all files: https://github.com/sayalaruano/MidtermProject-MLZoomCamp. I added the same comment on the YouTube video of the challenge, just in case. Thanks in advance for your help.

wguesdon commented 3 years ago

I obtained only 1412 rows myself, as can be seen here: https://github.com/wguesdon/beta-lactamase/blob/main/Data_Wrangling_and_EDA.ipynb. I wonder if we could apply the padelpy method row by row via a lambda function?
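A rough sketch of what that row-by-row idea could look like. The helper name `descriptors_per_row` and its `compute` argument are hypothetical; the point is only that each SMILES string is handled separately, so one missing or failing entry does not stop the rest.

```python
def descriptors_per_row(smiles_list, compute):
    """Apply `compute` to each SMILES string on its own.

    `compute` is any callable that takes one SMILES string and returns
    its descriptors. Missing or failing entries yield None instead of
    aborting the whole run.
    """
    results = []
    for smi in smiles_list:
        # NaN values read from a DataFrame are floats, not strings.
        if not isinstance(smi, str) or not smi.strip():
            results.append(None)
            continue
        try:
            results.append(compute(smi))
        except Exception:
            # Keep going; the caller can inspect the None entries later.
            results.append(None)
    return results
```

With padelpy, `compute` could be something like `lambda s: from_smiles(s, fingerprints=True, descriptors=False)`, assuming `from_smiles` is imported from `padelpy` and Java is installed. Each call starts PaDEL separately, so this is likely to be very slow for tens of thousands of molecules.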

sayalaruano commented 3 years ago

I just came up with the solution for this error. The mistake was that I had kept some molecules with NaN in the canonical smiles feature in my dataset, so PaDEL only calculated fingerprints for the molecules above the first NaN. Now I will try to calculate the 12 fingerprints for all molecules. I hope I can calculate all of them.
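For anyone hitting the same thing, a minimal sketch of the fix with pandas. The column names `canonical_smiles` and `molecule_chembl_id` and the helper name `write_smi` are assumptions; adjust them to match your own DataFrame.

```python
import pandas as pd


def write_smi(df, path, smiles_col="canonical_smiles", id_col="molecule_chembl_id"):
    """Write a tab-separated .smi file, skipping rows without a SMILES string."""
    smiles = df[smiles_col]
    # Treat both NaN and blank strings as missing.
    has_smiles = smiles.notna() & (smiles.astype(str).str.strip() != "")
    clean = df.loc[has_smiles, [smiles_col, id_col]]
    # No header and no index: one "SMILES<TAB>ID" pair per line.
    clean.to_csv(path, sep="\t", index=False, header=False)
    return clean
```

Calling `write_smi(df, "molecule.smi")` returns the filtered DataFrame, so `len(df) - len(clean)` tells you how many molecules were dropped.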

wguesdon commented 3 years ago

Thank you for sharing, it must have been the same issue for me.

sayalaruano commented 3 years ago

You're welcome @wguesdon, this is the good part of these collaborative projects :)

semsem80 commented 3 years ago

Hello sayalaruano,

I have the same problem. I obtained PubChem molecular descriptors for only 338 molecules, although my molecule.smi file has 64424 molecules.

sayalaruano commented 3 years ago

Hello @semsem80, to solve this error you need to delete the molecules with NaN in the canonical_smile feature. I hope this helps; let me know if it works.
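If you want to confirm that this is the cause before changing anything, here is a small check on the molecule.smi file itself. It assumes the file is tab-separated with the SMILES string in the first column, and that a missing value was written either as an empty field or as the text `nan`; the function name `find_missing_smiles` is just for illustration.

```python
def find_missing_smiles(lines):
    """Return the 1-based numbers of lines whose SMILES field is missing."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # The SMILES string is assumed to be the first tab-separated field.
        smiles = line.rstrip("\n").split("\t")[0].strip()
        if smiles == "" or smiles.lower() == "nan":
            bad.append(number)
    return bad
```

For example, `with open("molecule.smi") as f: print(find_missing_smiles(f)[:5])` shows the first few problem lines. If the first flagged line sits right after the last molecule you got descriptors for, the NaN entries are very likely the reason.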

semsem80 commented 3 years ago

Hi @sayalaruano, your suggested solution worked, thank you for your help.

dataprofessor / beta-lactamase
