PatWalters / rd_filters

A script to run structural alerts using the RDKit and ChEMBL
MIT License

Stopping Early: Large Smi File #10

Closed jacob-r-anderson closed 4 years ago

jacob-r-anderson commented 4 years ago

On a large SMILES file the program seems to end early (< 10% of the way through).

(my-rdkit-env) [me]$ wc -l 01_split.smi
7113315 01_split.sm
(my-rdkit-env) [Me]$ rd_filters filter --in 01_split.smi --prefix 01_filtered --rules rules.json --alerts alert.csv
using 4 cores
Using alerts from Inpharmatica and PAINS
[09:05:36] Explicit valence for atom # 1 N, 5, is greater than permitted
[09:06:02] Conflicting single bond directions around double bond at index 22.
[09:06:02] BondStereo set to STEREONONE and single bond directions set to NONE.
[09:06:42] Conflicting single bond directions around double bond at index 22.
[09:06:42] BondStereo set to STEREONONE and single bond directions set to NONE.
[09:07:08] Conflicting single bond directions around double bond at index 22.
[09:07:08] BondStereo set to STEREONONE and single bond directions set to NONE.
[09:07:56] Conflicting single bond directions around double bond at index 22.
[09:07:56] BondStereo set to STEREONONE and single bond directions set to NONE.
Wrote SMILES for molecules passing filters to 01_filtered.smi
Wrote detailed data to 01_filtered.csv
13281 of 82704 passed filters 16.1%
Elapsed time 197.91 seconds

I looked in the input file at lines above and below 82704 and nothing seems to be awry.

c1csc(c12)CCN([C@@H]2CC)C(=O)NC@Hc3c(C)nn(C)c3 316831704 316831704 - 82703
CCC(CC)C@@HC(=O)Nc(cn(n1)C)c1-c2ccnn2C 319220015 319220015 - 82704
n1cc(O)ccc1CC(=O)N(CC2=O)CCCN2CC 319374292 319374292 - 82705

jacob-r-anderson commented 4 years ago

Closing issue. The input had a mix of two- and three-column lines, and the three-column lines were excluded by this line:

input_data = [x for x in input_data if len(x) == 2]
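To make the effect of that filter concrete, here is a minimal sketch on made-up rows. It assumes each input line has already been split on whitespace into a list of fields; the row contents and the variable names `rows`, `kept`, and `dropped` are hypothetical and not taken from rd_filters.

```python
# Hypothetical rows, split on whitespace the way a SMILES file line might be.
rows = [
    "CCO ethanol".split(),
    "c1ccccc1 benzene benzene".split(),  # redundant third column
    "CC(=O)O acetic_acid".split(),
]

# Same shape of filter as the line quoted above: only two-field rows survive.
kept = [x for x in rows if len(x) == 2]
dropped = len(rows) - len(kept)

print(f"{len(kept)} kept, {dropped} dropped without any warning")
```

Rows with three fields never reach the filtering step, so the final "N of M passed filters" count reflects only the two-column rows.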

Removed the redundant name column with:

awk '{print $1,$2}' input.smi > finput.smi
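For anyone who would rather check the file before running the filter, the following is a rough Python sketch of the same idea. The helper names `column_counts` and `first_two_columns` and the sample lines are hypothetical. Unlike the awk command, this version skips lines with fewer than two fields instead of printing them.

```python
from collections import Counter


def column_counts(lines):
    """Tally how many whitespace-separated fields each non-empty line has."""
    return Counter(len(line.split()) for line in lines if line.strip())


def first_two_columns(lines):
    """Yield 'SMILES name' for each line that has at least two fields."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield f"{fields[0]} {fields[1]}"


# Hypothetical sample standing in for the contents of a .smi file.
sample = [
    "CCO ethanol\n",
    "c1ccccc1 benzene benzene\n",
    "CC(=O)O acetic_acid acetic_acid\n",
]

counts = column_counts(sample)          # how many lines have 2, 3, ... fields
cleaned = list(first_two_columns(sample))
print(dict(counts))
print(cleaned)
```

Since both helpers accept any iterable of lines, an open file handle can be passed in directly. A tally showing more than one field count is a hint that some lines would be dropped by the length check.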