asapdiscovery / asapdiscovery

Toolkit for open antiviral drug discovery by the ASAP Discovery Consortium
https://asapdiscovery.org
MIT License

Some pdbbind structures cause unexpected termination when running "asap-ml build-dataset" #854

Open robby-wang opened 4 months ago

robby-wang commented 4 months ago

Command that I ran:

asap-ml build-dataset schnet \
  --exp-file unknown_error.json \
  --structures '*_complex.pdb' \
  --ds-cache ~/dataset_cache_local.pkl \
  --xtal-regex '(?<=\/)[A-Za-z0-9]{4}(?=_complex)' \
  --cpd-regex '(?<=\/)[A-Za-z0-9]{4}(?=_complex)' \
  --ds-config-cache ~/dataset_config_cache_local.json

Attachments: unknown_error.json, error_complexes.zip (unzip to get the three complex.pdb files).

Description: When building the dataset from PDBbind structures, the run fails with the following error whenever the dataset contains any of the 3 complexes listed below: Fatal: Cannot read molecule

Upon checking, the ligand and protein are parsed successfully for each of these complexes: in asapdiscovery-ml/schema_v2/config.py, Complex.from_pdb was able to load the protein and ligand as input data. The error occurs at some later point.

This issue affected only 3 of the 4606 PDBbind complex.pdb structures: 5vh0, 6eiz, and 6a87. Excluding them from the schema and structure list solved the problem, but it would be interesting to look into why they cause the error.



PS: It was a big hassle to locate these 3 problematic structures among ~5000. The ligand and protein appear to be read successfully, and the error only occurs at a very late stage. I originally thought it was a systematic error in the "build-dataset" script, but a run with 200 structures completed without error. In the end I manually increased the number of structures (300, 400, 500, …, up to 5000) to find the 3 structures that were actually causing the issue.
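
For anyone hitting something similar, the manual search described above could probably be automated with a recursive bisection. This is only a rough sketch, not part of asapdiscovery: `find_failing` and `fake_build_fails` are hypothetical names, and the sketch assumes each bad structure fails on its own (not only in combination with others). In practice the `fails` callback would have to run the real build on the given subset and report whether it crashed; here it is replaced by a stand-in.

```python
from typing import Callable, List, Sequence


def find_failing(
    items: Sequence[str], fails: Callable[[Sequence[str]], bool]
) -> List[str]:
    """Return the items that make `fails` return True, found by bisection.

    Assumption: a subset fails if and only if it contains at least one
    bad item, i.e. bad items fail independently of each other.
    """
    # An empty or passing subset contains no bad items.
    if not items or not fails(items):
        return []
    # A failing subset of size one is itself a bad item.
    if len(items) == 1:
        return [items[0]]
    # Otherwise split in half and search both halves.
    mid = len(items) // 2
    return find_failing(items[:mid], fails) + find_failing(items[mid:], fails)


# Stand-in for a real build (hypothetical): pretend these three IDs break it.
BAD = {"5vh0", "6eiz", "6a87"}


def fake_build_fails(batch: Sequence[str]) -> bool:
    return any(pdb_id in BAD for pdb_id in batch)


if __name__ == "__main__":
    ids = ["1abc", "5vh0", "2xyz", "6eiz", "6a87", "3def"]
    print(find_failing(ids, fake_build_fails))
```

With a handful of bad structures among several thousand, this needs far fewer builds than growing the set by 100 at a time, since every passing half is discarded in one run.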

hmacdope commented 4 months ago

Thanks for this, @robby-wang!

@kaminow we can wrap the reading in a try/except to drop failed reads, but we should also try to figure out what is happening.
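
A minimal sketch of what that try/except wrapper might look like. `load_all` and the `loader` argument are hypothetical and do not reflect the actual asapdiscovery code; the real loader would be whatever currently reads each structure. One caveat: if "Fatal: Cannot read molecule" is emitted by an underlying toolkit that terminates the process instead of raising a Python exception, a try/except will not catch it, and that would need to be checked first.

```python
import logging
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def load_all(
    paths: Iterable[str], loader: Callable[[str], T]
) -> Tuple[List[T], List[str]]:
    """Apply `loader` to each path, skipping and recording the ones that raise.

    `loader` is a placeholder for the real reading function (an assumption).
    """
    loaded: List[T] = []
    failed: List[str] = []
    for path in paths:
        try:
            loaded.append(loader(path))
        except Exception as exc:  # broad on purpose: report and move on
            logger.warning("Skipping %s: %s", path, exc)
            failed.append(path)
    return loaded, failed
```

Returning the list of failed paths alongside the loaded objects means the problem files are reported up front and do not have to be hunted down afterwards.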