biojava / biojava

BioJava is an open-source project dedicated to providing a Java library for processing biological data.
https://biojava.org
GNU Lesser General Public License v2.1

CONECT records multiplied when parsing structures #990

Closed: erikedlund closed this issue 2 years ago

erikedlund commented 2 years ago

Hello,

I've noticed an issue where parsed Structure objects aren't consistently creating CONECT records when exported to PDB format. This appears to be a regression since version 4.2.8. Here's an example with RCSB's 4HHB.

I'd expect these counts to match, but they appear to be multiplied.

4hhb.pdb.txt

PDBFileParser parser = new PDBFileParser();
FileParsingParameters params = new FileParsingParameters();
params.setCreateAtomBonds(true); // should create the bonds we want
parser.setFileParsingParameters(params);
String txt = Files.readString(Path.of("4hhb.pdb.txt")); // readString takes a Path (Java 11+)
System.out.println(String.format("File %s contains %s CONECT records: ", struct.getName(), txt.split("CONECT").length - 1));
Structure s = parser.parsePDBFile(new FileInputStream(struct));
String parsed = s.toPDB();
System.out.println(String.format("Parsed structure contains %s CONECT records: ", parsed.split("CONECT").length - 1));

File 4hhb.pdb.txt contains 180 CONECT records: 
Parsed structure contains 9408 CONECT records: 

edit: fixing counts

aalhossary commented 2 years ago

@erikedlund What exactly is struct?

erikedlund commented 2 years ago

@aalhossary It's a File object pointing to the attached 4hhb.pdb.txt file. Sorry, I intended to add all the necessary code but failed to get the formatting correct, hence all the edits.

aalhossary commented 2 years ago

It does not seem like a problem. A single CONECT record holds all connections an atom has to its neighbor atoms (up to 4 neighbors). Therefore, the total number of bonds is expected to be greater than the number of CONECT lines. Please refer to the documentation here.
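
To make the relationship between CONECT lines and bonds concrete, here is a minimal sketch in plain Java. It makes no BioJava calls, and the class and method names are hypothetical. It assumes the fixed column layout of the PDB format: the source atom serial in columns 7-11, followed by up to four bonded serials in columns 12-16, 17-21, 22-26 and 27-31.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava.
class ConectLine {

    /** Serial number of the atom that the CONECT line describes (columns 7-11). */
    static int sourceSerial(String line) {
        return Integer.parseInt(line.substring(6, 11).trim());
    }

    /** Serial numbers of the atoms bonded to the source atom (columns 12-31). */
    static List<Integer> bondedSerials(String line) {
        List<Integer> bonded = new ArrayList<>();
        // Walk the four 5-character fields; stop at the end of a short line.
        for (int start = 11; start < 31 && start < line.length(); start += 5) {
            int end = Math.min(start + 5, line.length());
            String field = line.substring(start, end).trim();
            if (!field.isEmpty()) {
                bonded.add(Integer.parseInt(field));
            }
        }
        return bonded;
    }
}
```

For a line built with String.format("CONECT%5d%5d%5d%5d", 100, 99, 101, 102), sourceSerial gives 100 and bondedSerials gives the three neighbors 99, 101 and 102, so one line accounts for three bonds.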

erikedlund commented 2 years ago

I see, so BioJava is generating its own exhaustive list of all connections rather than simply passing along the contents of the input? That would account for the difference.

Are duplicates expected then? I see two identical CONECT records for all paired atoms in the example above:

CONECT    1    2
CONECT    1    2

Is that just the mutual connection of 1 to 2 and 2 to 1 being counted twice?

aalhossary commented 2 years ago

Well, after double-checking, yes. This is a full list of ALL connections, including connections between all known connecting atoms in the file. The count is doubled because every connection is counted twice (a -> b, b -> a).
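
The double counting can be checked independently of BioJava. The sketch below, in plain Java with hypothetical names, takes directed (from, to) serial pairs such as those read from CONECT lines. It collapses a -> b and b -> a into a single undirected bond by storing each pair with the smaller serial first. It assumes serial numbers are non-negative.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of BioJava.
class UndirectedBonds {

    /** Counts unique bonds, treating (a, b) and (b, a) as the same bond. */
    static int countUnique(List<int[]> directedPairs) {
        Set<Long> seen = new HashSet<>();
        for (int[] pair : directedPairs) {
            long lo = Math.min(pair[0], pair[1]);
            long hi = Math.max(pair[0], pair[1]);
            // Pack the ordered pair into one 64-bit key (assumes non-negative serials).
            seen.add((lo << 32) | hi);
        }
        return seen.size();
    }
}
```

Given the four directed pairs (1, 2), (2, 1), (2, 3) and (3, 2), countUnique returns 2, which is half the directed count.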

aalhossary commented 2 years ago

@erikedlund if you don't have more comments, could you please close this issue?