douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
https://crates.io/crates/pdbtbx
MIT License
49 stars 12 forks source link

Read option to only parse `ATOM` records #123

Closed y1zhou closed 3 months ago

y1zhou commented 3 months ago

Although the library was designed to work with crystallographic files, many modeling software would also generate files of this format. These files usually only focus on atomic coordinates, and may come with some annotation data that often don't comply with the format specification. For example, Rosetta would generate REMARK 220 fields in the header, which pdbtbx would complain with:

StrictWarning: Remark type number invalid
The remark-type-number is not valid, see wwPDB v3.30 for all valid numbers.

Currently there is support for parsing files with different StrictnessLevels, and I was wondering if it would be possible to add an atoms-only parsing mode where all other information are skipped. This would be similar to running a grep ^ATOM prior to parsing with pdbtbx.

Thanks for the amazing library and I'm open to provide test cases or further details.

douweschulte commented 3 months ago

Depending on what the exact use would be I have two options I see to solve this. Either downgrade the Remark type invalid error to LooseWarning that way these files can be openend with StrictnessLevel::Loose. Another option would be to have the option only_read_atomic_data in the ReadOptions (upcoming for the next release) which would ignore any line not starting with ATOM as you described.

The main decision now has to be if the only_read_atomic_data is useful on its own or if just fixing this one warning is enough to make these class of files work. Do you have an opinion on this?

y1zhou commented 3 months ago

Depending on what the exact use would be I have two options I see to solve this. Either downgrade the Remark type invalid error to LooseWarning that way these files can be openend with StrictnessLevel::Loose.

If the long-term goal is to correctly parse all REMARK records then I don't think we should downgrade it into a LooseWarning. Sticking with the official specs would be easier for future maintenance.

Another option would be to have the option only_read_atomic_data in the ReadOptions (upcoming for the next release) which would ignore any line not starting with ATOM as you described.

You are referring to https://github.com/douweschulte/pdbtbx/pull/120 right? Love the idea!

The main decision now has to be if the only_read_atomic_data is useful on its own or if just fixing this one warning is enough to make these class of files work. Do you have an opinion on this?

IMO having the only_read_atomic_data would be very useful for parsing these model files. If we fix the one warning there could be hundreds of edge cases in the future that also require similar attention.

douweschulte commented 3 months ago

On sticking to the official specs, only the Strict variant actually sticks to the spec, Medium is intended to allow for some out of spec uses that are quite common but still very close to spec and Loose is intended to allow for even more out of spec uses that are somewhat more ambiguous or more out of spec than Medium. For that reason I would like to allow these files in the Loose category, if you could provide an example I could try what needs to be changed to allow this through.

The only_atomic_data concept is I think a different story although it would also solve your problem. Doing this would remove some runtime and memory consumption for anyone not interested in the additional data lines (remarks, database reference, etc). And this would be a way out for any file very far out of spec.

On to the implementation then. Based on the example file I would suggest to do both. Are you interested in working on the code, I am able to coach you through it if you want. If not I think I have the time to do it as well.

y1zhou commented 3 months ago

Sounds good! I can give it a shot and start with converting the warnings to Loose, adding test cases, etc. which should be fairly easy. For only_atomic_data I'll ask for help in the PR if I bump into any troubles.

OWissett commented 3 months ago

Just thought I'd weigh in here. I really like this idea, I feel that lots of the information in PDB files is not needed for many applications, and I have had tons of issues with malformed PDB files, so adding this would be great.