Closed y1zhou closed 6 months ago
Depending on what the exact use would be I have two options I see to solve this. Either downgrade the Remark type invalid error to LooseWarning
that way these files can be openend with StrictnessLevel::Loose
. Another option would be to have the option only_read_atomic_data
in the ReadOptions
(upcoming for the next release) which would ignore any line not starting with ATOM
as you described.
The main decision now has to be if the only_read_atomic_data
is useful on its own or if just fixing this one warning is enough to make these class of files work. Do you have an opinion on this?
Depending on what the exact use would be I have two options I see to solve this. Either downgrade the Remark type invalid error to
LooseWarning
that way these files can be openend withStrictnessLevel::Loose
.
If the long-term goal is to correctly parse all REMARK
records then I don't think we should downgrade it into a LooseWarning
. Sticking with the official specs would be easier for future maintenance.
Another option would be to have the option
only_read_atomic_data
in theReadOptions
(upcoming for the next release) which would ignore any line not starting withATOM
as you described.
You are referring to https://github.com/douweschulte/pdbtbx/pull/120 right? Love the idea!
The main decision now has to be if the
only_read_atomic_data
is useful on its own or if just fixing this one warning is enough to make these class of files work. Do you have an opinion on this?
IMO having the only_read_atomic_data
would be very useful for parsing these model files. If we fix the one warning there could be hundreds of edge cases in the future that also require similar attention.
On sticking to the official specs, only the Strict
variant actually sticks to the spec, Medium
is intended to allow for some out of spec uses that are quite common but still very close to spec and Loose
is intended to allow for even more out of spec uses that are somewhat more ambiguous or more out of spec than Medium
. For that reason I would like to allow these files in the Loose
category, if you could provide an example I could try what needs to be changed to allow this through.
The only_atomic_data
concept is I think a different story although it would also solve your problem. Doing this would remove some runtime and memory consumption for anyone not interested in the additional data lines (remarks, database reference, etc). And this would be a way out for any file very far out of spec.
On to the implementation then. Based on the example file I would suggest to do both. Are you interested in working on the code, I am able to coach you through it if you want. If not I think I have the time to do it as well.
Sounds good! I can give it a shot and start with converting the warnings to Loose
, adding test cases, etc. which should be fairly easy. For only_atomic_data
I'll ask for help in the PR if I bump into any troubles.
Just thought I'd weigh in here. I really like this idea, I feel that lots of the information in PDB files is not needed for many applications, and I have had tons of issues with malformed PDB files, so adding this would be great.
Although the library was designed to work with crystallographic files, many modeling software would also generate files of this format. These files usually only focus on atomic coordinates, and may come with some annotation data that often don't comply with the format specification. For example, Rosetta would generate
REMARK 220
fields in the header, whichpdbtbx
would complain with:Currently there is support for parsing files with different
StrictnessLevel
s, and I was wondering if it would be possible to add an atoms-only parsing mode where all other information are skipped. This would be similar to running agrep ^ATOM
prior to parsing withpdbtbx
.Thanks for the amazing library and I'm open to provide test cases or further details.