Add ReadOptions implementation

y1zhou commented 6 months ago

This is to address #125, where we pass in ReadOptions to the open() functions instead of just StrictnessLevel for future extensibility.

[x] PDB parser with ReadOptions
[x] mmCIF parser with ReadOptions
[x] Route to open_gz when options.decompress is true
[x] Remove hydrogens from the model when options.discard_hydrogens is true
[x] Skip other models when options.only_first_model is true
[x] Only parse ATOM records when options.only_atomic_coords is true

Some things up for discussion:

What is the intention of capitalise_chains? If I recall correctly all uppercase letters, lowercase letters, and numbers are valid chain identifiers.
Can I modify the signatures of open_pdb_raw and open_mmcif_raw? They are public functions but it seems they are only meant to be called by open_pdb and open_mmcif. If we maintain open_pdb_raw_with_options and open_mcif_raw_with_options there's going to be a lot of duplicated code (or we extract common methods into another core function).

douweschulte commented 6 months ago

On your questions:

This capitalises all chain ids before storing them, this was the standard behaviour before Oliver removed this behaviour.
I was thinking of making an open_pdb_raw_with_options as a new functions and then rewiring the original open_pdb_raw as a call to that function. If you think this would be too inconvenient for users we might think about breaking the API.

OWissett commented 6 months ago

I like the idea of keeping the API as simple as possible. I think that if we can have fewer functions for reading a file, that would be better, as I agree we don't want to duplicate code and also want it to be clear for end users to know which to use.

I see that there are two kind of ways to do this:

open_pdb_with_opens(..., ReadOptions)
Using the pattern from OpenOptions in the std library: https://doc.rust-lang.org/std/fs/struct.OpenOptions.html

I was initially wanting to change it to be like option 2, but from a time perspective, I just needed to add the functionality as fast as I could, so I went with 1.

What do you think>

douweschulte commented 6 months ago

I think that the open options would be the more clear way of doing this and allowing further expansion in the future. So if everyone here agree I think we can make this a breaking change and move to make this the default. After this work has been done it might be worthwhile to see if we can trivially add back the original public functions with the same api and mark them as deprecated to be removed in the next breaking change (v0.13 ) so that users have at least one major version to move their stuff over.

y1zhou commented 6 months ago

Yeah I'm already down the rabbit hole of open_*_with_options() so let's try that first. I also like the idea of eventually deprecating the functions without ReadOptions to avoid confusion.

OWissett commented 6 months ago

Sounds good. I theory, we are able to have both the approaches of 1 and 2.

I like using the builder pattern, as it provides a good extensible framework where we can add more options without changing the API, this applies to both approaches, but I think that 2 is more idiomatic

y1zhou commented 6 months ago

Alright most implementations are done with the following exceptions:

Atom-only mode for mmCIF files - I'm not familiar enough with the CIF format to know which fields could be safely skipped.
ReadOptions.format: do we keep this for route 2? Currently the input file type is determined at many different places and it might be nice to have a single source of truth.
ReadOptions.capitalise_chains: seems like it's being used in chain_iter of open_pdb_raw_with_options as a fallback? Not sure how the option should interact with this during parsing. Also for mmCIF files there doesn't seem to be a counterpart for this.

y1zhou commented 6 months ago

It looks like you made a lot of new public methods for opening files.

Now is about time to decide which pub functions to keep and which ones to remove (or deprecate). Here's my two cents:

General

Only keep open(filename) and open_with_options(filename, options).

open would be open_with_options with sane default options. Now that level lives in ReadOptions I would also suggest removing the level argument in open().
open_gz can be called by setting options.decompress in open_with_options.
open_raw does some extra magic in filetype guessing which is not very reliable, e.g. an atom-only PDB file doesn't start with HEADER and cannot be identified (somewhat related #122).
- Users could use open_pdb_raw_with_options and open_mmcif_raw_with_options directly.
- Alternatively, we utilize ReadOptions.format and let this determine which internal function to route to.

PDB and mmCIF

Keep open_pdb(filename), open_pdb_with_options(filename, options), open_pdb_raw_with_options(BufReader, context, options).
Keep open_mmcif(filename), open_mmcif_with_options(filename, options), and open_mmcif_raw_with_options(BufReader, context, options).

On the open_*_raw functions, currently open_pdb_raw takes BufReader but open_mmcif_raw takes &str. I think it makes more sense to just expose a function that takes BufReader as the input can be streamed, and converting a &str in memory to a BufReader is trivial.

To summarize,

Function	Action
`general::open`	Remove `level` argument
`general::open_gz`	Deprecate
`general::open_raw`	Deprecate
`pdb::open_pdb`	Remove `level` argument
`pdb::open_pdb_raw`	Deprecate
`mmcif::open_mmcif`	Remove `level` argument
`mmcif::open_mmcif_bufread`	Deprecate
`mmcif::open_mmcif_bufread_with_options`	Deprecate
`mmcif::open_mmcif_raw`	Deprecate
`mmcif::open_mmcif_raw_with_options`	Take `BufReader` instead of `&str`, and parse line-by-line instead

OWissett commented 6 months ago

I like the sound of those changes.

I agree that we still want to have a public function which takes a BufReader as an input, since people may want to do some more complex tasks than getting the library to handling the opening of a file, e.g., maybe receiving over a socket or something...

douweschulte commented 6 months ago

These changes sound very good indeed. But how does this look on the ReadOptions side? There is a function read which takes an &str right now. I would argue that adding a ReadOptions::read_raw is enough to gain feature parity between the old system and the new system. If that function is added I would say we just expose the ReadOptions functions. But to ease the development for all users I think exposing the exact same api as the old system makes sense, but all of these functions should be marked deprecated.

y1zhou commented 6 months ago

Sounds good. I'll add the functions to ReadOptions as well and see how it looks.

y1zhou commented 6 months ago

Things might be a little messy but the changes we discussed last time should be updated now. Please let me know if I missed anything!

douweschulte commented 6 months ago

To help you out this is the output from cargo-public-api run as follows: cargo public-api diff 0.11.0 on your feature branch:

Removed items from the public API
=================================
-pub fn pdbtbx::open_raw<T: std::io::Read + std::io::Seek>(input: std::io::buffered::bufreader::BufReader<T>, level: pdbtbx::StrictnessLevel) -> core::result::Result<(pdbtbx::PDB, alloc::vec::Vec<pdbtbx::PDBError>), alloc::vec::Vec<pdbtbx::PDBError>>

Changed items in the public API
===============================
-pub fn pdbtbx::open(filename: impl core::convert::AsRef<str>, level: pdbtbx::StrictnessLevel) -> core::result::Result<(pdbtbx::PDB, alloc::vec::Vec<pdbtbx::PDBError>), alloc::vec::Vec<pdbtbx::PDBError>>
+pub fn pdbtbx::open(filename: impl core::convert::AsRef<str>) -> core::result::Result<(pdbtbx::PDB, alloc::vec::Vec<pdbtbx::PDBError>), alloc::vec::Vec<pdbtbx::PDBError>>
-pub fn pdbtbx::open_mmcif(filename: impl core::convert::AsRef<str>, level: pdbtbx::StrictnessLevel) -> core::result::Result<(pdbtbx::PDB, alloc::vec::Vec<pdbtbx::PDBError>), alloc::vec::Vec<pdbtbx::PDBError>>
+pub fn pdbtbx::open_mmcif(filename: impl core::convert::AsRef<str>) -> core::result::Result<(pdbtbx::PDB, alloc::vec::Vec<pdbtbx::PDBError>), alloc::vec::Vec<pdbtbx::PDBError>>
-pub fn pdbtbx::open_pdb(filename: impl core::convert::AsRef<str>, level: pdbtbx::StrictnessLevel) -> core::result::Result<(pdbtbx::PDB, alloc::vec::Vec<pdbtbx::PDBError>), alloc::vec::Vec<pdbtbx::PDBError>>
+pub fn pdbtbx::open_pdb(filename: impl core::convert::AsRef<str>) -> core::result::Result<(pdbtbx::PDB, alloc::vec::Vec<pdbtbx::PDBError>), alloc::vec::Vec<pdbtbx::PDBError>>

Added items to the public API
=============================
(ignored)

For full API compatibility there should be no functions listed in the removed and changed items section. So in this case that means keeping the open_raw and keeping the strictnesslevel parameter in the other open functions. As discussed all open functions we do not intend to support should be marked deprecated.

It looks like the original open_raw definition is not easily reproduced with the ReadOptions as it determined the file type with the first characters being HEADER or data_. So if it is not possible to keep this function we could drop this one already in this release.

y1zhou commented 6 months ago

To help you out this is the output from cargo-public-api run as follows: cargo public-api diff 0.11.0 on your feature branch:

This is pretty handy! Thanks for the pointers.

For full API compatibility there should be no functions listed in the removed and changed items section. So in this case that means keeping the open_raw and keeping the strictnesslevel parameter in the other open functions. As discussed all open functions we do not intend to support should be marked deprecated.

It looks like the original open_raw definition is not easily reproduced with the ReadOptions as it determined the file type with the first characters being HEADER or data_. So if it is not possible to keep this function we could drop this one already in this release.

I agree that open_raw should be dropped and marked as a breaking change. Using the first few characters to determine the file type becomes less reliable especially with the atomic_only modifiers being introduced.

For the other functions, adding back level to open_pdb and open_mmcif makes sense as they are deprecated anyway. For open, what would be the best approach to modify its signatures? In our previous discussions I think we agreed to let open use all default options, and ReadOptions to handle all other cases. Do we still want the old open(filename, level) or just open(filename)?

douweschulte commented 6 months ago

For open lets look at the standard library. There there are very simple 'magical' functions alongside the open options, which is exactly as discussed before. So we should go for that (using open(filename)) and we should try and make it abundantly clear how to convert existing code in the release notes.

y1zhou commented 6 months ago

Sounds good! We have one removed item open_raw and one changed item open. Hopefully it's not too much burden to the end users to migrate existing code to ReadOptions. Looking forward to v0.12.0!

y1zhou commented 5 months ago

Just want to follow up on this PR and see if there's anything to be updated before this can be merged.

douweschulte commented 5 months ago

I propose a merge (I was awaiting a response from @OWissett).

OWissett commented 5 months ago

thanks for merging. Apologies for delay, I was on holiday

douweschulte / pdbtbx

Add ReadOptions implementation #126

General

PDB and mmCIF