New info method for the universe.

AHMED-salah00 commented 1 year ago

Is your feature request related to a problem?

I've noticed that when I use mdanalysis It's kind of time consuming to spend time exploring the universe using many methods.

Describe the solution you'd like

so I think making an info method will be useful Universe.info that prints these information about the universe:

Number of atoms, residues, segments, bonds, angles, dihedrals, impropers.
Periodic boundary conditions.
Number of frames.
Timestep.
Time range.
Trajectory format.

You can also see a prototype implementation for .info here : "PR #4046".

Additional context

What made me think about that is that I'm a data scientist and I always call df.info() in pandas when I import a data set to know some quick info about what I am dealing with.

orbeckst commented 1 year ago

Can you show example output, eg for the Universe(PSF, DCD) and Universe(TPR, XTC) using from MDAnalysisTests.datafiles import PSF, DCD, TPR, XTC?

AHMED-salah00 commented 1 year ago

Example:

input =>>

from MDAnalysisTests.datafiles import PSF, DCD, TPR, XTC
u = Universe(TPR, XTC)
u.info

out =>>


MDAnalysis Universe object
==========================
Number of atoms: 47681
Number of residues: 11302
Number of segments: 3
Number of bonds: 25533
Number of angles: 6123
Number of dihedrals: 7481
Number of impropers: 0
Periodic boundary conditions: [80.017006 80.017006 80.017006 60.       60.       90.      ]
Trajectory information:
  - Number of frames: 10
  - Timestep: 100.000 ps
  - Time range: 0.000 ps to 900.000 ps
  - Trajectory format: <XTCReader /home/ahmed/.local/lib/python3.10/site-packages/MDAnalysisTests/data/adk_oplsaa.xtc with 10 frames of 47681 atoms>

orbeckst commented 1 year ago

What does it look like for PSF, DCD?

AHMED-salah00 commented 1 year ago

input =>>

from MDAnalysisTests.datafiles import PSF, DCD, TPR, XTC
u = Universe(PSF, DCD)
u.info

output =>>


MDAnalysis Universe object
==========================
Number of atoms: 3341
Number of residues: 214
Number of segments: 1
Number of bonds: 3365
Number of angles: 6123
Number of dihedrals: 8921
Number of impropers: 541
Periodic boundary conditions: None
Trajectory information:
  - Number of frames: 98
  - Timestep: 1.000 ps
  - Time range: 1.000 ps to 98.000 ps
  - Trajectory format: <DCDReader /home/ahmed/.local/lib/python3.10/site-packages/MDAnalysisTests/data/adk_dims.dcd with 98 frames of 3341 atoms>

IAlibay commented 1 year ago

What happens here if you have a format without a given set of attributes, like say a XYZ that doesn't set bonds, etc...?

The other question is also - how is the list of attributes decided here? Are we saying that these attributes are the attributes we expect to be canonical? What if someone comes along and wanted to add more attributes? Do we then need to have folks arbitrarily adding their favourite attributes until info becomes an unmanageable method?

AHMED-salah00 commented 1 year ago

These are really good questions.It's surely not always useful to use it ,or won't give you a complete overview for every universe. There are some attributes that should be canonical for most cases, and shouldn't be like adding all the possible attributes to have the best expectations.But we still can manage it to be helpful in all cases.That's why we should discuss about it.

AHMED-salah00 commented 1 year ago

I think this should work now in case of XYZ Universe with this modification:

try:
    print(f"Number of bonds: {len(self.bonds)}")
    print(f"Number of angles: {len(self.angles)}")
    print(f"Number of dihedrals: {len(self.dihedrals)}")
    print(f"Number of impropers: {len(self.impropers)}")
except NoDataError:
    pass

The output should look like this:

MDAnalysis Universe object
==========================
Number of atoms: 1284
Number of residues: 1
Number of segments: 1
Trajectory information:
  - Number of frames: 10
/site-packages/MDAnalysis/coordinates/base.py:721: UserWarning: Reader has no dt information, set to 1.0 ps
  return self.ts.dt
  - Timestep: 1.000 ps
/mdanalysis/package/MDAnalysis/core/universe.py:1467: UserWarning: Reader has no dt information, set to 1.0 ps
  print(f"  - Time range: {self.trajectory[0].time:.3f} ps to {self.trajectory[-1].time:.3f} ps")
  - Time range: 0.000 ps to 9.000 ps
Periodic boundary conditions: None
  - Trajectory format: <XYZReader /site-packages/MDAnalysisTests/data/2r9r-1b.xyz with 10 frames of 1284 atoms>

tylerjereddy commented 1 year ago

Interesting idea

hmacdope commented 1 year ago

I actually think this is a good idea.

richardjgowers commented 1 year ago

Something that would be interesting to add would be a list of the available topology attributes (u._topology.attrs)

orbeckst commented 1 year ago

I am a bit wary for the same reasons as https://github.com/MDAnalysis/mdanalysis/issues/4050#issuecomment-1451091775 .

How do we decide what counts as a canonical information? Do we have some input from users? Is it worthwhile surveying our users, e.g., rank order the following pieces of information that you want to know about a Universe: 1. trajectory length, 2. number of atoms, 3.
How to deal gracefully with missing information?
Is there a way to just go through the list of available information and show everything naturally defined (u._topology.attrs , https://github.com/MDAnalysis/mdanalysis/issues/4050#issuecomment-1452176645 ). Is there a similar approach for trajectory information? I'd like to avoid too much hard-coded print calls....
Does pandas do anything interesting under the hood with DataFrame.info() ? Given that this is the inspiration, look there too.

IAlibay commented 1 year ago

I agree with @orbeckst's comments - if there's a way to avoid a bunch of hardcoded print calls then it seems doable.

AHMED-salah00 commented 1 year ago

@orbeckst I completely understand where you're coming from! I believe an info method. should be as simple as possible. When the user calls it ,he probably don't know anything about the Universe yet. In pandas ,honestly I don't know how is it implemented yet(I'll look there anyway) but I know when to use it and what for!

In my opinion Canonical information are those which we need to know every time we start to use a specific Universe.They are essential to have a top view on a universe whatever kind of data it has.I don't see a need for surveying our users, otherwise it's a useless method ,it won't facilitate anything then.

I don't understand what data exactly, do you mean like whole missing attributes ? hemm.. we can check for them using hasattr() e.g.:

if hasattr(self, 'atoms'):
    print('number of atoms ', )
else:
    print("No atom information available")

for trajectory info:

if not  self.trajectory:
    print('trajectory info ', ...)
else:
    print("NO trajectory info")

I honestly don't think it's the most robust way to list all the info list u._topology.attrs.

I'd like to avoid too much hard-coded print calls....

Sure, but how are we gonna view the info? @IAlibay any suggestions?

IAlibay commented 1 year ago

Sure, but how are we gonna view the info?

I'm not sure I understand the question here, could you expand on what you mean here?

On our end, what @orbeckst and I mean here is that if we're generating a list of attributes that get printed out to the users, this shouldn't be just a long list of print statements but something that programmatically interrogates and outputs the contents of the Universe object. We aren't saying "no print", but rather "print but not as a hardcoded list".

we can check for them using hasattr()

and

hey are essential to have a top view on a universe whatever kind of data it has

What we are trying to get to here is that Universes have the ability to have arbitrary numbers of data associated with it (especially when it comes to attributes). You also have the added complexity that in some cases some attributes might get populated but are simply just a list of empty strings (see elements for a case where that might happen). This is part of the reasoning behind why we are saying the above re: not a simple list of print statements.

AHMED-salah00 commented 1 year ago

I tried to make the code more robust and flexible and not a list of hard coded prints:

    def info(self):
        """
        Print information about the Universe object.

        - Number of atoms, residues, segments, bonds, angles, dihedrals, impropers.
        - Periodic boundary conditions.
        - Number of frames.
        - Timestep.
        - Time range.
        - Trajectory format.

        """
        print("MDAnalysis Universe object")
        print("==========================")

        # Defined a list of tuples containing the name of the property and the corresponding value to be printed

        # Basic information
        properties = [("- Number of atoms", self.atoms.n_atoms),
                      ("- Number of residues", self.residues.n_residues),
                      ("- Number of segments", self.segments.n_segments),
                      ("- Periodic boundary conditions", self.dimensions)]
        # topology information
        try:
            properties += [("- Number of bonds", len(self.bonds)),
                           ("- Number of angles", len(self.angles)),
                           ("- Number of dihedrals", len(self.dihedrals)),
                           ("- Number of impropers", len(self.impropers))]
        except NoDataError:
            pass

        # Print all properties using a for loop
        for property_name, property_value in properties:
            print(f"{property_name}: {property_value}")

        print("Trajectory information:")
        print(f" - Number of frames: {self.trajectory.n_frames}")
        print(f" - Timestep: {self.trajectory.dt:.3f} ps")

        # Check if the trajectory has time information
        if len(self.trajectory) > 0:
            print(f"Time range: {self.trajectory[0].time:.3f} ps to {self.trajectory[-1].time:.3f} ps")
        else:
            print("  No time information available")

        print(f"  Trajectory format: {self.trajectory}")

orbeckst commented 1 year ago

@AHMED-salah00 getting a new feature into MDAnalysis is not an easy task because we have to weigh very carefully the benefits against the costs. I am saying this so that you understand why I am currently not reviewing your PR #4046 and instead keep discussing here until it's clear that (1) we want the feature and (2) we are convinced that your overall approach is correct.

Ultimately, there need to be enough @MDAnalysis/coredevs who think that this feature is worthwhile and you need to come up with answers to our questions and concerns.

AHMED-salah00 commented 1 year ago

It's surely not an easy task! I appreciate your concerns, and I will happily try as hard as I can to answer your questions.

AHMED-salah00 commented 1 year ago

I have looked for the DataFrame.info()method in pandas library and found out some interesting facts about how it works:

DataFrame.info(): This is the main function that prints information about a DataFrame. It creates a DataFrameInfo object, which stores information about the DataFrame.
DataFrameInfo: Is a class which is initialized with given dataframe and memory using options.
DataFrameInfo.render(): This method renders the information stored in a DataFrameInfo object, using a DataFrameInfoPrinter object to format the output.
DataFrameInfoPrinter: This is a class initialized with maximum number of columns to display, verbosity, and whether to show null value counts.

So when DataFrame.info() is called, it creates a DataFrameInfo object with the necessary information about the DataFrame, and then calls DataFrameInfo.render() to print the formatted output. The DataFrameInfo.render() method uses a DataFrameInfoPrinter object to format the output, and then writes the output to a buffer using the DataFrameInfoPrinter.to_buffer()method.

orbeckst commented 1 year ago

Good that you looked at pandas.

Here are some follow-up questions on the discussion above.

Can you make a list (as text) of the information that you think that's important to convey and why? Structure the list by topology information, trajectory information, and other (maybe people should know about the memory consumption??). You don't need to know at this stage how get the information but just state what you think should be present and why it's useful.

In my opinion Canonical information are those which we need to know every time we start to use a specific Universe.They are essential to have a top view on a universe whatever kind of data it has.I don't see a need for surveying our users, otherwise it's a useless method ,it won't facilitate anything then.

My point is that you are currently choosing what is displayed. You are only one user. Do you know what is important for our thousands of users?

On our end, what @orbeckst and I mean here is that if we're generating a list of attributes that get printed out to the users, this shouldn't be just a long list of print statements but something that programmatically interrogates and outputs the contents of the Universe object.

A simple version is a list of attributes that is queried and values obtained with getattr(Universe, attrname). Probably beter and more scalable is code that directly looks into the topology object and checks which attributes are registered (you'd need to dive into how our topology system works, especially the introduction to the topology system in the User Guide).

Similarly, look at the coordinates/trajectory reader part of the code to see if there's a generic way to get information. For Readers, we define a whole bunch of attributes as mandatory (see Trajectory API) so that might be a good starting point.

AHMED-salah00 commented 1 year ago

@AHMED-salah00 can you summarize the current proposal, please, i.e., and update #4050 (comment) taking the current discussion into account.

We then have a basis to ask users on the mailing list and on discord for input.

Here's the latest update of the proposed Universe.info() method, for the MDAnalysis library. This new feature aims to provide users with quick information about a Universe object. It will make things easier for users to understand the characteristics of the universe without the need to navigate through methods.

The information displayed by info() can be divided into three categories;

Topology Information; This includes details such as the number of atoms, which's crucial for understanding the size of the universe and its impact on calculations performance. Additionally it provides information about the number of residues and segments giving insights into the structure of the universe. The method also lists topology attributes (u._topology.attrs). Provides details on any available data in the universe. If applicable it mentions the number of chains in the system, which helps in understanding its complexity and differentiating between parts of the molecule. Lastly it indicates that file format used to construct the universe aiding users comprehension of its structure.
Trajectory Information; This category covers details like the number of frames in a trajectory and information about timesteps. It gives users an idea, about how a trajectory lasts and what indexing optionsre available. Trajectory format; Provides details regarding the file format utilized for trajectories.
Additional Details (Optional); Memory consumption information; This feature offers insights, into memory usage valuable when considering parallel processing. It can be included as a feature or, in an update.

Regarding the display of topology attributes, it is suggested to consider descriptive statistics (similar to DataFrame.info()) or provide sets for string attributes and averages for numerical ones. Also, the info() method should be flexible enough to handle different ways of creating a universe, including situations where a universe is created without using a file, in-memory data structures, fetch_mmtf().

At this stage, the proposal is being discussed, and further input from MDAnalysis users through surveys, mailing lists, and the Discord channel will be sought to refine the method's final implementation. Once the final details are decided, the proposed Universe.info() method will be integrated into the library as an enhancement to improve the user experience when exploring the universe using MDAnalysis.

richardjgowers commented 1 year ago

@AHMED-salah00 memory footprint is very interesting and useful (especially when thinking of parallelism). I wouldn't actually know exactly what we do use currently. I think it might be complicated, so maybe it could be added as a second PR on the .info() method

orbeckst commented 1 year ago

You seem to have general support to go forward with your idea of an info() method.

Overall your list https://github.com/MDAnalysis/mdanalysis/issues/4050#issuecomment-1460299119 looks sensible to me. Some comments:

"chains" are not always available, though. Anything in https://userguide.mdanalysis.org/stable/universe.html#universe-properties-and-methods is present and it makes sense to report it.

Available topology attributes should be taken from u._topology.attrs. It might be sufficient to just list the ones currently available. (Alternatively, one could report averages for numerical ones and a set for strings but any of this could wait until its clearer how people really want to use info()).

For topology and trajectory files I would also want to see the path + filename. However, how are you going to deal with in-memory universes (MemoryReader) and concatenated trajectories (ChainReader)? What about universes that were created with fetch_mmtf ?

I would also add information about any active on-the-fly transformations.

AHMED-salah00 commented 1 year ago

(Alternatively, one could report averages for numerical ones and a set for strings but any of this could wait until its clearer how people really want to use info()).

What is the approach that followed to know that? I am really interested in knowing more about that. Are we going to make a survey or something?

It's a good idea to include the method of construction for each Universe whether the normal way (using imported topology file etc..), on the fly, or using fetch_mmtf() https://docs.mdanalysis.org/2.0.0/documentation_pages/coordinates/MMTF.html#classes or another way.I am not sure if I get it right but probably if we are going to deal with a universe it wont be reasonable to deal with each universe(which were built differently) in a different way as they have the same type and probably the same data! Or do they really have different characteristics?

orbeckst commented 1 year ago

I am not sure what the best way is to decide what information to display. I was thinking about descriptive statistics (similar to DataFrame.info()) and what I often do, such as set(u.residues.resnames) — but that's just me, so a survey might be useful.

Filenames are pretty important to me and I think most people would see this as useful information (check what repr(u.trajectory) or repr(u._topology), ... currently give. I am just pointing out that you should not assume that a universe is always created from a file. We can also create a universe from an empty data structure (Universe.empty()) and by other means. The info() method should be flexible enough to handle these different situations.

orbeckst commented 1 year ago

@AHMED-salah00 can you summarize the current proposal, please, i.e., and update https://github.com/MDAnalysis/mdanalysis/issues/4050#issuecomment-1460299119 taking the current discussion into account.

We then have a basis to ask users on the mailing list and on discord for input.

orbeckst commented 1 month ago

There's general support for Universe.info() and the current idea from https://github.com/MDAnalysis/mdanalysis/issues/4050#issuecomment-1460299119 is summarized here:

This new feature aims to provide users with quick information about a Universe object. It will make things easier for users to understand the characteristics of the Universe without the need to navigate through methods.

Information

The information displayed by info() can be divided into three categories;

Topology Information: This includes details such as
- the number of atoms, which is crucial for understanding the size of the universe and its impact on calculations performance.
- Additionally it provides information about the number of residues and segments giving insights into the structure of the universe.
- The method also lists topology attributes (u._topology.attrs). Provides details on any available data in the universe.
Regarding the display of topology attributes, it is suggested to consider descriptive statistics (similar to DataFrame.info()) or provide sets for string attributes and averages for numerical ones.

If applicable it mentions the number of chains in the system, which helps in understanding its complexity and differentiating between parts of the molecule.
- file format
- topology file name (or e.g. URL for streaming or a placeholder for MEMORY)
Trajectory Information: It gives users an idea, about how a trajectory lasts and what indexing options are available. This category covers details like
- the number of frames in a trajectory and
- information about timesteps.
- trajectory file format
- trajectory file name(s) (or e.g. URL for streaming or a placeholder for MEMORY)
Additional Details (Optional)
- Memory consumption information; This feature offers insights, into memory usage valuable when considering parallel processing. It can be included as a feature or, in an update.
- Active transformations.
- Active AuxReader.

Implementation considerations

Access to data should be done programmatically, see, e.g., https://github.com/MDAnalysis/mdanalysis/pull/4046#pullrequestreview-1784996690 and the comments https://github.com/MDAnalysis/mdanalysis/issues/4050#issuecomment-1453908206 and https://github.com/MDAnalysis/mdanalysis/issues/4050#issuecomment-1452529706
The info() method should be flexible enough to handle different ways of creating a universe, including situations where a universe is created without using a file, in-memory data structures, streaming.
The implementation could follow a similar pattern to DataFrameInfo https://github.com/MDAnalysis/mdanalysis/issues/4050#issuecomment-1453904669 to separate the information gathering, processing, and rendering from the Universe class.

Other comments

For a demo MDAKit we recently wrote a primitive info-like function in https://github.com/MDAnalysis/mdgeomkit/blob/main/mdgeom/info.py . We used prettytable for printing tables of multiple universes and even though here we don't look at multiple Universes, we could use prettytable (or something similar) as a pretty-printer for info() to programmatically create a list of (variable, datatype, guessed, value, descriptive statistics) and then write out a table (descriptive statistics is a summary quantity for topology attributes that apply to all atoms and may be the number unique elements for int, str, object attributes or mean±stdev for floats).

richardjgowers commented 1 month ago

in terms of implementation, it might make sense to devolve generating the info, so make the Universe.info() method call Reader._get_info() -> dict[str, str] and Topology._get_info() -> dict[str, str] (and maybe AuxReader._get_info() etc), where these dicts map a title/description to the value, e.g. "Filename": "traj.xtc". Then the actual content of the Universe.info() method is nicely formatting that data. This allows each "leg" of a Universe to describe itself, and makes future Readers have flexibility in what they return (e.g. a future URL stream reader isn't restricted by what Universe knows about it).

laksh-krishna-sharma commented 3 weeks ago

May I work on this issue with a little bit of your guidance?

MDAnalysis / mdanalysis