MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.31k stars 648 forks source link

`guess_format` doesn't work with `h5py.File` objects #2884

Open edisj opened 4 years ago

edisj commented 4 years ago

Is your feature request related to a problem?

I'm trying to pass an h5py.File object into mda.Universe using H5MDReader (#2787), but have been running into funky errors and I think the issue is that h5py.File attributes aren't suited for MDAnalysis.lib.util.isstream and MDAnalysis.lib.util.iterable, due to h5py files not passing the isstream method and being classified as "iterable" due to having a len(h5py.File)


If I don't pass format="H5MD:

u = mda.Universe(TPR_xvf, h5py.File(H5MD_xvf, 'r'))

MDAnalysis.lib.util.guess_format() looks at isstream(h5py.File) and iterable(h5py.File, however, these return unexpected values for an h5py.File:

image which makes guess_format think there are multiple trajectories that get passed to MDAnalysis.coordinates.chain.py which THEN gets passed to MDAnalysis.coordinates.core.py which fails and gives ValueError: Unknown coordinate trajectory format '' for 'h5md'., so it looks like it did try to iterate through the groups in the h5py.File: image


If I do pass `format="H5MD":

u = mda.Universe(TPR_xvf, h5py.File(H5MD_xvf, 'r'), format="H5MD")

Again, it ends up calling MDAnalysis.coordinates.chain.py, but then passes it to H5MDReader where it fails with OSError: Unable to open file (unable to open file: name = 'h5md', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0) . So, again, it looks like it tried to load the first group 'h5md' from the file as an individual trajectory

Describe the solution you'd like

I'd like h5py files to also pass the isstream method, even though it doesn't fit the criteria.

Describe alternatives you've considered

I've tried doing something like adding an if clause at the start of MDAnalysis.lib.util.guess_format with something like

import h5py
if isinstance(filename, h5py.File):
    format = format_from_filename_extension(filename.filename)

but the same issues occur, so I'm not sure what to do to fix it.

Additional context

richardjgowers commented 4 years ago

try adding a function like this : https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/coordinates/chemfiles.py#L121

lilyminium commented 4 years ago

It looks like the problem is that it's a false positive for ChainReader, because the format hint for that one checks if the file is iterable. Not sure the best way around this; maybe a special case for h5py in the ChainReader format hint? Alternatively, the iterable function itself

richardjgowers commented 4 years ago

The format hint thing takes priority over chainreader I think. It’s also how numpy arrays don’t fall into it

On Tue, Aug 4, 2020 at 12:27, Lily Wang notifications@github.com wrote:

It looks like the problem is that it's a false positive for ChainReader, because the format hint for that one checks if the file is iterable. Not sure the best way around this; maybe a special case for h5py in the ChainReader format hint?

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2884#issuecomment-668540339, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGBYUMQMLSDOJQRNYMT3R67WCXANCNFSM4PUCIJ7A .

lilyminium commented 4 years ago

The format hint thing takes priority over chainreader I think. It’s also how numpy arrays don’t fall into it

Not currently, and you wouldn't want it to anyway (although this would have been an easy solution), because what if you have an iterable of chemfiles objects? ChainReader ignores ndarrays specifically:

    @staticmethod
    def _format_hint(thing):
        """Can ChainReader read the object *thing*

        .. versionadded:: 1.0.0
        """
        return (not isinstance(thing, np.ndarray) and
                util.iterable(thing) and
                not util.isstream(thing))
richardjgowers commented 4 years ago

Ah my mistake then, I thought chainreader was in the guess_format thing that happens once all the hints are done. Hmm. I think ChainReader should only happen once all other formats have had a chance to have a go. So maybe ChainReader shouldn't use a _format_hint to avoid this.

edisj commented 4 years ago

@richardjgowers @lilyminium thanks for the replies. I do have a _format_hint defined here in H5MDReader. So is the issue that the h5py.File is being picked up by the ChainReader format hint, which selects ChainReader before it has a chance to be selected by H5MDReader?

richardjgowers commented 4 years ago

Yeah I think chainreader is being a little greedy and we probably should tweak it to be less eager to read anything we pass it.

On Tue, Aug 4, 2020 at 18:27, edisj notifications@github.com wrote:

@richardjgowers https://github.com/richardjgowers @lilyminium https://github.com/lilyminium thanks for the replies. I do have a _format_hint is defined here https://github.com/MDAnalysis/mdanalysis/blob/840a977294311555232114fe998163ce53a88716/package/MDAnalysis/coordinates/H5MD.py#L504-L509 in H5MDReader. So is the issue that the h5py.File is being picked up by the ChainReader format hint, which selects ChainReader before it has a chance to be selected by H5MDReader?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2884#issuecomment-668726353, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB2GK23DQOMSEUPEHN3R7BAITANCNFSM4PUCIJ7A .

orbeckst commented 4 years ago

Should we make ChainReader a special case in the format analysis?

Or should we add a "priority" attribute to each format hint along the lines that everything with the same priority can be tested in any order but low priority are guaranteed to be tested after higher priorities? We then just have to agree that ChainReader is, say, 0, and others are higher, say 10. Not sure if other special cases such as numpy arrays need to be higher/lower than the others.

orbeckst commented 6 months ago

@ljwoods2 is this issue related to the guessing issue you're having for zarrtraj?