NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

Can't read from one file then save to another file #668

Open VBaratham opened 6 years ago

VBaratham commented 6 years ago

Bug

Loading an NWBFile from one file and then saving it to another raises ValueError: Can't change container_source once set

Steps to Reproduce

from datetime import datetime
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

if __name__ == '__main__':
    nwb = NWBFile(source='', session_description='', identifier='',
                  session_start_time=datetime.now())
    ts1 = TimeSeries(name='timeseries1', source='source1', data=[4, 5, 6],
                     unit='na1', rate=1.0, starting_time=0.0)
    nwb.add_acquisition(ts1)
    with NWBHDF5IO('test_append.nwb', mode='w') as io:
        io.write(nwb)

    f = NWBHDF5IO('test_append.nwb')
    nwb = f.read()
    ts2 = TimeSeries(name='timeseries2', source='source2', data=[4, 5, 6],
                     unit='na1', rate=1.0, starting_time=0.0)
    nwb.add_acquisition(ts2)
    with NWBHDF5IO('test_append2.nwb', mode='a') as io:
        io.write(nwb)


bendichter commented 5 years ago

@VBaratham

You can write an external link to another object by using the manager argument of NWBHDF5IO. This writes an external link to a TimeSeries object in acquisition:

from datetime import datetime
from pynwb import NWBFile, NWBHDF5IO, TimeSeries, get_manager

manager = get_manager()

nwb = NWBFile(session_description='aa', identifier='aa', session_start_time=datetime.now())
ts1 = TimeSeries(name='timeseries1', data=[4, 5, 6],
                 unit='na1', rate=1.0, starting_time=0.0)
nwb.add_acquisition(ts1)
with NWBHDF5IO('test_append.nwb', mode='w', manager=manager) as io:
    io.write(nwb)

with NWBHDF5IO('test_append.nwb', 'r', manager=manager) as f:
    nwb_r = f.read()
    nwb = NWBFile(session_description='aa', identifier='aa', session_start_time=datetime.now())
    nwb.add_acquisition(nwb_r.acquisition['timeseries1'])
    with NWBHDF5IO('test_append2.nwb', mode='w', manager=manager) as io:
        io.write(nwb)

What you describe (linking an entire NWBFile object) would, at its simplest, look like this:

from datetime import datetime
from pynwb import NWBFile, NWBHDF5IO, get_manager

manager = get_manager()

nwb = NWBFile(session_description='aa', identifier='aa', session_start_time=datetime.now())
with NWBHDF5IO('file1.nwb', mode='w', manager=manager) as io:
    io.write(nwb)
with NWBHDF5IO('file1.nwb', 'r', manager=manager) as f:
    nwb = f.read()
    with NWBHDF5IO('file2.nwb', mode='w', manager=manager) as io:
        io.write(nwb)

This would instruct pynwb to write the entire file as an external link to another file. You are correct that this leads to an error, but is this really what you are trying to do?

VBaratham commented 5 years ago

Hi @bendichter

I think I'm a bit lost. Here's what I want to do:

I have a file called test.nwb. I want to read that in, add some data or something, and then save it out to a file called test2.nwb so that I have two nwb files side by side, the first of which (test.nwb) is missing this added dataset. I don't want the second file to be a link to the first, I want it to have all the data replicated. Is this possible? Or do I have to use some other tool to duplicate the file before adding to one of them (would that work)?

oruebel commented 5 years ago

@VBaratham by default, PyNWB will in this case create external links in test2.nwb to the containers that have previously been written to test.nwb. If you want all the data from test.nwb to be copied to test2.nwb, then simply set link_data=False on write, which changes the default behavior to copy the data rather than link it, i.e.:

with NWBHDF5IO('file2.nwb', mode='w', manager=manager) as io:
    io.write(nwb, link_data=False)

bendichter commented 5 years ago

Thanks for the clarification, Oliver. We should add that to the tutorials.

oruebel commented 5 years ago

@bendichter what do you think is the best place for this? Maybe the modular storage tutorial:

https://pynwb.readthedocs.io/en/stable/tutorials/general/linking_data.html#sphx-glr-tutorials-general-linking-data-py

bendichter commented 5 years ago

Yeah, or maybe in the advanced I/O tutorial.

t-b commented 5 years ago
~/devel/pynwb/docs/gallery/domain (master) $ git diff .
diff --git a/docs/gallery/domain/icephys.py b/docs/gallery/domain/icephys.py
index 965270ed..82a07c3b 100644
--- a/docs/gallery/domain/icephys.py
+++ b/docs/gallery/domain/icephys.py
@@ -150,3 +150,11 @@ elec = ccss.electrode
 # And the device name via :py:meth:`~pynbwb.file.NWBFile.get_device`

 device = nwbfile.get_device('Heka ITC-1600')
+
+io.close()
+
+io = NWBHDF5IO('icephys_example.nwb', 'w')
+nwbfile = io.read()
+nwbfile.add_acquisition(vcs)
+io.write(nwbfile)
+io.close()
~/devel/pynwb/docs/gallery/domain (master) $ python icephys.py
Traceback (most recent call last):
  File "icephys.py", line 157, in <module>
    nwbfile = io.read()
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/utils.py", line 381, in func_call
    return func(self, **parsed['args'])
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/backends/io.py", line 33, in read
    container = self.__manager.construct(f_builder)
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/utils.py", line 381, in func_call
    return func(self, **parsed['args'])
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/build/map.py", line 195, in construct
    result = self.__type_map.construct(builder, self)
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/utils.py", line 381, in func_call
    return func(self, **parsed['args'])
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/build/map.py", line 1561, in construct
    attr_map = self.get_map(builder)
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/utils.py", line 381, in func_call
    return func(self, **parsed['args'])
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/build/map.py", line 1492, in get_map
    container_cls = self.get_cls(obj)
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/utils.py", line 381, in func_call
    return func(self, **parsed['args'])
  File "/home/firma/.local/lib/python2.7/site-packages/hdmf/build/map.py", line 1424, in get_cls
    raise ValueError("No data_type found for builder %s" % builder.path)
ValueError: No data_type found for builder root
~/devel/pynwb/docs/gallery/domain (master) $ git rev-parse HEAD
8f472d6c5b74a72e57073fca4c8100a60dfc16a1
~/devel/pynwb/docs/gallery/domain (master) $

This has nothing to do with external links.

rly commented 5 years ago

@t-b in your example, you are opening the file in write mode but then reading from it, which is not allowed. We should check if that happens and throw a user-friendly error if so.

If you change 'w' to 'r+', then you get the correct error:

Traceback (most recent call last):
  File "icephys.py", line 159, in <module>
    nwbfile.add_acquisition(vcs)
  File "c:\users\ryan\documents\nwb\hdmf\src\hdmf\utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "c:\users\ryan\documents\nwb\pynwb\src\pynwb\file.py", line 617, in add_acquisition
    self._add_acquisition_internal(nwbdata)
  File "c:\users\ryan\documents\nwb\hdmf\src\hdmf\utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "c:\users\ryan\documents\nwb\pynwb\src\pynwb\core.py", line 743, in _func
    raise ValueError(msg)
ValueError: 'vcs' already exists in 'root'
t-b commented 5 years ago

@rly Thanks, granted this was sloppy.

Next try:

~/devel/pynwb (better-validation) $ git diff .
diff --git a/docs/gallery/domain/icephys.py b/docs/gallery/domain/icephys.py
index bfff2ed6..448f529a 100644
--- a/docs/gallery/domain/icephys.py
+++ b/docs/gallery/domain/icephys.py
@@ -150,3 +150,22 @@ elec = ccss.electrode
 # And the device name via :py:meth:`~pynbwb.file.NWBFile.get_device`

 device = nwbfile.get_device('Heka ITC-1600')
+
+io.close()
+
+ioread = NWBHDF5IO('icephys_example.nwb', 'r')
+
+vcs2 = VoltageClampSeries(
+    name='vcs2', data=[0.1, 0.2, 0.3, 0.4, 0.5],
+    unit='A', conversion=1e-12, resolution=np.nan, starting_time=123.6, rate=20e3,
+    electrode=elec, gain=0.02, capacitance_slow=100e-12, resistance_comp_correction=70.0,
+    capacitance_fast=np.nan, resistance_comp_bandwidth=np.nan, resistance_comp_prediction=np.nan,
+    whole_cell_capacitance_comp=np.nan, whole_cell_series_resistance_comp=np.nan)
+
+nwbfile = ioread.read()
+nwbfile.add_acquisition(vcs2)
+
+io = NWBHDF5IO('icephys_example2.nwb', 'w')
+io.write(nwbfile)
+io.close()
+ioread.close()
~/devel/pynwb (better-validation) $ python docs/gallery/domain/icephys.py
Traceback (most recent call last):
  File "docs/gallery/domain/icephys.py", line 169, in <module>
    io.write(nwbfile)
  File "/home/firma/devel/hdmf/src/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/home/firma/devel/hdmf/src/hdmf/backends/hdf5/h5tools.py", line 196, in write
    call_docval_func(super(HDF5IO, self).write, kwargs)
  File "/home/firma/devel/hdmf/src/hdmf/utils.py", line 281, in call_docval_func
    return func(*fargs, **fkwargs)
  File "/home/firma/devel/hdmf/src/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/home/firma/devel/hdmf/src/hdmf/backends/io.py", line 39, in write
    f_builder = self.__manager.build(container, source=self.__source)
  File "/home/firma/devel/hdmf/src/hdmf/utils.py", line 388, in func_call
    return func(self, **parsed['args'])
  File "/home/firma/devel/hdmf/src/hdmf/build/map.py", line 158, in build
    raise ValueError("Can't change container_source once set")
ValueError: Can't change container_source once set
~/devel/pynwb (better-validation) $
luiztauffer commented 5 years ago

@oruebel @bendichter The problem seems to persist, even with link_data=False. Here's a reproducible error:

from datetime import datetime
from dateutil.tz import tzlocal
from pynwb import NWBFile, NWBHDF5IO, get_manager, TimeSeries
import nwbext_ecog
import numpy as np
import os

manager = get_manager()

# Creates file 1
nwb_1 = NWBFile(session_description='session', identifier='1', session_start_time=datetime.now(tzlocal()))
# Add signal
X_data = np.zeros((10,1000))
timestamps = list(range(1000))
test_ts = TimeSeries(name='test_timeseries', data=X_data, unit='s', timestamps=timestamps)
nwb_1.add_acquisition(test_ts)
with NWBHDF5IO('file_1.nwb', mode='w') as io:
    io.write(nwb_1)

# Open file 1
with NWBHDF5IO('file_1.nwb', 'r', manager=manager) as io_1:
    nwb_1 = io_1.read()
    # Open file 2
    with NWBHDF5IO('file_2.nwb', mode='w', manager=manager) as io_2:
        nwb_2 = NWBFile(session_description='session', identifier='2', session_start_time=datetime.now(tzlocal()))
        for aux in nwb_1.acquisition:      # Acquisition
            nwb_2.add_acquisition(nwb_1.get_acquisition(aux))
        io_2.write(nwb_2, link_data=False)
        print(nwb_2.acquisition['test_timeseries'])

file_2.nwb was created and apparently has something in the acquisition field. However, the file sizes are different, and when I delete file_1.nwb, reading from file_2.nwb raises an error:

# Open file 2 after deleting file 1
with NWBHDF5IO('file_2.nwb', 'r', manager=manager) as io:
    nwb_2 = io.read()
    print(nwb_2.acquisition['test_timeseries'])
AttributeError: 'NoneType' object has no attribute 'name'
rly commented 5 years ago

Re: @oruebel's comment earlier about adding link_data=False: that flag applies only to datasets. We do not support copying selected containers just yet, so yes, @luiztauffer's example fails because the 'test_timeseries' TimeSeries is a link, which breaks after file 1 is deleted.

https://github.com/hdmf-dev/hdmf/pull/108 will be able to help with that, but this will take a little time to tackle.

FYI: a full copy of a file is supported by HDF5IO.copy_file.

NileGraddis commented 4 years ago

I am running into the same set of problems (container source, default linking) while attempting to carry out the same read -> modify -> write separately workflow. As a workaround, I can:

  1. read the file into a BytesIO
  2. make an h5py.File from the BytesIO
  3. make an NWBDHF5IO from the h5py.File
  4. read and modify the NWBFile
  5. traverse the NWBFile, setting container_source to the repr of the BytesIO (:radioactive:)
  6. write and close everything in reverse order

This is pretty gross. Is there a better way? What is the actual purpose of container_source if it is being reset on each write?

If there is not a better way currently, when do you think these issues could be addressed? We are using this functionality (via workaround) to process data we hope to release in June and it would be great if we could have clean code in place to process those data.

oruebel commented 4 years ago

@NileGraddis would this https://pynwb.readthedocs.io/en/stable/tutorials/general/scratch.html#copying-an-nwb-file workflow work for you?

NileGraddis commented 4 years ago

@oruebel

Thanks for the quick reply! I think that is almost there, except that I would like the copy to be deep rather than shallow - we're using this in a processing pipeline which incrementally adds to an nwb file which ought then to stand alone.

oruebel commented 4 years ago

@NileGraddis if you want a full physical copy of the file, then you could just copy the file first and then open the copy. HDF5IO.copy_file can do this (which is useful if you need to resolve external links), or you could use standard Python

from shutil import copyfile
copyfile(src, dst)

to copy the file.

NileGraddis commented 4 years ago

@oruebel

Copying the file is certainly another workaround, though it requires that the intended output is an h5 file on disk, which might not be true in the future!

I suppose that the main concern I have about my workaround is container_source - I'm not thrilled to rely on updating a private attribute, which might be changed in the future! I am also a bit confused about the purpose of container_source. It seems like it might represent:

  1. the original source (provenance tracking!) of the container. But then why does NWBHDF5IO try (and fail) to reset it on write?
  2. some information about the "ancestral" source of a container (like: "I am working with data from a bunch of files, which one did this container come from"). But this is redundant with parent and doesn't explain why it is read-only.

What is the intent behind container_source? Would it make sense to make a simple modification to the NWBDHF5IO that would resolve this issue, like only trying to set container_source if it is not already set?

kir0ul commented 4 years ago

Hi, I have the same use case: I would like to copy a few data interfaces (shown below) from an NWB file produced by Suite2p into another main NWB file.

ipdb> nwb2p.processing                                                                                                                                                              
{'ophys': ophys pynwb.base.ProcessingModule at 0x140110771702800
Fields:
  data_interfaces: {
    Backgrounds_0 <class 'pynwb.base.Images'>,
    Backgrounds_1 <class 'pynwb.base.Images'>,
    Backgrounds_2 <class 'pynwb.base.Images'>,
    Deconvolved <class 'pynwb.ophys.Fluorescence'>,
    Fluorescence <class 'pynwb.ophys.Fluorescence'>,
    ImageSegmentation <class 'pynwb.ophys.ImageSegmentation'>,
    Neuropil <class 'pynwb.ophys.Fluorescence'>
  }
  description: optical physiology processed data
}

As long as the NWB object is in memory, everything is fine, but when I try to write it to disk, I get the same error, with the container_source referencing the old NWB file. Also, the problem with copying as suggested here is that I only want some parts of the file. For now I ended up modifying the _AbstractContainer__container_source attribute of each data interface, which is probably not the way to go.

oruebel commented 4 years ago

https://github.com/hdmf-dev/hdmf/pull/326 should address this issue

kir0ul commented 4 years ago

It seems that this has been fixed, hasn't it? I can find an export method in the HDMF docs, but I can't find any NWBHDF5IO.export() as written here.

rly commented 4 years ago

If you use the latest PyNWB with the latest HDMF, then you can call the export method on NWBHDF5IO. Technically, conda and PyPI will say that the latest PyNWB is not compatible with the latest HDMF. This will be fixed in the next release, so I'll close this issue then.

Removal of objects from an in-memory NWBFile before exporting is not supported in PyNWB yet, but will be supported in an upcoming release.