AllenInstitute / AllenSDK

code for reading and processing Allen Institute for Brain Science data
https://allensdk.readthedocs.io/en/latest/

`JSONDecodeError: Extra data: line 12 column 2 (char 2984)` when downloading NWB files for VisualBehavior #2165

Open dougollerenshaw opened 3 years ago

dougollerenshaw commented 3 years ago

Describe the bug: I'm getting `JSONDecodeError: Extra data: line 12 column 2 (char 2984)` when attempting to load NWB files from the VisualBehavior cache.

This occurred after starting a process to download all NWB files to a local directory. I set out to do this so that I could run a summary analysis across all experiments, and I wanted to use the NWB files to ensure that my results would be consistent with what external users would get when doing the same analysis.

To Reproduce: I did the following to start downloading all BehaviorOphysExperiment NWB files using 16 cores on my local machine:

import allensdk.brain_observatory.behavior.behavior_project_cache as bpc
from multiprocessing import Pool

data_storage_directory = "/allen/programs/braintv/workgroups/nc-ophys/visual_behavior/production_cache/" # Note: this path must exist on your local drive
cache = bpc.VisualBehaviorOphysProjectCache.from_s3_cache(cache_dir=data_storage_directory)

experiment_table = cache.get_ophys_experiment_table()
oeids = experiment_table.index.values

def open_experiment(oeid):
    print('oeid = {}'.format(oeid))
    cache.get_behavior_ophys_experiment(oeid)

with Pool(16) as pool:
    pool.map(open_experiment, oeids)

Expected behavior: I expected this process to take some number of hours to complete. At the end, I expected all NWB files to be in the data_storage_directory defined above.

Actual behavior: The process started running as expected. After approximately 20 NWB files had been downloaded, I got the following error:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-18-8199a0378505> in <module>
      1 oeid = 993891850
----> 2 cache.get_behavior_ophys_experiment(oeid)

~/anaconda3/envs/vba/lib/python3.8/site-packages/allensdk/brain_observatory/behavior/behavior_project_cache/behavior_project_cache.py in get_behavior_ophys_experiment(self, ophys_experiment_id, fixed)
    515         fetch_session = partial(self.fetch_api.get_behavior_ophys_experiment,
    516                                 ophys_experiment_id)
--> 517         return call_caching(
    518             fetch_session,
    519             lambda x: x,  # not writing anything

~/anaconda3/envs/vba/lib/python3.8/site-packages/allensdk/api/warehouse_cache/caching_utilities.py in call_caching(fetch, write, read, pre_write, cleanup, lazy, num_tries, failure_message)
     94         if not lazy or read is None:
     95             logger.info("Fetching data from remote")
---> 96             data = fetch()
     97             if pre_write is not None:
     98                 data = pre_write(data)

~/anaconda3/envs/vba/lib/python3.8/site-packages/allensdk/brain_observatory/behavior/behavior_project_cache/project_apis/data_io/behavior_project_cloud_api.py in get_behavior_ophys_experiment(self, ophys_experiment_id)
    253                                f" there are {row.shape[0]} entries.")
    254         file_id = str(int(row[self.cache.file_id_column]))
--> 255         data_path = self._get_data_path(file_id=file_id)
    256         return BehaviorOphysExperiment.from_nwb_path(str(data_path))
    257 

~/anaconda3/envs/vba/lib/python3.8/site-packages/allensdk/brain_observatory/behavior/behavior_project_cache/project_apis/data_io/behavior_project_cloud_api.py in _get_data_path(self, file_id)
    347             data_path = self._get_local_path(file_id=file_id)
    348         else:
--> 349             data_path = self.cache.download_data(file_id=file_id)
    350         return data_path
    351 

~/anaconda3/envs/vba/lib/python3.8/site-packages/allensdk/api/cloud_cache/cloud_cache.py in download_data(self, file_id)
    621             If the file cannot be downloaded
    622         """
--> 623         super_attributes = self.data_path(file_id)
    624         file_attributes = super_attributes['file_attributes']
    625         self._download_file(file_attributes)

~/anaconda3/envs/vba/lib/python3.8/site-packages/allensdk/api/cloud_cache/cloud_cache.py in data_path(self, file_id)
    592         """
    593         file_attributes = self._manifest.data_file_attributes(file_id)
--> 594         exists = self._file_exists(file_attributes)
    595         local_path = file_attributes.local_path
    596         output = {'local_path': local_path,

~/anaconda3/envs/vba/lib/python3.8/site-packages/allensdk/api/cloud_cache/cloud_cache.py in _file_exists(self, file_attributes)
    560 
    561         if not file_exists:
--> 562             file_exists = self._check_for_identical_copy(file_attributes)
    563 
    564         return file_exists

~/anaconda3/envs/vba/lib/python3.8/site-packages/allensdk/api/cloud_cache/cloud_cache.py in _check_for_identical_copy(self, file_attributes)
    502 
    503         with open(self._downloaded_data_path, 'rb') as in_file:
--> 504             available_files = json.load(in_file)
    505 
    506         matched_path = None

~/anaconda3/envs/vba/lib/python3.8/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    291     kwarg; otherwise ``JSONDecoder`` is used.
    292     """
--> 293     return loads(fp.read(),
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,

~/anaconda3/envs/vba/lib/python3.8/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    355             parse_int is None and parse_float is None and
    356             parse_constant is None and object_pairs_hook is None and not kw):
--> 357         return _default_decoder.decode(s)
    358     if cls is None:
    359         cls = JSONDecoder

~/anaconda3/envs/vba/lib/python3.8/json/decoder.py in decode(self, s, _w)
    338         end = _w(s, end).end()
    339         if end != len(s):
--> 340             raise JSONDecodeError("Extra data", s, end)
    341         return obj
    342 

JSONDecodeError: Extra data: line 12 column 2 (char 2984)

Now, simply calling:

oeid = 993891850
cache.get_behavior_ophys_experiment(oeid)

results in the same error as above for any oeid.


Additional context: I'm assuming that the parallel downloads have somehow corrupted the manifest file. Is that true? If so, is there some other way to download the NWB files beyond what I tried above? Should I simply download them in a serial loop and wait however long it takes?

This is also related to a recent forum question (https://community.brain-map.org/t/visual-behavior-optical-physiology/1183), so I suspect external users will run into similar problems when attempting to parallelize the download process.

Do you want to work on this issue? Yes, I'd like to help solve this.

djkapner commented 3 years ago

I think the cache keeps track of things in a local file, and it wasn't written in a way that can handle multiprocess collisions on reads/writes of that file. @danielsf does that sound right?

danielsf commented 3 years ago

Yes. That is a correct assessment of cloud cache's capabilities, and is very likely what is happening.
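For anyone who lands here later: "Extra data" is exactly what the `json` module raises when a file contains more than one JSON document back to back, which is what interleaved writes from multiple processes can leave behind in the cache's bookkeeping file. A toy illustration (no allensdk involved; the dict contents are made up):

```python
import io
import json

# Simulate two processes whose writes to the same bookkeeping file
# interleave: the file ends up holding two JSON documents back to back.
buf = io.StringIO()
json.dump({"downloaded": ["file_a.nwb"]}, buf)
json.dump({"downloaded": ["file_b.nwb"]}, buf)

try:
    json.loads(buf.getvalue())
except json.JSONDecodeError as err:
    # The parser stops after the first document and complains
    # about the trailing bytes, e.g. "Extra data: line 1 column ..."
    print(err)
```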

dougollerenshaw commented 3 years ago

Got it. So I suppose the answer is just "don't do that", right?

Alternately, is there already a local cache of all NWB files somewhere in the /allen filesystem that would let me accomplish what I was trying to do without having to write a loop to download all NWB files serially?

djkapner commented 3 years ago

yes, "don't do that" :) no, we have not made a local copy of the downloaded cache

danielsf commented 3 years ago

I'm going to post this here in case we decide to add support for multithreaded downloading in the future. Apparently, thread safety in boto3 is a nontrivial question:

https://emasquil.github.io/posts/multithreading-boto3/

(we might be okay because S3CloudCache carries around a boto3 client, which seems to be the thread-safe part of the process)
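If multithreaded downloads are ever supported, the pattern from that post boils down to not sharing a mutable object between threads. A generic sketch using `threading.local` (the `make_client` factory is a hypothetical stand-in for constructing something like `boto3.session.Session().client("s3")` per thread; none of these names come from allensdk):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: each worker thread lazily builds and caches its
# own client instead of sharing one object across threads.
_thread_local = threading.local()

def client_for_thread(make_client):
    """Return a per-thread client, creating it on first use in each thread."""
    if not hasattr(_thread_local, "client"):
        _thread_local.client = make_client()
    return _thread_local.client

def demo():
    created = []

    def make_client():
        # Stand-in for an expensive client constructor.
        token = object()
        created.append(token)
        return token

    def worker(_):
        # Within one thread, repeated calls reuse the same client.
        first = client_for_thread(make_client)
        second = client_for_thread(make_client)
        assert first is second
        return id(first)

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(worker, range(16)))
    # At most one client was built per worker thread.
    return len(created)
```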

dougollerenshaw commented 3 years ago

Thanks @danielsf and @djkapner. I'll go ahead and write a loop to download all serially.

Feel free to close this issue if this isn't something you want to deal with, or leave it here as a future to-do if you'd like.
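For reference, the serial fallback can be a plain loop that retries transient failures and collects anything that still fails. A sketch under the thread's assumptions: `fetch` stands in for `cache.get_behavior_ophys_experiment` from the snippet above, and the helper name is made up.

```python
import time

def download_all(oeids, fetch, retries=3, delay=5.0):
    """Serially fetch each experiment id, retrying transient failures.

    `fetch` stands in for cache.get_behavior_ophys_experiment; any id
    still failing after `retries` attempts is returned with its last
    exception so it can be inspected afterwards.
    """
    failed = {}
    for oeid in oeids:
        for attempt in range(retries):
            try:
                fetch(oeid)
                break  # success, move to the next experiment
            except Exception as err:  # record and retry, don't abort the run
                if attempt == retries - 1:
                    failed[oeid] = err
                else:
                    time.sleep(delay)
    return failed
```

Called as `download_all(oeids, cache.get_behavior_ophys_experiment)`, an empty dict would mean every file landed in `data_storage_directory`.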