flatironinstitute / neuropixels-data-sep-2020

Example neuropixels datasets for purposes of developing spike sorting algorithms
Apache License 2.0

Download of the file #12

Closed yger closed 4 years ago

yger commented 4 years ago

I downloaded the cortexlab-single-phase-3 file for several hours, but sadly the download ended with the error below, even though no network interruptions were noticeable:

    Loaded 12000000 of 20000000 bytes from 9beb5d (60.0 %): sha1://29c76d52197791a4065dde95f8639cdbea558777?chunkOf=1b8592f0240603ae1019379cb47bad6475503aaf~28880000000~28900000000
    Error loading file: Download failed.: sha1://29c76d52197791a4065dde95f8639cdbea558777?chunkOf=1b8592f0240603ae1019379cb47bad6475503aaf~28880000000~28900000000
    Elapsed time: 20224.59192246478

    :219: ResourceWarning: unclosed
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    /usr/lib/python3/dist-packages/apport/report.py:13: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
      import fnmatch, glob, traceback, errno, sys, atexit, locale, imp
    Traceback (most recent call last):
      File "test.py", line 46, in <module>
        raise error
      File "test.py", line 42, in <module>
        se.BinDatRecordingExtractor.write_recording(recording, 'recordings/cortexlab-single-phase-3.dat')
      File "/home/pierre/github/spikeextractors/spikeextractors/extractors/bindatrecordingextractor/bindatrecordingextractor.py", line 128, in write_recording
        write_to_binary_dat_format(recording, save_path, time_axis=time_axis, dtype=dtype, chunk_size=chunk_size)
      File "/home/pierre/github/spikeextractors/spikeextractors/extraction_tools.py", line 328, in write_to_binary_dat_format
        traces = recording.get_traces(start_frame=i * chunk_size,
      File "/home/pierre/github/neuropixels-data-sep-2020/neuropixels_data_sep_2020/extractors/labboxephysrecordingextractor.py", line 282, in get_traces
        return self._recording.get_traces(channel_ids=channel_ids, start_frame=start_frame, end_frame=end_frame)
      File "/home/pierre/github/neuropixels-data-sep-2020/neuropixels_data_sep_2020/extractors/binextractors/bin1recordingextractor.py", line 46, in get_traces
        buf = kp.load_bytes(self._raw, start=i1, end=i2, p2p=self._p2p)
      File "/home/pierre/.local/lib/python3.8/site-packages/kachery_p2p/core.py", line 252, in load_bytes
        a = load_bytes(
      File "/home/pierre/.local/lib/python3.8/site-packages/kachery_p2p/core.py", line 263, in load_bytes
        path = load_file(uri=uri, from_node=from_node, from_channel=from_channel)
      File "/home/pierre/.local/lib/python3.8/site-packages/kachery_p2p/core.py", line 129, in load_file
        raise LoadFileError(f'Error loading file: {r["error"]}: {uri}')
    kachery_p2p.exceptions.LoadFileError: Error loading file: Download failed.: sha1://29c76d52197791a4065dde95f8639cdbea558777?chunkOf=1b8592f0240603ae1019379cb47bad6475503aaf~28880000000~28900000000

Is this expected? My downloaded file is now 112 GB. How big should the file be exactly? Moreover, in theory this recording should have 384 channels at 30 kHz, am I right? However, when I use recording.save_to_probe(...), it saves a file with 374 channels. Is that normal? What would you recommend for downloading the data and launching a sorter on it?
magland commented 4 years ago

The download is resumable, so you should be able to restart it and it will pick up where it left off. It downloads individual 10 MB chunks.
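Concretely, that means a failed call can simply be retried. A sketch, assuming the kachery_p2p Python API from your traceback (the URI is a placeholder for whichever file you are fetching):

    import time
    import kachery_p2p as kp

    uri = 'sha1://...'  # placeholder: URI of the file to fetch

    # Completed 10 MB chunks are cached in $KACHERY_STORAGE_DIR, so each
    # retry resumes from where the previous attempt left off.
    for attempt in range(10):
        try:
            path = kp.load_file(uri)
            break
        except Exception as e:
            print(f'Attempt {attempt + 1} failed: {e}; retrying...')
            time.sleep(30)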

The info about the data is here (link from the repo): http://ephys1.laboratorybox.org/default/recording/cortexlab-single-phase-3?feed=sha1://0e21c51ee33df3921049bdee7c79fe271aefe746/feed.json

So it is 374 channels selected from the 384-channel raw file.

62.89376 minutes at 30 kHz gives 62.89376 × 60 × 30000 × 384 × 2 / 1e9 ≈ 87 GB for the raw file.
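A quick sanity check of that arithmetic in Python (assuming int16 samples, i.e. 2 bytes each):

    minutes = 62.89376
    sampling_rate = 30000   # Hz
    n_channels = 384        # raw file; 374 are kept after channel selection
    bytes_per_sample = 2    # int16

    size_bytes = minutes * 60 * sampling_rate * n_channels * bytes_per_sample
    print(size_bytes / 1e9)  # ~86.9 GB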

Regarding launching a sorter on it, it depends on the sorter I guess. But yeah, I expect it should work.

I've also created this script (with help from A. Morley) which may be useful for you: https://github.com/flatironinstitute/neuropixels-data-sep-2020/blob/master/scripts/download_recordings.py

magland commented 4 years ago

You could also try out sorting on a smaller, say 20-minute, section (you presumably already have that data downloaded) by using a se.SubRecordingExtractor, as in the example script.
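A minimal sketch, assuming recording was loaded via nd.load_recording as in that script (the 20-minute window is illustrative):

    import spikeextractors as se

    fs = recording.get_sampling_frequency()  # 30000 Hz for this dataset
    sub_recording = se.SubRecordingExtractor(
        parent_recording=recording,
        start_frame=0,
        end_frame=int(20 * 60 * fs),  # first 20 minutes
    )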

yger commented 4 years ago

Good, thanks for the tip! Indeed, everything is cached in the kachery folder; I can see that now. I was hesitant to resume. The data are not filtered, I guess? Thanks, I hope I'll stop bothering you now.

magland commented 4 years ago

Thanks! Don't hesitate to report any other issues; this is very helpful for us (debugging) since everything is at an early stage.

yger commented 4 years ago

And just to understand: what you are building is a way to automatically visualize the results of the sorting in a browser? This is, I guess, what the kachery feeds are for, am I right? But for the workshop I think I'll stick to getting the data (already long enough, if I want to get all the datasets) and sorting them. My concern is just how we can standardize the visualization among sorters... Should I present the results in a particular format, or just do whatever I can and we'll discuss them all together?

magland commented 4 years ago

Do whatever you would normally do (we don't have enough viz tooling yet anyway). But if you do get a sorting, and you think it would be helpful to share, you could contribute it back to the repo as well. I can provide instructions if you end up with a sorting in a SpikeInterface sorting extractor.
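For example, a sketch assuming your result ends up in a spikeextractors sorting object (the filename is illustrative):

    import spikeextractors as se

    # Write the sorting result to a single .npz file that is easy to share.
    se.NpzSortingExtractor.write_sorting(sorting, 'my_sorting.npz')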

alexmorley commented 4 years ago

Hey. I have the same thing, but it crashes repeatedly less than a minute after starting the download:

alexmorley@alex-XPS-15:~/git_repos/neuropixels-data-sep-2020$ KACHERY_STORAGE_DIR=/tmp python scripts/download_recordings.py 
Creating (1 of 1): recordings/allen_mouse415148_probeE.dat
Traceback (most recent call last):
  File "scripts/download_recordings.py", line 22, in <module>
    recording = nd.load_recording(recording_id, download=True)
  File "./neuropixels_data_sep_2020/recordings.py", line 36, in load_recording
    recording = LabboxEphysRecordingExtractor(uri, download=download)
  File "./neuropixels_data_sep_2020/extractors/labboxephysrecordingextractor.py", line 222, in __init__
    self._recording: se.RecordingExtractor = Bin1RecordingExtractor(**data, p2p=True, download=download)
  File "./neuropixels_data_sep_2020/extractors/binextractors/bin1recordingextractor.py", line 21, in __init__
    kp.load_file(self._raw)
  File "/home/alexmorley/anaconda3/envs/MYENV/lib/python3.8/site-packages/kachery_p2p/core.py", line 129, in load_file
    raise LoadFileError(f'Error loading file: {r["error"]}: {uri}')
kachery_p2p.exceptions.LoadFileError: Error loading file: File found, but no providers passed test load.: sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce

^ maybe a server issue?

alexmorley@alex-XPS-15:~/git_repos/neuropixels-data-sep-2020$ KACHERY_STORAGE_DIR=/tmp python scripts/download_recordings.py 
Creating (1 of 1): recordings/allen_mouse415148_probeE.dat
Loaded 1280000000 of 80640000000 bytes (1.6 %): sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce
Loaded 2560300000 of 80640000000 bytes from f64510 (3.2 %): sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce
Loaded 2562100000 of 80640000000 bytes from f64510 (3.2 %): sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce
Traceback (most recent call last):
  File "scripts/download_recordings.py", line 22, in <module>
    recording = nd.load_recording(recording_id, download=True)
  File "./neuropixels_data_sep_2020/recordings.py", line 36, in load_recording
    recording = LabboxEphysRecordingExtractor(uri, download=download)
  File "./neuropixels_data_sep_2020/extractors/labboxephysrecordingextractor.py", line 222, in __init__
    self._recording: se.RecordingExtractor = Bin1RecordingExtractor(**data, p2p=True, download=download)
  File "./neuropixels_data_sep_2020/extractors/binextractors/bin1recordingextractor.py", line 21, in __init__
    kp.load_file(self._raw)
  File "/home/alexmorley/anaconda3/envs/MYENV/lib/python3.8/site-packages/kachery_p2p/core.py", line 129, in load_file
    raise LoadFileError(f'Error loading file: {r["error"]}: {uri}')
kachery_p2p.exceptions.LoadFileError: Error loading file: Download failed.: sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce

This one pops up more often.

magland commented 4 years ago

Must be a network connectivity problem... we've only really tested in a few locations (USA), and it uses some UDP tricks for the P2P communication. So I think the best course is for me to make these available on a node that has a reliable public TCP port; I expect things to work better at that point. I'll start putting the data there and let you know when it is ready.

alexmorley commented 4 years ago

OK. In the meantime I can try from a box that I have on the west coast and see if that helps.

magland commented 4 years ago

Sounds good. I now have a linode instance with a reliable public TCP port open that is currently downloading the data (at a pretty brisk pace). So I expect that, as the download proceeds, things will be failing less often (hopefully a lot less often). The first dataset it is downloading is the cortexlab-single-phase-3, and then it will proceed in order down the list. I'll update you when all the data is available in this way.

Tagging: @jsoules

yger commented 4 years ago

Again, I'm having these FileNotFound errors, so the download is crashing before completion (both for cortexlab phase 3 and the Allen data). I resumed the download, but still. Side question: as far as I understand, resuming deletes the file, then recopies already-downloaded chunks from the kachery folder. However, is this folder wiped if the machine is rebooted? The kachery daemon seems to run fine, so I don't know what to do ...

magland commented 4 years ago

@yger, please pull the latest version of this repo and rerun pip install -e . to make sure everything is up to date.

Which script are you using for the download? Are you using the one in scripts/download_recordings.py?

If you could try it again, please share the last part of the console error messages from both the kachery-p2p daemon and your Python log.

To answer your question about data chunks: those are stored in $KACHERY_STORAGE_DIR. If that is in your /tmp directory, it might get erased when you restart the computer. Otherwise the chunks should persist and downloads should resume.
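A quick way to check where your chunks are going (a sketch; the /tmp warning matches the note above):

    import os

    storage_dir = os.environ.get('KACHERY_STORAGE_DIR')
    print('KACHERY_STORAGE_DIR =', storage_dir)
    if storage_dir and storage_dir.startswith('/tmp'):
        print('Warning: /tmp is typically cleared on reboot, so downloads would restart.')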

But I did just this morning update the manifest for one of the recordings (allen_mouse419112_probeE) so I would expect that particular one to restart the download.

Now that the data is on multiple computers, including one with an open TCP port, I would be surprised if you still have a download error. But please report it if you do.

yger commented 4 years ago

Thanks. I've updated and relaunched. I'm using the script you gave me, download_recordings.py

magland commented 4 years ago

Okay, fingers crossed.

mhhennig commented 4 years ago

My 2p... here in Edinburgh Informatics we have a rather "special" setup; I think there's some heavy traffic control. For instance, I'm basically unable to interact with DANDI as well; everything times out quickly (they run on an AWS S3 bucket).

As for kachery, I cannot even get the daemon to start (it terminates with "Unexpected identifier"). I think the timeout problems are due to traffic filtering/shaping; it may be worth checking with your local support people (here they are reluctant to admit there's a problem, but I will eventually convince them!).

Fortunately I have a machine outside this super-secure zone, and there the download works perfectly.

yger commented 4 years ago

In my case, everything is working now, I am able to get the files!

magland commented 4 years ago

@yger, glad it's finally working, thanks for helping us to troubleshoot.

@mhhennig , I hope that everyone will have access to the data. I did add an accessible node to the "swarm" or "channel" that I believe only requires an outgoing TCP connection. Although who knows, maybe you are blocked in more severe ways than I thought.

But the Unexpected Identifier problem is probably related to your version of NodeJS or your OS. Maybe we can troubleshoot that. Did you use a conda env?

mhhennig commented 4 years ago

> But the Unexpected Identifier problem is probably related to your version of NodeJS or your OS. Maybe we can troubleshoot that. Did you use a conda env?

Thanks, yes, that was the problem! Made a fresh env, now it is working...