yger closed this issue 4 years ago.
The download is resumable, so you should be able to restart and it will pick up where it left off. It downloads individual 10 MB chunks.
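The chunked, resumable behavior can be sketched roughly like this. This is illustrative only: the chunk naming and cache layout below are invented for the sketch, not kachery's actual scheme.

```python
# Illustrative sketch of a resumable chunked download: the file is split
# into fixed-size chunks, each cached as its own file, so a restart only
# needs to fetch the chunks that are not yet on disk.
# (Hypothetical layout -- not the actual kachery implementation.)
import os
import tempfile

CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB chunks, as described above

def chunk_ranges(total_size, chunk_size=CHUNK_SIZE):
    """Yield (start, end) byte ranges covering a file of total_size bytes."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size)

def missing_chunks(cache_dir, total_size):
    """Return the byte ranges whose cached chunk file does not exist yet."""
    missing = []
    for start, end in chunk_ranges(total_size):
        path = os.path.join(cache_dir, f"chunk-{start}-{end}")
        if not os.path.exists(path):
            missing.append((start, end))
    return missing
```

On restart, only the ranges returned by `missing_chunks` would need to be downloaded again.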
The info about the data is here (link from the repo): http://ephys1.laboratorybox.org/default/recording/cortexlab-single-phase-3?feed=sha1://0e21c51ee33df3921049bdee7c79fe271aefe746/feed.json
So it is 374 channels selected from the 384-channel raw file.
62.89376 minutes at 30 kHz gives 62.89376 × 60 × 30000 × 384 × 2 / 1e9 ≈ 87 GB.
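The size estimate above can be checked with a few lines of Python (duration × sampling rate × channel count × 2 bytes per int16 sample):

```python
# Recompute the expected raw-file size from the parameters in the thread.
duration_min = 62.89376   # recording length in minutes
fs = 30000                # sampling rate in Hz
n_channels = 384          # channels in the raw file
bytes_per_sample = 2      # int16 samples

size_bytes = duration_min * 60 * fs * n_channels * bytes_per_sample
print(f"{size_bytes / 1e9:.1f} GB")  # about 87 GB
```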
Regarding launching a sorter on it, it depends on the sorter I guess. But yeah, I expect it should work.
I've also created this script (with help from A. Morley) which may be useful for you: https://github.com/flatironinstitute/neuropixels-data-sep-2020/blob/master/scripts/download_recordings.py
You could also try out sorting on a smaller, say 20-minute, section (you already have that data downloaded, presumably) by using an se.SubRecordingExtractor, as in the example script.
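A minimal sketch of restricting to the first 20 minutes. This assumes spikeextractors is imported as `se` and `recording` is already loaded; the extractor call itself is shown as a comment since it requires the package, but the frame arithmetic is what matters.

```python
# Sketch: take only the first 20 minutes of the recording before sorting.
# Assumes a loaded `recording` and `import spikeextractors as se`.
fs = 30000                 # sampling rate in Hz
end_frame = 20 * 60 * fs   # number of frames in 20 minutes

# sub_recording = se.SubRecordingExtractor(
#     parent_recording=recording, start_frame=0, end_frame=end_frame)
```

The sub-recording can then be passed to any sorter in place of the full recording.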
Good, thanks for the tip! Indeed, everything is cached in the kachery folder; I can see that now. I was worried that resuming wouldn't work. The data are not filtered, I guess? Thanks, I hope I'll stop bothering you now.
Thanks! Don't hesitate to report any other issues; this is very helpful for us (debugging) since everything is at an early stage.
And just to understand: what you are building is a way to automatically visualize the results of the sorting in a browser? That is, I guess, what the kachery feeds are for, am I right? But for the workshop I think I'll stick to getting the data (already long enough, if I want to get all the datasets) and sorting them. My concern is just how we can standardize the visualization among sorters... Should I present the results in a particular format, or just do whatever I can and we'll discuss them all together?
Do whatever you would normally do (we don't have enough visualization tools yet anyway). But if you do get a sorting, and you think it would be helpful to share, then you could also share it with the repo. I can provide instructions if you end up with a sorting in an SI sorting extractor.
Hey. I have the same issue, but it continuously crashes less than a minute after starting the download:
alexmorley@alex-XPS-15:~/git_repos/neuropixels-data-sep-2020$ KACHERY_STORAGE_DIR=/tmp python scripts/download_recordings.py
Creating (1 of 1): recordings/allen_mouse415148_probeE.dat
Traceback (most recent call last):
File "scripts/download_recordings.py", line 22, in <module>
recording = nd.load_recording(recording_id, download=True)
File "./neuropixels_data_sep_2020/recordings.py", line 36, in load_recording
recording = LabboxEphysRecordingExtractor(uri, download=download)
File "./neuropixels_data_sep_2020/extractors/labboxephysrecordingextractor.py", line 222, in __init__
self._recording: se.RecordingExtractor = Bin1RecordingExtractor(**data, p2p=True, download=download)
File "./neuropixels_data_sep_2020/extractors/binextractors/bin1recordingextractor.py", line 21, in __init__
kp.load_file(self._raw)
File "/home/alexmorley/anaconda3/envs/MYENV/lib/python3.8/site-packages/kachery_p2p/core.py", line 129, in load_file
raise LoadFileError(f'Error loading file: {r["error"]}: {uri}')
kachery_p2p.exceptions.LoadFileError: Error loading file: File found, but no providers passed test load.: sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce
^ maybe a server issue?
alexmorley@alex-XPS-15:~/git_repos/neuropixels-data-sep-2020$ KACHERY_STORAGE_DIR=/tmp python scripts/download_recordings.py
Creating (1 of 1): recordings/allen_mouse415148_probeE.dat
Loaded 1280000000 of 80640000000 bytes (1.6 %): sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce
Loaded 2560300000 of 80640000000 bytes from f64510 (3.2 %): sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce
Loaded 2562100000 of 80640000000 bytes from f64510 (3.2 %): sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce
Traceback (most recent call last):
File "scripts/download_recordings.py", line 22, in <module>
recording = nd.load_recording(recording_id, download=True)
File "./neuropixels_data_sep_2020/recordings.py", line 36, in load_recording
recording = LabboxEphysRecordingExtractor(uri, download=download)
File "./neuropixels_data_sep_2020/extractors/labboxephysrecordingextractor.py", line 222, in __init__
self._recording: se.RecordingExtractor = Bin1RecordingExtractor(**data, p2p=True, download=download)
File "./neuropixels_data_sep_2020/extractors/binextractors/bin1recordingextractor.py", line 21, in __init__
kp.load_file(self._raw)
File "/home/alexmorley/anaconda3/envs/MYENV/lib/python3.8/site-packages/kachery_p2p/core.py", line 129, in load_file
raise LoadFileError(f'Error loading file: {r["error"]}: {uri}')
kachery_p2p.exceptions.LoadFileError: Error loading file: Download failed.: sha1://39ae3fcccd3803170dd97fc9a8799e7169214419/continuous.dat?manifest=31942d7d97ff3a46fa1dbca72d8dc048bd65d5ce
This one pops up more often.
Must be a network connectivity problem... we've only really tested in a few locations (in the USA). And it uses some UDP tricks for the P2P communication. So I think the best course is for me to make these available on a node that has a reliable public TCP port. I expect things should work better at that point. I'll start putting the data there and let you know when it is ready.
OK. In the meantime I can try from a box that I have on the west coast and see if that helps.
Sounds good. I now have a linode instance with a reliable public TCP port open that is currently downloading the data (at a pretty brisk pace). So I expect that, as the download proceeds, things will be failing less often (hopefully a lot less often). The first dataset it is downloading is the cortexlab-single-phase-3, and then it will proceed in order down the list. I'll update you when all the data is available in this way.
Tagging: @jsoules
Again, I'm having these FileNotFound errors, so the download is crashing before completion (both for cortexlab phase3 and the allen data). I resumed the download, but still. Side question: as far as I understand, resuming deletes the file, then recopies already-downloaded chunks from the kachery folder. However, is this folder wiped out if the machine is rebooted? The kachery daemon seems to run fine, so I don't know what to do...
@yger, please pull the latest version of this repo and rerun pip install -e . to make sure everything is up to date.
Which script are you using for the download? Are you using the one in scripts/download_recordings.py?
If you could try it again, please share the last part of the console error message from both the k-p2p daemon and your Python log.
To answer your question about data chunks: those are stored in $KACHERY_STORAGE_DIR. If that is in your /tmp directory, then it might get erased when you restart the computer. Otherwise they should persist (and downloads should resume).
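One way to make the cache survive reboots (assuming a bash-like shell; the directory name here is just an example) is to point the variable at a persistent location instead of /tmp before starting the daemon and the download script:

```shell
# Put the kachery cache somewhere persistent (not /tmp, which is
# typically cleared on reboot) so a resumed download can reuse
# already-fetched chunks. The path is an arbitrary example.
export KACHERY_STORAGE_DIR="$HOME/kachery-storage"
mkdir -p "$KACHERY_STORAGE_DIR"
echo "$KACHERY_STORAGE_DIR"
```

The same value must be set in every shell that runs the daemon or the download script, otherwise they will not see the same cache.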
But I did just this morning update the manifest for one of the recordings (allen_mouse419112_probeE) so I would expect that particular one to restart the download.
Now that the data is on multiple computers including one with open TCP port, I would be surprised if you still have a download error. But please report if you do.
Thanks. I've updated and relaunched. I'm using the script you gave me, download_recordings.py
Okay, fingers crossed.
My 2p... here in Edinburgh Informatics we have a rather "special" setup; I think there's some heavy traffic control. For instance, I'm basically unable to interact with Dandi as well; everything times out quickly (they run on an AWS S3 bucket).
As for kachery, I cannot even get the daemon to start (it terminates with Unexpected identifier). I think the timeout problems are due to traffic filtering/shaping; it may be worth checking with your local support people (here they are reluctant to admit there's a problem, but I will eventually convince them!).
Fortunately I have a machine outside this super-secure zone, and there the download works perfectly.
In my case, everything is working now, I am able to get the files!
@yger, glad it's finally working, thanks for helping us to troubleshoot.
@mhhennig , I hope that everyone will have access to the data. I did add an accessible node to the "swarm" or "channel" that I believe only requires an outgoing TCP connection. Although who knows, maybe you are blocked in more severe ways than I thought.
But the Unexpected Identifier problem is probably related to your version of NodeJS or your OS. Maybe we can troubleshoot that. Did you use a conda env?
Thanks, yes, that was the problem! Made a fresh env, now it is working...
I downloaded the cortexlab phase 3 file for several hours, but sadly the download ended with this error, despite no noticeable network interruptions:
Loaded 12000000 of 20000000 bytes from 9beb5d (60.0 %): sha1://29c76d52197791a4065dde95f8639cdbea558777?chunkOf=1b8592f0240603ae1019379cb47bad6475503aaf~28880000000~28900000000
Error loading file: Download failed.: sha1://29c76d52197791a4065dde95f8639cdbea558777?chunkOf=1b8592f0240603ae1019379cb47bad6475503aaf~28880000000~28900000000
Elapsed time: 20224.59192246478