dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

Building a repository for 3D behavioral imaging data #50

Open joehand opened 8 years ago

joehand commented 8 years ago

From @alexbw on November 18, 2014 3:49

Hi all, my name's Alex, I'm a friend of Max's in Boston. I'm a PhD student in neuroscience at Harvard, and I've built a tool which allows you to record the behavior of animals in 3D using the Microsoft Kinect. Most of my research revolves around what to actually DO with that data, but getting nicely processed data in the first place has been a challenge.

Here's an example of what the data looks like once it's processed and ready for analysis: (yes, that's a little blobby lab mouse running around, as detected by the Kinect, and the inset is the extracted and aligned mouse)

https://www.dropbox.com/s/3v7kjwwyfrjp02u/mouse_clip.mp4?dl=0

Turns out, looking at behavior quantitatively in this way is shockingly new and useful to neuroscientists at large, so a lot of folks have been asking to collaborate with us. We've been overwhelmed. So, we started to partner with some labs and companies to build a platform for recording, uploading, storing, sharing and analyzing this data.

As you might guess, our main problem is the size of the data. We need to get it to a central location for processing: getting results that are useful to researchers currently requires lots of computers crunching on hours of data. We're working on making it more efficient, but for now we need much more than a desktop. I just don't know an efficient way to get tens of gigabytes of data per day reliably to EC2 for storage and processing.

I know that Max has been working on this project with lots of brilliant people, and I asked Max on Twitter if we could talk about this problem, and he said to post an issue here. So, here's the issue!

To be clear, we

What do you think? Happy to answer any questions, provide more images/movies to illustrate.

Copied from original issue: maxogden/dat#219

juliangruber commented 8 years ago

Do need a centralized storage location, mostly because processing requires big guns.

Does that mean that storage and processing need to happen on the same machine?

Other than that, this sounds perfect for hosted dat.

juliangruber commented 8 years ago

I guess we could talk about a hosted data pipeline for dat as well...

alexbw commented 8 years ago

They don't have to happen on the same machine. We'll just pipe the data from the storage machine to the processing machine. Processing the data in this case greatly reduces the data size we need to handle.

On Fri, Jun 17, 2016 at 3:13 PM Julian Gruber notifications@github.com wrote:

Do need a centralized storage location, mostly because processing requires big guns.

Does that mean that storage and processing need to happen on the same machine?

Except for this, this sounds perfect for hosted dat

todrobbins commented 8 years ago

Super interesting work! Though the Dropbox link to the video clip isn't working. Is each session upload ~10GB or is that a collection of sessions from a single user?

Cheers!