ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
Apache License 2.0

TACC/Arctos media storage limitations? #2582

Closed anna-chinn closed 2 years ago

anna-chinn commented 4 years ago

Hi all! I'm hoping to get some input on the feasibility of an upcoming CLIR grant proposal plan that CHAS is putting together. The proposal is a resubmission, so Erica may have discussed this with some of you a few years ago, but Dawn and I need a refresher!

We are proposing to digitize (i.e. scan) our motion picture film collection and, along the way, catalogue the films in Arctos, store both high and low res copies of the video files at TACC Corral, and connect it all using the Arctos media module. When the project is complete, we'll have approximately 130 hours of scanned footage and 1,400 catalog records.

Based on some preliminary scans we had done this winter, it looks like scanning and preserving the entire collection will require ~20TB (!!) of storage space. Is this amount of TACC storage feasible under a generic Arctos MOU? If we need a separate agreement with TACC, how have other folks handled ancillary TACC agreements for oversized media storage?

Any and all feedback welcome, either here or by email. Thanks for your help. :)

@droberts49

jldunnum commented 4 years ago

Hi Anna, I don't have the answer, but I will weigh in that the storage issue is critical going forward in terms of TACC hosting CT scan data from Arctos collections as well.


Jonathan L. Dunnum Ph.D. Senior Collection Manager Division of Mammals, Museum of Southwestern Biology University of New Mexico Albuquerque, NM 87131 (505) 277-9262 Fax (505) 277-1351

MSB Mammals website: http://www.msb.unm.edu/mammals/index.html Facebook: http://www.facebook.com/MSBDivisionofMammals

Shipping Address: Museum of Southwestern Biology Division of Mammals University of New Mexico CERIA Bldg 83, Room 204 Albuquerque, NM 87131



campmlc commented 4 years ago

Similarly, I was just in conversation with someone who wants to see if we have the capacity to archive several terabytes of bat call audio files.


dustymc commented 4 years ago

I don't have anything very solid either. I think the current Arctos allocation is "about 10TB" but I'm not sure where that came from (seemed like a very large number at the time?) or how hard the limit is.

I believe Morphosource would accept CT scans, but if they were mine (or even something I was using for research) I'd probably want a set of the originals somewhere like TACC as well, Just In Case.

It's also not clear what "officially" happens when your grant is over and you can't pay for storage any longer. I'm positive that TACC would do whatever they could to preserve the files, but they have costs too. I'm tempted to suggest we find a way to write that into Arctos costs, but I'd not want to burden everyone else with paying for the 500 exabytes of random internet cat pictures that some collection will eventually show up with.

Maybe we as a community can at least develop guidelines or best practices?

campmlc commented 4 years ago

This sounds like something to include in the data and business plans. I agree with writing TACC storage costs into the permanent Arctos budget. Jon, how many terabytes are we looking at for CT scans? How much would it cost to reserve 100TB as a community?


dustymc commented 4 years ago

cost to reserve 100TB

https://portal.tacc.utexas.edu/user-guides/corral

~$12K

jldunnum commented 4 years ago

We won't know exactly how much storage we will need for this going forward. Currently we (MSB mammals) have about 800 specimens (10-15 GB each) scanned and on hard drives awaiting archival. For our current project we anticipate another 500 or so, but I know MVZ already has a bunch of scanned material as well, and we want to think about many more down the road. MSB will be uploading ours to Morphosource, but by no means do we want that to be the only location (it will go away at some point). We envision also storing at TACC and having the Arctos record link out to the Morphosource record too.
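For a rough sense of scale, the figures above can be turned into a back-of-envelope estimate. This is only a sketch: the decimal conversion (1 TB = 1000 GB) and the assumption that the additional ~500 specimens fall in the same 10-15 GB range are illustrative, not something stated in the thread, and the function name is hypothetical.

```python
GB_PER_TB = 1000  # assumption: decimal units


def scan_storage_tb(specimens, gb_low=10, gb_high=15):
    """Return a (low, high) storage estimate in TB for CT-scanned specimens.

    gb_low / gb_high are the per-specimen sizes quoted in the thread.
    """
    return (specimens * gb_low / GB_PER_TB, specimens * gb_high / GB_PER_TB)


# Scans already sitting on hard drives.
current_low, current_high = scan_storage_tb(800)

# Assumption: the ~500 additional specimens are the same size per scan.
projected_low, projected_high = scan_storage_tb(800 + 500)

print(f"Already scanned: {current_low:.1f} to {current_high:.1f} TB")
print(f"With current project: {projected_low:.1f} to {projected_high:.1f} TB")
```

Under those assumptions the existing scans alone come to roughly 8 to 12 TB, and material from other collections would come on top of that.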



dustymc commented 4 years ago

800 specimens (10-15 GB each) scanned and on hard drives awaiting archival

FWIW the last time I filled a bunch of external drives up we eventually lost something like 40% of them. That seems to be a wildly variable number, but the costs of buying and managing massive redundancy or of potentially not getting all of your 12TB back out should be considered somewhere in here. Then we permanently lost a bunch of images from MorphBank, so I'm a little paranoid of services as well....
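One inexpensive way to notice that kind of loss early is to record a checksum for every file before it goes onto a drive or a service, then re-verify periodically. The following is a minimal sketch using only Python's standard library. The function names and the manifest layout are hypothetical, and nothing here reflects how Arctos, TACC, or MorphBank actually verify files.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-GB scans never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(root)
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

A manifest built once at scan time and stored separately from the drives (for example, saved as JSON) would let `verify_manifest` be re-run against each copy later; an empty result means every file is still present and unchanged.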

Arctos saving people from facing that scenario has some value, I think. (Adding things like better upload tools would enhance that.)

We are hoping to do much more than link to MS - https://github.com/ArctosDB/arctos/issues/1882

TACC has some tape archive that is less accessible but significantly cheaper than disk. Tossing 12TB of "hope we never need this" .stl files on there and getting them back with a couple days notice might be feasible. Maybe that could even make sense for "originals" (DNGs and WAVs and such) in Arctos, but it would probably take a fair bit of data for the storage costs to balance the development costs.

campmlc commented 4 years ago

Can you explain how WAV files are currently stored, or how their derivatives are stored at TACC? What file extensions would they be converted to? This is also relevant to the bat call files, which are currently in some sort of old Anabat file format, readable by the analysis software linked below, which could theoretically be converted to .wav. I'm trying to get more info on this. https://www.titley-scientific.com/us/support/faqs#Question6 https://www.titley-scientific.com/us/downloads/analysis-software?SID=go0te1j5g2hc31u1j7pfqerk47


dustymc commented 4 years ago

http://arctos.database.museum/media/10273014 is a 2MB MP3 file that your browser probably knows how to stream. It's something like a JPG.

http://arctos.database.museum/media/10060966 is the 29MB original that your browser will probably just download. It's more like a DNG (or maybe TIFF - I'm not sure what comes out of a digital recorder).

The former is missing a LOT of information, but you probably can't tell. You'd want the latter if you were doing some sort of analysis; your computer certainly can use the information that your brain doesn't need.

http://arctos.database.museum/guid/MVZ:Bird:183070 is a related record.

http://handbook.arctosdb.org/documentation/media.html#binary-object-creation-guidelines

Everything's stored on disk now. The MP3 probably does what 99% of users need. We could potentially save some $$ by moving the archival to cheaper storage. We'd pay some of that back in development - we'd need a way to push to and retrieve from the archival storage, and that'd probably involve people. Accessing the original would involve some sort of non-instant process - maybe a monthly pull after being approved or something.

To be clear, I'm a big fan of keeping everything accessible and on disk (and properly licensed to control usage), BUT there may be a balance between that and the monetary costs of storing things that most users don't need.

I don't know enough about audio, much less bat audio, to say anything intelligent, but the general idea of keeping all of the information in an open-source container and a "viewable" derivative in a more accessible and physically smaller format should carry across. Given a DNG file and the open-source DNG specification, the survivors of The Apocalypse should have little trouble accessing everything that came out of the camera after they rebuild a technical civilization; whatever you do with audio should seek to provide the same sort of persistence.

Maybe @ccicero can comment more specifically on audio formats.

amgunderson commented 4 years ago

This is how we do bats with WAV files. The image is what you see when you open the WAV file in Kaleidoscope software. http://arctos.database.museum/guid/UAMObs:Mamm:190. Chrome opens the wav and plays the audio but you can right click to download the linked file.


-- Aren Gunderson Mammal Collection Manager University of Alaska Museum of the North http://www.uaf.edu/museum 1962 Yukon Drive Fairbanks, AK 99775 amgunderson@alaska.edu 907-474-6947

campmlc commented 4 years ago

@amgunderson That's interesting. So you have to have the Kaleidoscope software on your desktop to hear it? That sounds similar to the Anabat situation.

amgunderson commented 4 years ago

No, you can hear it through Chrome or any audio player, but you can view/analyse it in Kaleidoscope (there is a free version at https://www.wildlifeacoustics.com/ if you make an account).


anna-chinn commented 4 years ago

Accessing the original would involve some sort of non-instant process - maybe a monthly pull after being approved or something.

This kind of archival storage would work for us. I imagine that smaller derivative video files (~175MB/minute of footage) would be usable for most purposes other than screenings. On the scale of the entire project, these derivatives still require about 1.4TB to store, but that scale seems a lot more manageable.
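The 1.4TB figure follows directly from the numbers in this thread (130 hours of footage at ~175MB per minute). A short sketch of the arithmetic, assuming decimal units (1 TB = 1,000,000 MB); the function names are hypothetical:

```python
MB_PER_TB = 1_000_000  # assumption: decimal units


def footage_storage_tb(hours, mb_per_minute):
    """Storage in TB for a given running time and per-minute file size."""
    return hours * 60 * mb_per_minute / MB_PER_TB


def implied_mb_per_minute(total_tb, hours):
    """Per-minute size implied by a total storage figure and running time."""
    return total_tb * MB_PER_TB / (hours * 60)


derivatives_tb = footage_storage_tb(hours=130, mb_per_minute=175)
print(f"Derivatives: about {derivatives_tb:.3f} TB")

# Working backwards from the ~20TB estimate for the whole collection.
overall_rate = implied_mb_per_minute(total_tb=20, hours=130)
print(f"Implied overall rate: about {overall_rate:.0f} MB per minute")
```

Working backwards from the ~20TB total gives an overall rate of roughly 2.5GB per minute of footage, so the derivatives are on the order of 7% of the full set.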

CHAS doesn't have the internal budget to spend $2.4K annually to keep 20TB of material on TACC once the grant period ends, but we have enough wiggle room that we could contribute to more holistic changes to Arctos media/data storage.

dustymc commented 4 years ago

From TACC:

The tape archive is available for 1/4 the usual Corral disk cost, so around $30/TB/year at present. I’m not sure quite how we would do a periodic retrieval, but we can figure something out. The other option would be to use the unreplicated Corral storage, which is currently 1/2 of the Corral fee or $60/TB/year. That would at least be simpler to manage.
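To compare the options side by side, here is a rough sketch of annual costs at the rates quoted in this thread. The $120/TB/year figure for standard Corral disk is implied by the quote above and by the earlier ~$12K for 100TB; the tier names and the mixed split at the end are hypothetical illustrations, and the prices are the 2020 figures, not current ones.

```python
# USD per TB per year, as quoted in this thread (2020 figures).
RATES_USD_PER_TB_YEAR = {
    "corral_replicated": 120,   # implied: ~$12K for 100TB
    "corral_unreplicated": 60,  # "1/2 of the Corral fee"
    "tape_archive": 30,         # "1/4 the usual Corral disk cost"
}


def annual_cost(tb, tier):
    """Annual storage cost in USD for `tb` terabytes on the named tier."""
    return tb * RATES_USD_PER_TB_YEAR[tier]


for tier in RATES_USD_PER_TB_YEAR:
    print(f"20 TB on {tier}: ${annual_cost(20, tier):,.0f}/year")

# Hypothetical split: ~1.4TB of derivatives on replicated disk,
# with the full ~20TB kept on tape.
mixed = annual_cost(1.4, "corral_replicated") + annual_cost(20, "tape_archive")
print(f"Mixed plan: about ${mixed:,.0f}/year")
```

At these rates, 20TB comes to $2,400 per year on replicated disk, $1,200 unreplicated, and $600 on tape. The mixed split lands just under $800 per year, though it leaves out whatever development and staff time periodic retrieval would require.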

doesn't have the internal budget ... once the grant period ends,

So say we all. I haven't a clue and it's well beyond the scope of this issue, but that somehow needs to be viewed similarly to cases and jars and freezers and such by funding agencies and institutional administrators.

campmlc commented 4 years ago

Sounds like yes, we need to budget for short-term media storage in grants, and also budget for the long term in the Arctos budget. The open question is how to allocate costs between collections with expensive storage and others with none. New issue: #2587