Closed anna-chinn closed 2 years ago
Hi Anna, I don't have the answer but will weigh in that the storage issue is critical going forward in terms of TACC hosting CT scan data from Arctos collections as well.
Jonathan L. Dunnum Ph.D. Senior Collection Manager Division of Mammals, Museum of Southwestern Biology University of New Mexico Albuquerque, NM 87131 (505) 277-9262 Fax (505) 277-1351
MSB Mammals website: http://www.msb.unm.edu/mammals/index.html Facebook: http://www.facebook.com/MSBDivisionofMammals
Shipping Address: Museum of Southwestern Biology Division of Mammals University of New Mexico CERIA Bldg 83, Room 204 Albuquerque, NM 87131
From: Anna Chinn notifications@github.com Sent: Tuesday, April 7, 2020 1:19 PM To: ArctosDB/arctos arctos@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [ArctosDB/arctos] TACC/Arctos media storage limitations? (#2582)
Hi all! I'm hoping to get some input on the feasibility of an upcoming CLIR grant proposal plan that CHAS is putting together. The proposal is a resubmission, so Erica may have discussed this with some of you a few years ago, but Dawn and I need a refresher!
We are proposing to digitize (i.e. scan) our motion picture film collection and, along the way, catalogue the films in Arctos, store both high and low res copies of the video files at TACC Corral, and connect it all using the Arctos media module. When the project is complete, we'll have approximately 130 hours of scanned footage and 1,400 catalog records.
Based on some preliminary scans we had done this winter, it looks like scanning and preserving the entire collection will require ~20TB (!!) of storage space. Is this amount of TACC storage feasible given a generic Arctos MOU? If we need a separate agreement with TACC, how have other folks handled ancillary TACC agreements for oversized media storage?
Any and all feedback welcome, either here or by email. Thanks for your help. :)
@droberts49
Similarly, I was just in conversation with someone who wants to see if we have the capacity to archive several terabytes of bat call audio files.
I don't have anything very solid either. I think the current Arctos allocation is "about 10TB" but I'm not sure where that came from (seemed like a very large number at the time?) or how hard the limit is.
I believe Morphosource would accept CT scans, but if they were mine (or even something I was using for research) I'd probably want a set of the originals somewhere like TACC as well, Just In Case.
It's also not clear what "officially" happens when your grant is over and you can't pay for storage any longer. I'm positive that TACC would do whatever they could to preserve the files, but they have costs too. I'm tempted to suggest we find a way to write that into Arctos costs, but I'd not want to burden everyone else paying for the 500 exabytes of random internet cat pictures that some collection will eventually show up with.
Maybe we as a community can at least develop guidelines or best practices?
This sounds like something to include in the data and business plans. I agree with writing TACC storage costs into the permanent Arctos budget. Jon, how many terabytes are we looking at for CT scans? How much would it cost to reserve 100TB as a community?
We won't know exactly how much storage we will need for this going forward. Currently we (MSB mammals) have about 800 specimens (10-15 GB each) scanned and on hard drives awaiting archival. For our current project we anticipate another 500 or so, but I know MVZ already has a bunch of scanned material as well, and we want to think about many more down the road. MSB will be uploading ours to Morphosource, but by no means do we want that to be the only location (it will go away at some point). We envision also storing at TACC and having the Arctos record link out to the Morphosource record too.
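For a rough sense of scale, the figures above can be turned into a storage range. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1000 GB) and that the additional 500 or so specimens fall in the same 10-15 GB range. The function name is made up for illustration.

```python
# Back-of-the-envelope estimate from the figures quoted in this thread.
GB_PER_TB = 1000  # assumption: decimal units


def storage_range_tb(specimens, gb_low=10, gb_high=15):
    """Return a (low, high) storage estimate in TB for a number of CT-scanned specimens."""
    return (specimens * gb_low / GB_PER_TB, specimens * gb_high / GB_PER_TB)


current = storage_range_tb(800)        # scans already sitting on hard drives
planned = storage_range_tb(800 + 500)  # assumption: new scans are the same size

print(f"already scanned: {current[0]:.1f}-{current[1]:.1f} TB")
print(f"after current project: {planned[0]:.1f}-{planned[1]:.1f} TB")
```

So the existing scans alone land somewhere around 8-12 TB, before counting MVZ material or anything further down the road.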
On Tue, Apr 7, 2020 at 3:41 PM dustymc wrote:
> cost to reserve 100TB
https://portal.tacc.utexas.edu/user-guides/corral
~$12K
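As a rough check, that quote works out to a per-terabyte rate that can be scaled to other requests. The sketch below assumes the ~$12K figure is an annual cost and that pricing scales linearly with size; neither is confirmed here, and the names are hypothetical.

```python
# Per-TB rate implied by the "~$12K for 100TB" figure.
# Assumptions: the figure is per year and pricing is linear in size.
QUOTED_COST_USD = 12_000
QUOTED_SIZE_TB = 100

rate_per_tb = QUOTED_COST_USD / QUOTED_SIZE_TB  # USD per TB per year


def annual_cost(tb, rate=rate_per_tb):
    """Estimated yearly cost in USD for storing `tb` terabytes at the implied rate."""
    return tb * rate


print(f"implied rate: ${rate_per_tb:.0f}/TB/year")
print(f"20 TB (the film proposal): ${annual_cost(20):,.0f}/year")
```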
> 800 specimens (10-15 GB each) scanned and on hard drives awaiting archival
FWIW the last time I filled a bunch of external drives up we eventually lost something like 40% of them. That seems to be a wildly variable number, but the costs of buying and managing massive redundancy or of potentially not getting all of your 12TB back out should be considered somewhere in here. Then we permanently lost a bunch of images from MorphBank, so I'm a little paranoid of services as well....
Arctos saving people from facing that scenario has some value, I think. (Adding things like better upload tools would enhance that.)
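To put a number on the redundancy point, here is a deliberately simple sketch. It assumes each drive fails independently with the ~40% loss rate mentioned above, which is almost certainly too tidy (drives bought together and stored together tend to fail together), so treat the output as illustrative only. The function name is hypothetical.

```python
def loss_probability(copies, drive_failure_rate=0.4):
    """Chance that every copy of a file is lost, assuming independent drive failures."""
    return drive_failure_rate ** copies


# How quickly extra copies on separate drives reduce the risk of total loss.
for copies in (1, 2, 3, 4):
    print(f"{copies} copies: {loss_probability(copies):.1%} chance of losing the file")
```

Even under these generous assumptions it takes three or more copies per file to get the risk into single digits, which is where the cost of buying and managing the extra drives comes in.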
We are hoping to do much more than link to MS - https://github.com/ArctosDB/arctos/issues/1882
TACC has some tape archive that is less accessible but significantly cheaper than disk. Tossing 12TB of "hope we never need this" .stl files on there and getting them back with a couple of days' notice might be feasible. Maybe that could even make sense for "originals" (DNGs and WAVs and such) in Arctos, but it would probably take a fair bit of data for the storage savings to balance the development costs.
Can you explain how WAV files are currently stored, or how their derivatives are stored at TACC? What file extensions would they be converted to? This is also relevant to the bat call files, which are currently in some sort of old anabat extension file readable by which could theoretically be converted to .wav - trying to get more info on this. https://www.titley-scientific.com/us/support/faqs#Question6 https://www.titley-scientific.com/us/downloads/analysis-software?SID=go0te1j5g2hc31u1j7pfqerk47
http://arctos.database.museum/media/10273014 is a 2MB MP3 file that your browser probably knows how to stream. It's something like a JPG.
http://arctos.database.museum/media/10060966 is the 29MB original that your browser will probably just download. It's more like a DNG (or maybe TIFF - I'm not sure what comes out of a digital recorder).
The former is missing a LOT of information, but you probably can't tell. You'd want the latter if you were doing some sort of analysis; your computer certainly can use the information that your brain doesn't need.
http://arctos.database.museum/guid/MVZ:Bird:183070 is a related record.
http://handbook.arctosdb.org/documentation/media.html#binary-object-creation-guidelines
Everything's stored on disk now. The MP3 probably does what 99% of users need. We could potentially save some $$ by moving the archival to cheaper storage. We'd pay some of that back in development - we'd need a way to push to and retrieve from the archival storage, and that'd probably involve people. Accessing the original would involve some sort of non-instant process - maybe a monthly pull after being approved or something.
To be clear, I'm a big fan of keeping everything accessible and on disk (and properly licensed to control usage), BUT there may be a balance between that and the monetary costs of storing things that most users don't need.
I don't know enough about audio, much less bat audio, to say anything intelligent, but the general idea of keeping all of the information in an open-source container and a "viewable" derivative in a more accessible and physically smaller format should carry across. Given a DNG file and the open-source DNG specification, the survivors of The Apocalypse should have little trouble accessing everything that came out of the camera after they rebuild a technical civilization; whatever you do with audio should seek to provide the same sort of persistence.
Maybe @ccicero can comment more specifically on audio formats.
This is how we do bats with WAV files. The image is what you see when you open the WAV file in Kaleidoscope software. http://arctos.database.museum/guid/UAMObs:Mamm:190. Chrome opens the wav and plays the audio but you can right click to download the linked file.
-- Aren Gunderson Mammal Collection Manager University of Alaska Museum of the North http://www.uaf.edu/museum 1962 Yukon Drive Fairbanks, AK 99775 amgunderson@alaska.edu 907-474-6947
@amgunderson That's interesting. So you have to have the Kaleidoscope software on your desktop to hear it? That sounds similar to the Anabat situation.
No, you can hear it through chrome, or any audio player, but you can view/analyse it in Kaleidoscope (there is a free version here, https://www.wildlifeacoustics.com/ if you make an account).
> Accessing the original would involve some sort of non-instant process - maybe a monthly pull after being approved or something.
This kind of archival storage would work for us. I imagine that smaller derivative video files (~175MB/minute of footage) would be usable for most purposes other than screenings. On the scale of the entire project, these derivatives still require about 1.4TB to store, but that scale seems a lot more manageable.
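The derivative estimate can be reproduced from the project figures. This sketch assumes decimal units and, for the second number, that the full ~20TB is spread evenly over the 130 hours of footage, so that figure is an upper bound on the average high-res data rate rather than a measured one.

```python
# Reproducing the derivative estimate from the project figures in this thread.
FOOTAGE_HOURS = 130
DERIVATIVE_MB_PER_MIN = 175
TOTAL_PROJECT_TB = 20  # "~20TB" for scanning and preserving the whole collection

minutes = FOOTAGE_HOURS * 60

# 1 TB = 1,000,000 MB (assumption: decimal units)
derivative_tb = minutes * DERIVATIVE_MB_PER_MIN / 1_000_000

# Assumption: all ~20TB spread evenly over the footage (upper bound on the average).
high_res_gb_per_min = TOTAL_PROJECT_TB * 1000 / minutes

print(f"{minutes} minutes of footage")
print(f"derivatives: {derivative_tb:.2f} TB")
print(f"high-res average: at most about {high_res_gb_per_min:.1f} GB/minute")
```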
CHAS doesn't have the internal budget to spend $2.4K annually to keep 20TB of material on TACC once the grant period ends, but we have enough wiggle room that we could contribute to more holistic changes to Arctos media/data storage.
From TACC:
The tape archive is available for 1/4 the usual Corral disk cost, so around $30/TB/year at present. I’m not sure quite how we would do a periodic retrieval, but we can figure something out. The other option would be to use the unreplicated Corral storage, which is currently 1/2 of the Corral fee or $60/TB/year. That would at least be simpler to manage.
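Putting TACC's numbers side by side for the film project gives a sense of the trade-off. This is a sketch, not a quote: the $120/TB/year replicated rate is inferred from "1/4 the usual Corral disk cost", and the hybrid split (about 20TB of originals on tape, about 1.4TB of derivatives on replicated disk) is an assumption for illustration. Tier names are made up.

```python
# Annual cost comparison across the storage options TACC described.
# Rates are USD per TB per year; the replicated rate is inferred, not quoted.
RATES = {
    "corral_replicated": 120,
    "corral_unreplicated": 60,
    "tape": 30,
}


def annual_cost(tb, tier):
    """Yearly cost in USD for `tb` terabytes on the given tier."""
    return tb * RATES[tier]


ORIGINALS_TB = 20      # high-res scans (assumption: the full ~20TB)
DERIVATIVES_TB = 1.4   # low-res viewing copies

for tier in RATES:
    print(f"{tier}: ${annual_cost(ORIGINALS_TB, tier):,.0f}/year for {ORIGINALS_TB} TB")

# Hybrid: originals on tape, derivatives kept instantly accessible on disk.
hybrid = annual_cost(ORIGINALS_TB, "tape") + annual_cost(DERIVATIVES_TB, "corral_replicated")
print(f"hybrid (tape + disk derivatives): ${hybrid:,.0f}/year")
```

Under those assumptions the hybrid layout comes to under a third of the cost of keeping everything on replicated disk, not counting whatever development and staff time the periodic retrieval would need.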
> doesn't have the internal budget ... once the grant period ends,
So say we all. I haven't a clue and it's well beyond the scope of this issue, but that somehow needs to be viewed similarly to cases and jars and freezers and such by funding agencies and institutional administrators.
Sounds like yes, we need to budget for short-term media storage in grants and for the long term in the Arctos budget. The open question is how to allocate costs between collections with expensive storage and others with none. New issue: #2587