datacarpentry / spreadsheet-ecology-lesson

Data Organization in Spreadsheets for Ecologists
https://datacarpentry.org/spreadsheet-ecology-lesson

A reference page of where to store data #261

Closed by tracykteal 5 years ago

tracykteal commented 5 years ago

In this lesson we talk about keeping raw data raw and backing up that data. We often then get questions about what options there might be for storing data. We don't currently have any guidelines, but it would be nice to have a reference page in Extras with that information that we could refer learners to.

@jperkel had some ideas for a reference page like this, with an 'if this then that' format, recognizing that people have different types and sizes of data, and different sets of options depending on local resources and data type. He offered to start putting something together.

hoytpr commented 5 years ago

Good idea @tracykteal, and @jperkel is welcome to bring some suggestions here. I'm feeling like we might want to get something going on the Development page, combined with @ErinBecker's remarks. Our data goes to a local "cloud" on the supercomputer (months), then to AWS (months), then is archived on tape (indefinitely). We also provide automated download scripts for just the data, and then for the ENTIRE output of RUN DATA.

This is related but beyond the scope of the current lesson: I also saw some people on Twitter worrying about the lack of metadata included with sequence data from providers, making submission to the SRA more difficult. They suggested the sequencing center submit directly to NCBI. That would save some time, for them anyway. But as an NCBI submission center, I'm familiar with the metadata on the sequencer side, which can (probably) be parsed easily. It's the submitter's metadata that is the most difficult to assemble (IMHO). While I haven't had a chance to look at the new NCBI submission site, it's likely a lot of the same info is still needed (strain/species, strain origin, experimental conditions, MANUSCRIPT authors and title, etc.). The sequencing center doesn't have this info, and making it mandatory for sequencing would, in my experience, only generate careless responses.

jperkel commented 5 years ago

I'm on it, @tracykteal, thanks!

jperkel commented 5 years ago

With help from @tracykteal, here's my first stab at some useful backup tips:

Tips for managing your data

  1. 3-2-1: Fires, floods, and theft all happen -- not to mention bitrot, the slow degradation of stored data over time. So, keep at least three copies of your data, on two different media, at least one of which is off-site.

  2. Talk to the experts: Make your institution’s professionals your first call. Ask your IT colleagues about free or low-cost backup options; your librarian about data-management strategies; and your grants officers about regulations governing how, and for how long, to store data.

  3. Safeguard privacy: Private data, such as student information, cannot be stored just anywhere, and if those data are lost or stolen, you could face consequences. If you have, or plan to store, such data, speak to your institution’s IT team for advice.

  4. Manage your data: Make your backups more effective, and more future-proof, by developing a data-management plan. Establish file-naming conventions and organizational strategies -- for instance, that each project gets its own directory, with dedicated subdirectories for data and code. Decide which data will be backed up, and which can be discarded. Determine where different data types will be backed up, and how often. And, document everything: Keep a copy of your data management plan where people can find it, and annotate your experiments with README files that indicate the experiment, the file structures, required applications or scripts, and so on.

  5. Share with others: Your colleagues may also need access to your data. Can someone else understand what the data are and where they’re located if you’re not around? Make sure they have access and permissions to the data and that they are able to understand what the files are and how they are organized (see Manage your data).

  6. Be realistic: Once you develop a backup strategy, discuss it in the lab. Is it accessible to new colleagues, or only command-line experts? Is it doable after pulling an all-nighter? And how effective is it? Simulate what would happen if disaster should strike: What data will you lose, and what can you recover?

  7. Automate: The thing about backups is, you need them when you least expect it. So don’t rely on remembering to run a backup -- automate them.

  8. Test your backup: Do your backups actually work? Test them to find out. Make sure you can actually open key files, and that you have the required applications to read them. Test them on a different system if possible, as, if your primary computer should fail, you won’t have access to its contents.

  9. Protect your raw data: Always back up your raw data. And keep it safe: duplicate the data before working with it, and open it read-only. An errant file-open command in write mode (e.g., in Python, `f = open(filename, 'w')` instead of read mode `'r'`) is all it takes to wipe out a file for good.

  10. Keep one backup offline: An always-on backup system is a recipe for disaster. If your computer is hacked, or suffers a power-spike, the backup system can also be compromised. So keep at least one backup offline, just in case.

  11. Plan ahead: Storage media don’t last forever, so periodically review your backup strategy to make sure it’s current. Do you still have devices that can read the backups? Is it time to migrate to a newer platform? And don’t neglect your cloud storage: companies can shift priorities, or you can lose your passwords. So double-up on those too, just in case.

  12. Expect the unexpected: Make periodic disaster assessments to safeguard your hardware. If you live in a flood zone, the basement may not be the best place to store a server. If you live in an earthquake-prone area, you might anchor your computers so they don’t fall over. And if your computers are located near fire-control sprinklers, you might raise them off the ground, or shield them from potential leaks.
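To make tip 9 concrete, here is a minimal Python sketch (the file name and contents are invented for illustration) of opening raw data read-only and then dropping the write-permission bit so an accidental write-mode open fails loudly instead of destroying the file:

```python
import os
import stat
import tempfile

# Set up a throwaway "raw data" file for the demonstration
# (the path and contents here are illustrative only).
raw_path = os.path.join(tempfile.mkdtemp(), "raw_data.csv")
with open(raw_path, "w") as f:
    f.write("species,count\nDM,42\n")

# Always open raw data in read mode ('r'); opening with 'w'
# would truncate the file to zero bytes immediately.
with open(raw_path, "r") as f:
    contents = f.read()

# Belt and braces: remove the write-permission bit so an errant
# open(raw_path, 'w') raises PermissionError rather than
# silently wiping out the data.
os.chmod(raw_path, stat.S_IREAD)
```

The same protection can be applied from the shell with `chmod a-w raw_data.csv` on a whole directory of raw files.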

hoytpr commented 5 years ago

Hi @jperkel and thanks. A very nice set of standards! Your effort is very much appreciated. Just VERY minor suggestions if that's okay.

I'm not sure what "bitrot" is exactly (guessing it means deterioration of magnetic or optical bits over time, or it could have a more specific meaning). Should we mention this can be mitigated by storing data media under proper environmental conditions?

With "Protect your raw data" when using your Python scripting example maybe mention: ..."e.g. in Python: (f = open (filename, ‘w’)) instead of read ‘r’ is all it takes..."

What about including MD5 or SHA-1 checksums with data?
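A checksum along those lines can be computed with Python's standard hashlib module. A minimal sketch (the function name is made up; SHA-256 is used here since MD5 and SHA-1 are no longer collision-resistant, though either would still catch accidental corruption):

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks so
    that large data files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest alongside each backed-up file (for example in a CHECKSUMS.txt) lets you recompute and compare after a restore to verify the data survived intact.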

The periodic disaster assessments are a great inclusion. Really complete.

Let me know what you think, and thanks again. Peter

jperkel commented 5 years ago

Hi Peter, thanks, these are good ideas. I've incorporated these suggestions in the pull request I submitted earlier today (#264). What do you think?

Thanks, jeff

hoytpr commented 5 years ago

It looks good to me. @tracykteal

tracykteal commented 5 years ago

Thank you @jperkel and for the comments @hoytpr. Looks good to me!

hoytpr commented 5 years ago

closes #264