datacarpentry / organization-genomics

Project Organization and Management for Genomics
https://datacarpentry.org/organization-genomics

Create backups flowchart #23

Closed ACharbonneau closed 1 year ago

ACharbonneau commented 9 years ago

as in http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf

mkuzak commented 7 years ago

I like this idea a lot. I'm afraid there are too many solutions and approaches; maybe we can come up with something simple. I'm curious if anyone has seen something like that.

Tantoluwa commented 6 years ago

Please simplify the chart; it is cumbersome.

hoytpr commented 6 years ago

@mkuzak @ACharbonneau @Tantoluwa Can you clarify this? Are you asking for a flow chart describing a sequencing data backup system? The chart referenced describes how best to plot your data. Is this a request that we develop a data --> backup data --> archives chart?

We use a fairly simple but effective backup process for archiving data. The data are available for a few weeks to months on an internal cloud server, during which time the data owner can make their own local backups. After that, the data are compressed and moved to duplicate tape archives at separate sites. Is this the kind of chart requested?
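The compress-and-duplicate step described above could be sketched roughly as follows. This is only an illustration, not the actual process used at any facility; the function name, paths, and the choice of SHA-256 for verification are all assumptions.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def archive_run(run_dir: Path, archive_sites: list[Path]) -> None:
    """Compress a sequencing-run directory and copy the tarball to each
    archive site, verifying checksums so the duplicates are identical.
    (Hypothetical helper for illustration only.)"""
    tarball = run_dir.with_suffix(".tar.gz")
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(run_dir, arcname=run_dir.name)
    digest = sha256(tarball)
    for site in archive_sites:
        dest = site / tarball.name
        shutil.copy2(tarball, dest)
        if sha256(dest) != digest:
            raise RuntimeError(f"checksum mismatch at {dest}")
```

The checksum comparison is the important part: a backup you have never verified is not really a backup.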

ACharbonneau commented 6 years ago

@hoytpr I was originally thinking of making something like a decision-tree/flow chart type of thing. I wanted a way to give people an idea of what type of backup system fit their needs, which usually just involves me asking them a bunch of questions that help me narrow down what might work.

So, at the top, maybe your first decision might be "How much data do you have to back up?"

Other questions I had in my head were:

There were probably others, but my 2-year-old recollection of a brainstorming meeting is failing me :)

hoytpr commented 6 years ago

Thanks @ACharbonneau, your 2-year-old memory is better than mine, and your topics are good! This would be part of the original data planning. When submitting samples, the size of the resulting data files can be estimated (length of reads, number of reads, etc.). There are already some "Guidelines for storing data", and the specifics would probably differ between institutions. For my NSF-funded instrument, it's pretty simple: save everything, forever. It would also depend on whether you were using a core facility, an external service, or your lab's own sequencer. The common result of all those options is that you need your OWN PERSONAL copy of the data. Then you can work on it with whatever computer services are available, and archive it wherever you have those options. Accessibility restrictions will also vary between projects and places.
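The size estimate from read length and read count can be done on the back of an envelope. A minimal sketch, assuming uncompressed FASTQ with four lines per record and an average header overhead (the 40-byte figure is an assumption, and gzip typically shrinks the result by roughly 3-4x):

```python
def fastq_size_gb(n_reads: int, read_len: int, header_bytes: int = 40,
                  paired: bool = False) -> float:
    """Rough uncompressed FASTQ size in gigabytes.

    Each record is four lines: header, sequence, '+', quality.
    Sequence and quality each take read_len bytes; header_bytes is an
    assumed average for the header and separator lines combined.
    """
    per_read = 2 * read_len + header_bytes + 4  # +4 for '+' and newlines
    total_bytes = n_reads * per_read * (2 if paired else 1)
    return total_bytes / 1e9


# e.g. 400 million paired-end 150 bp reads:
# fastq_size_gb(400_000_000, 150, paired=True)  -> about 275 GB uncompressed
```

Even a crude estimate like this tells you up front whether a run fits on a laptop, needs lab storage, or has to go straight to institutional archives.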

My opinion is that a more detailed workshop lesson here about planning would complement the genomics wrangling, cloud genomics, and genomics workshops. Your storage topic would be good to include, with references to more storage details in the Cloud Genomics, Data Wrangling, and HPC lessons. There is so much that goes into planning that many take for granted.

But if we go into great detail about storage options, it would be unbalanced relative to other planning details like "What's the project?" "Will you make your own libraries?" "How much coverage do you need?" "What are the desired questions you want resolved?" "Will you have a lot of samples, or a few?" "How many milligrams of each sample can be produced?" "Are these metagenomic samples?" ... These are all parts of planning and organizing sequencing projects. So maybe that level of detail will have to wait until this lesson is expanded. @mkuzak @Roselynlemusinmegen @analeighgui @raynamharris

ACharbonneau commented 6 years ago

Right. I'm honestly not sure where "organization" falls in the lessons right now, but originally this was going to be one of the first things we talked about in genomics, as part of a big "actually plan your bioinformatics like a wet lab experiment" talk/soapbox. I never intended it to be a thing we spent a lot of time on in class, but rather a thing you could reference and let people go back to. We have a similar sort of thing in the cloud lesson: https://datacarpentry.org/cloud-genomics/04-which-cloud/index.html

I still think it would be a nice thing to have this 'decision tree/things to think about/whatever it is' as a link in one of the planning lessons. Even if it didn't get covered explicitly in most workshops, it would be useful for people coming back and reviewing. But obviously it's not a super high priority :)

JasonJWilliamsNY commented 5 years ago

Arizona BugBBQ: This information is useful but should not take up major real estate in the lesson. Probably a link or two in the lesson would be enough.

Tantoluwa commented 5 years ago

That would be great

hoytpr commented 1 year ago

@JCSzamosi and @ACharbonneau et al. This is a stale issue, but it still has important points. I've recovered 3- to 5-year-old data for people. But the most important point of the issue is to make multiple backups of your raw data; the learners don't need to know how to operate an archival system. This is emphasized sufficiently in the genomics lessons. Please reopen if you disagree.

Peter

JCSzamosi commented 1 year ago

Is this something that should be referred to the curriculum committee?

hoytpr commented 1 year ago

Speaking for myself, I don't think so. There is enough emphasis on data protection (keeping raw data raw, protecting data through permissions, making backups) that it should be clear.