Closed ACharbonneau closed 1 year ago
I like this idea a lot. I'm afraid there are too many solutions and approaches, but maybe we can come up with something simple. I'm curious whether anyone has seen something like that.
Please simplify the chart; it is cumbersome.
@mkuzak @ACharbonneau @Tantoluwa Can you clarify this? Are you asking for a flow chart describing a sequencing data backup system? The chart referenced describes how best to plot your data. Is the request that we develop a data --> backup data --> archives chart?
We use a fairly simple but effective backup process for archiving data. The data are available for a few weeks to months on an internal cloud server, during which the data owner can make their own local backups. After that, the data are all compressed and moved to duplicate tape archives at separate sites. Is this the kind of chart requested?
@hoytpr I was originally thinking of making something like a decision-tree/flow chart type of thing. I wanted a way to give people an idea of what type of backup system fit their needs, which usually just involves me asking them a bunch of questions that help me narrow down what might work.
So, at the top maybe your first decision might be "How much data do you have to back up"?
Other questions I had in my head were:
There were probably others, but my 2 year old recollection of a brainstorming meeting is failing me :)
Thanks @ACharbonneau, your 2-year-old memory is better than mine, and your topics are good! This would be part of the original data planning. When submitting samples, the size of the resulting data files can be estimated (length of reads, number of reads, etc.). There are already some "Guidelines for storing data" and the specifics would probably be different at different institutions. For my NSF-funded instrument, it's pretty simple: save everything, forever. It would also depend on whether you were using a core facility, external service, or your lab had their own sequencer. The common result of all those options is that you need your OWN PERSONAL copy of the data. Then you can work on it with whatever computer services are available, and archive it wherever you have those options. Accessibility restrictions will also vary between projects and places.
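To make the size estimate mentioned above concrete, here is a rough sketch of how it might be done from the number and length of reads. It assumes uncompressed, single-end FASTQ output and a header line of about 60 bytes per read; the function names and the 60-byte default are made up for illustration and are not taken from any lesson or tool.

```python
def estimate_fastq_bytes(num_reads, read_length, header_bytes=60):
    """Rough uncompressed FASTQ size in bytes for single-end reads.

    header_bytes is an assumed average header length, not a measured value.
    """
    # One FASTQ record is four newline-terminated lines:
    # header, sequence, a '+' separator, and one quality character per base.
    per_record = header_bytes + read_length + 1 + read_length + 4
    return num_reads * per_record


def human_readable(num_bytes):
    """Format a byte count using decimal (1000-based) units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


# Example: 400 million reads of 150 bases each.
print(human_readable(estimate_fastq_bytes(400_000_000, 150)))
```

Compressed files will be considerably smaller, and paired-end runs roughly double the total, so treat this as an upper-bound planning number rather than an exact figure.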
My opinion is that a more detailed workshop lesson here about planning would complement the genomics wrangling, cloud genomics, and genomics workshops. Your storage topic would be good to include, and it could then point to more storage details in the Cloud Genomics, Data Wrangling, and HPC lessons. There is so much that goes into planning that many take for granted.
But if we go into great detail about storage options, it would be unbalanced relative to other planning details like "What's the project?" "Will you make your own libraries?" "How much coverage do you need?" "What questions do you want resolved?" "Will you have a lot of samples, or a few?" "How many milligrams of each sample can be produced?" "Are these metagenomic samples?" ... These are all parts of planning and organizing sequencing projects. So maybe that level of detail will have to wait until this lesson is expanded. @mkuzak @Roselynlemusinmegen @analeighgui @raynamharris
Right. I'm honestly not sure where "organization" falls in the lessons right now, but originally this was going to be one of the first things we talked about in genomics, as part of a big "actually plan your bioinformatics like a wet lab experiment" talk/soapbox. I never intended it to be a thing we spent a lot of time on in class, but rather a thing you could reference and let people go back to. We have a similar sort of thing in the cloud lesson: https://datacarpentry.org/cloud-genomics/04-which-cloud/index.html
I still think it would be a nice thing to have this 'decision tree/things to think about/whatever it is' as a link in one of the planning lessons. Even if it didn't get covered explicitly in most workshops, it would be useful for people coming back and reviewing. But obviously it's not a super high priority :)
Arizona BugBBQ: This information is useful but should not take up major real estate in the lesson. A link or two in the lesson would probably be enough.
That would be great
@JCSzamosi and @ACharbonneau et al. This is a stale issue, but it still has important points. I've recovered data for people that was 3-5 years old. But the most important part of the issue is to make multiple backups of your raw data, and that is emphasized sufficiently in the genomics lessons. The learners don't need to know how to operate an archival system. Please reopen if you disagree.
Peter
Is this something that should be referred to the curriculum committee?
Speaking for myself, I don't think so. There is enough emphasis on data protection (keep raw data raw, protect data through permissions, make backups) that it should be clear.
as in http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf