Hi Gunnar,
Thanks for the heads up. Unfortunately, I need all the cancer exomes for my analysis. I'd be happy to go without the data and wait a couple of weeks for this issue to be resolved. However, if it's going to take 1-2 months, I'm wondering if we can figure out an alternative plan. Would transferring only the whole genomes free up enough space? Thanks.
On Fri, Sep 19, 2014 at 4:02 PM, Gunnar Ratsch notifications@github.com wrote:
In order to deal with the increased drive-failure situation, we are planning to transfer the cancer exomes from UCEC, THCA, COAD, GBM, LUAD, KIRC, BRCA and OV, and the cancer whole genomes from LUAD, LUSC, THCA and HNSC, to a location that will not be accessible from the cluster. The rationale is that we need to free up space so we can fully replicate the file system and reduce the risk of data loss. Please let me know if you specifically need this data and we will try to figure out an alternative plan.
The hope is that we can get back to normal in 2-3 weeks, but this is contingent on a quick response from Dell. We have started copying off these files. I suggest that we go ahead now (since we have to act quickly) and adapt the plan in case there is a delay.
That sounds good to me. Thanks.
Is there any update on when the data will be returned to the cluster?
We are still in the drive remediation phase, which will take a while. I'll have a better estimate of what "a while" really means by the end of the week. We are migrating entire LUNs from the suspect WD drives to new arrays with replacement, non-suspect drives, but the rate varies with regular cluster load and our settings.
However, after that, a core decision about unreplicated data needs to be made, and we are still discussing that with the PIs. That decision, once we are off these drives, will be the primary item that answers your question.
Sorry for the "non-answer", but I wanted you to know I saw the question and that I don't have a solid ETA.
@tatarsky -- do we have a clearer answer on this question yet?
No
I will make a GUESS when I finish 4-11. I'm not in any way committing to that GUESS being correct, because the times vary from run to run depending on cluster I/O.
Our guess is end of the month for phase 1 to be complete, bearing in mind the time per LUN has been quite variable depending on other load. Any reduction in consumed disk space speeds the process.
We are in discussions about how to balance the second phase (bringing failure group 1 onto all safe drives), expansion of the filesystem, and/or how to deal with non-replicated data in a safe manner.
However, when these last five LUNs are done, we are confident that at least all replicated data will be solidly on one failure group with no suspect drives.
At the risk of sounding stupid, could I ask for a little exposition on what the difference between phases 1 and 2 is? Does phase 1 being complete correspond to some of the data becoming accessible again?
Phase one complete means replicated data is unlikely to be lost in the event of continued bulk failures of the bad drives. Having that allows us to discuss or speed up the process of either expanding the filesystem or risking some un-replication as a way to free space to bring back some of the data we removed in order to get all files replicated (and thus safe).
However, until all Western Digital drives are removed (Phase 2), there is a statistical chance that a LUN failure, caused by three drives in the same LUN failing, would result in UNREPLICATED data loss. But we believe we can do phase 2 more quickly because we do not have to worry about bulk failures in at least five of the arrays.
We have no unreplicated data on the filesystem at this time, which is why it's very full.
We prefer to operate with only replicated data, but your cost and space requirements may require you to unreplicate some items. Normally, losing a LUN is a very rare event, but with the defective drives we don't feel it's that unlikely.
So phase 1 being complete allows discussion of how best to proceed without worrying every day about basically losing all the data.
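For anyone curious what "replicated" means at the filesystem level, here is a minimal sketch of how replication can be checked and adjusted on a GPFS filesystem using the standard `mm*` administration commands. The filesystem name and file path below are placeholders, not actual cluster paths, so treat this as illustrative rather than the exact procedure we run.

```bash
# Show the filesystem's default data (-r) and metadata (-m) replication
# factors ("gpfs_cbio" is a placeholder device name).
mmlsfs gpfs_cbio -r -m

# Show replication for a single file; a line like "data replication: 2 max 2"
# means two copies exist, one per failure group.
mmlsattr -L /cbio/shared/example.bam

# The "un-replication" mentioned above would drop a file back to a single
# copy to free space, at the cost of the safety margin:
mmchattr -r 1 /cbio/shared/example.bam
```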
Thanks @tatarsky, that was very clear.
Sure. It's a complex operation at the moment just to get us out of daily failure fear. The day we had 8 drives go in less than 10 hours was basically as close as I want to get to LUN loss.
Phase one complete. Phase two in progress for last few days. ETA shortly.
Yeah!
Hooray!
And there was much rejoicing... I am working on the fastest method for FG1 migration as I type.
One array of the three in failure group one is now done. The next steps are physical in nature and involve drive replacements; work will likely begin on that tomorrow. Once that is done, migration of the last two arrays will begin.
There is NO migration I/O until that step completes.
We are close to starting Phase 2, which is the migration of the last parts of Failure Group 1 (FG1) that are still on Western Digital drives. As a special bonus, we are actually already 33% done with this process, because one array was not shipped with Western Digital drives in the first place.
You will see a bump in apparent disk space, because the method we use (for speed and lower impact) requires that we add the new LUNs and then migrate the old ones onto them in parallel. Please do not take that as an opportunity to add considerable amounts of data (you will still hit the remaining space limit in FG2 and be sad).
A graph of the per-failure-group space is available here:
If in doubt, please simply ask us; we will provide information about the remaining space per failure group and advise you on what level of use is safe. Thank you for your patience.
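For reference, a rough sketch of how per-failure-group numbers can be pulled from GPFS directly. The filesystem name is a placeholder, and the awk column indices depend on the `mmdf` output format of your GPFS release, so adjust before relying on it.

```bash
# List every disk with its size, failure group and free space.
mmdf gpfs_cbio

# Rough aggregation of free space (KB, column 6 in many releases) by
# failure group (column 3); tweak the $N indices to match your header.
mmdf gpfs_cbio | awk '$2 ~ /^[0-9]+$/ && $3 ~ /^-?[0-9]+$/ { free[$3] += $6 }
                      END { for (g in free) printf "FG %s: %.1f TiB free\n", g, free[g]/2^30 }'
```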
Phase 2 is 50% done. We are pushing it a bit this week, so if you feel performance is slow, please comment and we will adjust the settings. Our goal for this remaining portion is to get more data migrated during regular business hours, when we have coverage for drive swaps.
We are on the last three LUNs to be migrated, representing 30 drives. The speed of this last migration is primarily a balance between the data being copied to the new drives by GPFS routines and regular GPFS I/O.
It is proceeding fairly well. I am estimating Sunday or Monday completion. Reduced competing I/O will shorten that estimate. Once finished, the restore of removed items will begin but without the fear of shedding drives. Again, thank you for your patience during the process.
To hopefully make your Friday a little nicer...
Thirty minutes ago, the last block of data left the arrays containing Western Digital drives.
The migration is over. ~300 suspect bad-batch drives have been rotated out of the GPFS filesystem.
Additional steps are now being formulated to return data to the filesystem in an orderly way. Cheers.
Hooray @tatarsky! Any chance we can get an ETA (or a pseudo-order-of-magnitude ETA) on the return of the data?
Thanks @tatarsky for all your efforts!
You are all welcome. A precise ETA is difficult, but the steps are now basically:
1. Unreplicate to free some space (requires coordination with data owners).
2. rsync the relocated items back from several locations, in some order of priority.
These steps are being worked on as I type.
The ETA is mostly limited by the speed of the rsync, which has been pretty good.
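To give a sense of what step 2 involves, here is a sketch of the kind of rsync invocation being run. The source host, paths and study order are placeholders (the actual ones were never posted in this thread).

```bash
# Resume-friendly copy of the relocated items back toward the cluster,
# preserving permissions, ownership and hard links. Re-running the same
# command safely resumes after a stall.
for study in BRCA LUAD KIRC; do            # hypothetical priority order
    rsync -aHP \
        overflow-host:/relocated/tcga/"$study"/ \
        /cbio/shared/overflow/"$study"/
done
```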
Any update on restoration of the data?
While I am not directly involved in the process, for the past several days I have been watching rsyncs back from the CBIO units the data was moved to. I believe the data is being placed in a staging area for validation before being returned to its original path. But again, I'm not handling that part; I'm commenting so you know this much, and perhaps those doing the transfer can elaborate.
@ratsch ?
The data is currently being restored to /cbio/shared/overflow. About 75% of the data is back. Once the transfer is complete, we will reintegrate it into the main tree.
If you find data there, feel free to use it after checking that the files are complete. Please note, however, that this file location is temporary.
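One way to do that completeness check for the BAM files (a sketch only; it assumes a samtools build that includes the `quickcheck` subcommand, and the path is the temporary location mentioned above):

```bash
# quickcheck verifies the BAM header and the BGZF end-of-file marker,
# which a truncated transfer typically lacks; incomplete files are listed.
find /cbio/shared/overflow -name '*.bam' -print0 |
    while IFS= read -r -d '' bam; do
        samtools quickcheck "$bam" || echo "INCOMPLETE: $bam"
    done
```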
We believe the data rsync is complete, but validation needs to be done before the data is moved back into place. I am placing this here just to note the status I was told yesterday.
Just wondering if there is any update on the status of the restoration process. It seems to me that most of the cancer exomes in the shared_bam folder are back in place, except for OV and UCEC. I am wondering if those files are yet to be restored or were permanently lost due to disk failure.
No data I know of was ever permanently lost, thanks to replication. I am under the impression that @ratsch continues to rsync another collection of items from another location. I see directories with the names "OV" and "UCEC" in the area I am monitoring, but I will defer to him on the state of the rsync.
Thanks for the clarification. The cancer exome files are in the "WXS" directory under each cancer study. The "OV" and "UCEC" studies currently don't have the "WXS" directory.
I assume that's because the rsync is not done. And since I am not doing the rsync, we'll have to see if an update is provided.
The data is currently being transferred to the directory /cbio/shared/overflow/plfah2. The transfers have repeatedly stalled, and this has delayed getting the data back onto the system. There are about 55 TB still on that system (space graciously provided by the chodera lab), and it will likely take another week to get the data back.
Thanks for the update
It seems like the UCEC and OV WXS files are back, but the permissions on the OV data are set to the "raetsch" group. Any chance someone could kindly modify the permissions so the data is readable by the TCGA group?
@ratsch, confirm you are indeed done and ready for that modification, and I can make it.
Corrected permissions per the conversation in email. Advise if you need further changes!
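For the record, the change amounts to something like the following. The group name and path are stand-ins, since the exact ones were settled by email.

```bash
# Hand the OV data to the TCGA unix group and make it group-readable,
# with execute only on directories (capital X) so they stay traversable.
chgrp -R tcga /cbio/shared/shared_bam/OV
chmod -R g+rX /cbio/shared/shared_bam/OV
```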
I believe this is completed. Can you confirm, @ratsch? It would be pleasant to end this tale of woe.
Endless disk replacements (more than 400 disks) and the copying of >500 TB of data back and forth across different systems are now finished. No data was lost, and we are back to normal operations. Finally! Snapshots are operational, but not yet on xxlab/projects and xxlab/share.
Thanks to everybody involved, in particular to @tatarsky, @knospler and @bubble1975, for their heroic efforts to make this happen without losing any data!!!
Happy crunching!
And with that, I will thank @ratsch for the thanks, say I appreciated everyone's patience, and close this Git issue.