cBio / cbio-cluster

MSKCC cBio cluster documentation

Data that will be parked outside of hal for 1-2 months. Please comment. #116

Closed: ratsch closed this issue 9 years ago

ratsch commented 10 years ago

To deal with the increased rate of drive failures, we are planning to transfer the cancer exomes from UCEC, THCA, COAD, GBM, LUAD, KIRC, BRCA, and OV, and the cancer whole genomes from LUAD, LUSC, THCA, and HNSC, to a location that will not be accessible from the cluster. The rationale is that we need to free up space so we can fully replicate the file system and thereby reduce the risk of data loss. Please let me know if you specifically need this data and we will try to figure out an alternative plan.

rj67 commented 10 years ago

Hi Gunnar,

Thanks for the heads-up. Unfortunately, I need all the cancer exomes for my analysis. I'd be happy to go without the data and wait a couple of weeks for this issue to be resolved. However, if it's going to take 1-2 months, I wonder if we can figure out an alternative plan. Would transferring only the whole genomes free up enough space? Thanks.


ratsch commented 10 years ago

The hope is that we can get back to normal in 2-3 weeks, but this is conditioned on a quick response from Dell. We have started copying off these files. I suggest we go ahead now (since we have to act quickly) and adapt the plan if there is a delay.


rj67 commented 10 years ago

That sounds good to me. Thanks.


ereznik commented 10 years ago

Is there any update on when the data will be returned to the cluster?

tatarsky commented 10 years ago

We are still in the drive remediation phase, which will take a while. I'll have a better estimate of what "a while" really means by the end of the week. We are migrating entire LUNs from the suspect WD drives to new arrays with replacement, non-suspect drives, but the rate varies with regular cluster load and our settings.

However, following that, a core decision about unreplicated data needs to be made, and we are still discussing that with the PIs. That decision will be the primary item once we are off these drives, and it is what will answer your question.

tatarsky commented 10 years ago

Sorry for the "non-answer" but I wanted you to know I saw it and that I don't have a solid ETA.

rgejman commented 10 years ago

@tatarsky -- do we have a clearer answer on this question yet?

jchodera commented 10 years ago

No

tatarsky commented 10 years ago

I will make a GUESS when I finish 4-11. I'm not in any way committing to that GUESS being correct, because the run times vary each time depending on cluster I/O.

tatarsky commented 10 years ago

Our guess is that phase 1 will be complete by the end of the month, bearing in mind that the time per LUN has been quite variable depending on other load. Any reduction in consumed disk space speeds up the process.

We are discussing how to balance the second phase (bringing failure group 1 onto all-safe drives), expansion of the filesystem, and/or how to deal with non-replicated data in a safe manner.

However, when these last five LUNs are done, we feel that at least all replicated data will be solidly on one failure group with no suspect drives.

ereznik commented 10 years ago

At the risk of sounding stupid, could I ask for a little exposition on what the difference between phases 1 and 2 is? Does phase 1 being complete correspond to some of the data becoming accessible again?

tatarsky commented 10 years ago

Phase one being complete means that replicated data is unlikely to be lost in the event of continued bulk failures of the bad drives. Having that allows us to discuss, or speed up, the process of either expanding the filesystem or risking some un-replication as a way to free space, so we can bring back the data we removed in order to get all files replicated (and thus safe).

However, until all Western Digital drives are removed (phase 2), there is a statistical chance that a LUN failure, caused by three drives in the same LUN failing, will result in UNREPLICATED data loss. But we believe we can do phase 2 more quickly, because we no longer have to worry about bulk failures in at least five of the arrays.

We have no unreplicated data on the filesystem at this time, which is why it's very full.

We prefer to operate with only replicated data, but your cost and space requirements may require you to unreplicate some items. Normally, the loss of a LUN is a very rare event; with the defective drives, though, we don't feel it's that unlikely.

So phase 1 being complete allows us to discuss how best to proceed without worrying every day about basically losing all the data.
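
For anyone who wants to check for themselves whether a particular file is replicated, a minimal sketch is below. It assumes GPFS's `mmlsattr` utility is on the PATH and that you can read the file in question; the exact output format varies by GPFS version, so the script simply surfaces what the tool reports rather than parsing it (changing a file's replication with `mmchattr` is an admin/data-owner decision).

```python
#!/usr/bin/env python
"""Minimal sketch: surface the GPFS replication factors reported for given paths.

Assumes the GPFS `mmlsattr` utility is on PATH; output format varies by GPFS
version, so we print what the tool reports rather than parsing it strictly.
"""
import subprocess
import sys


def report_replication(path):
    # `mmlsattr <file>` prints the metadata and data replication factors of a file.
    try:
        result = subprocess.run(["mmlsattr", path],
                                capture_output=True, text=True, check=True)
        print(result.stdout.strip())
    except (OSError, subprocess.CalledProcessError) as err:
        print("could not query %s: %s" % (path, err), file=sys.stderr)


if __name__ == "__main__":
    for p in sys.argv[1:]:
        report_replication(p)
```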

ereznik commented 10 years ago

Thanks @tatarsky that was very clear.

tatarsky commented 10 years ago

Sure. It's a complex operation just to get us out of the daily failure fear. The day we had 8 drives go in less than 10 hours was about as close as I ever want to get to LUN loss.

tatarsky commented 10 years ago

Phase one is complete. Phase two has been in progress for the last few days. ETA shortly.

ratsch commented 10 years ago

Yeah!


jchodera commented 10 years ago

Hooray!

tatarsky commented 10 years ago

And there was much rejoicing... I am working on the fastest method for FG1 migration as I type.

tatarsky commented 10 years ago

One array of the three in failure group one is now done. The next steps are physical in nature and involve drive replacements; work on that will likely begin tomorrow. Once that is done, migration of the last two arrays will begin.

There is NO migration I/O until that step completes.

tatarsky commented 9 years ago

We are close to starting Phase 2, which is the migration off the last parts of Failure Group 1 (FG1) that are still on Western Digital drives. As a special bonus, we are actually already 33% done with this process, thanks to an array that was not shipped with Western Digital drives in the first place.

You will see a bump in apparent disk space, because the method we use for a faster, lower-impact migration requires us to add the new LUNs first and then migrate the old ones to them in parallel. Please do not take that as an opportunity to add a considerable amount of data (you will still hit the remaining space limit in FG2 and be sad).

A graph of the per-failure-group space is available here:

http://goo.gl/BLfVbZ

If in doubt, please simply ask us; we will tell you the remaining space per failure group and advise you on what level of use is safe. Thank you for your patience.
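
If you just want a rough local check before writing large data, the sketch below reports overall usage of the mount point. Note that it only sees whole-filesystem numbers, so the per-failure-group breakdown still has to come from the graph above or from the admins; the mount point and the 90% threshold are assumptions for illustration.

```python
#!/usr/bin/env python
"""Rough sketch: warn before writing large data onto an already-full filesystem.

The mount point and the threshold are assumptions for illustration; this sees
only whole-filesystem usage, not the per-failure-group breakdown.
"""
import shutil

MOUNT_POINT = "/cbio"      # hypothetical GPFS mount point
WARN_FRACTION = 0.90       # warn when more than 90% of space is used

usage = shutil.disk_usage(MOUNT_POINT)
used_fraction = usage.used / usage.total
print("used %.1f%% of %.1f TB" % (100 * used_fraction, usage.total / 1e12))
if used_fraction > WARN_FRACTION:
    print("filesystem is nearly full; check with the admins before adding data")
```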

tatarsky commented 9 years ago

Phase 2 is 50% done. We are pushing it a bit this week, so if you feel performance is slow, please comment and we will adjust the settings. Our goal for this remaining portion is to migrate more data during regular business hours, when we have coverage for drive swaps.

tatarsky commented 9 years ago

We are on the last three LUNs to be migrated, representing 30 drives. The speed of this last migration is primarily a balance between the data being copied to the new drives by GPFS routines and regular GPFS I/O.

It is proceeding fairly well. I am estimating completion on Sunday or Monday; reduced competing I/O would shorten that estimate. Once finished, the restore of the removed items will begin, this time without the fear of shedding drives. Again, thank you for your patience during the process.
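
For context on what "copied by GPFS routines" can mean, here is one generic GPFS way to drain data off a set of disks: suspend them, then restripe. This is an illustration only, not necessarily the exact procedure used on this cluster; the filesystem device and disk names are hypothetical, and only an administrator could run these commands.

```python
#!/usr/bin/env python
"""Illustrative sketch only: one generic GPFS way to drain data off old disks.

The filesystem device and disk names are hypothetical; this is NOT necessarily
the procedure used on this cluster, and only an administrator could run it.
"""
import subprocess

FILESYSTEM = "gpfs_cbio"                  # hypothetical GPFS device name
OLD_DISKS = "wd_lun01;wd_lun02;wd_lun03"  # hypothetical disks to be emptied

# Suspend the old disks so GPFS stops allocating new blocks on them...
subprocess.run(["mmchdisk", FILESYSTEM, "suspend", "-d", OLD_DISKS], check=True)
# ...then migrate existing data off the suspended disks onto the remaining ones.
subprocess.run(["mmrestripefs", FILESYSTEM, "-m"], check=True)
```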

tatarsky commented 9 years ago

To hopefully make your Friday a little nicer....

Thirty minutes ago, the last block of data left the arrays containing the Western Digital drives.

The migration is over. ~300 suspect bad-batch drives have been rotated out of the GPFS filesystem.

Additional steps are now being formulated to return data to the filesystem in an orderly way. Cheers.

ereznik commented 9 years ago

Hooray, @tatarsky! Any chance we can get an ETA (or a pseudo-order-of-magnitude ETA) on the return of the data?

akahles commented 9 years ago

Thanks @tatarsky for all your efforts!

tatarsky commented 9 years ago

You are all welcome. A precise ETA is difficult, but the steps are now basically:

1. Unreplicate to free some space (requires coordination with data owners).
2. rsync the relocated items back from several locations, in some order of priority.

These steps are being worked on as I type.

The ETA is mostly limited by the speed of the rsync, which has been pretty good so far.
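
As a rough illustration of that rsync step, a hedged sketch is below. The source host and path are hypothetical placeholders (the thread does not give them); the destination is the overflow staging area mentioned later in this issue. The flags shown favor a resumable, auditable copy rather than raw speed.

```python
#!/usr/bin/env python
"""Sketch of a resumable restore copy, assuming plain rsync over ssh.

The source host and path are hypothetical placeholders; the actual restore used
whatever sources and priorities the admins chose.
"""
import subprocess

SOURCE = "storagehost:/offsite/parked_data/"   # hypothetical parked location
DEST = "/cbio/shared/overflow/"                # staging area named in this thread

# -a preserves permissions/times, --partial lets an interrupted copy resume,
# and --itemize-changes logs exactly what was transferred.
cmd = ["rsync", "-a", "--partial", "--itemize-changes", SOURCE, DEST]
subprocess.run(cmd, check=True)
```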

ereznik commented 9 years ago

Any update on restoration of the data?

tatarsky commented 9 years ago

While I am not directly involved in the process, for the past several days I have been watching rsyncs back from the CBIO units the data was moved to. I believe the data is being placed in a staging area for validation before being returned to its original path. But again, I'm not handling that part, so I'm commenting just so you know this much; perhaps those doing the transfer can elaborate.

jchodera commented 9 years ago

@ratsch ?

ratsch commented 9 years ago

The data is currently being restored to /cbio/shared/overflow. About 75% of the data is back. Once the transfer is complete, we will reintegrate it into the main tree.

If you find data there, feel free to use it after checking that the files are complete. Please note, however, that the file location is temporary.
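
One simple way to check completeness, assuming a checksum manifest was kept for the parked files (the manifest name below is a hypothetical example in the usual `md5sum` format), is to re-hash the restored copies and compare:

```python
#!/usr/bin/env python
"""Sketch: verify restored files against an MD5 manifest.

Assumes a manifest in `md5sum` format ("<hex>  <relative path>"); the manifest
path below is a hypothetical example.
"""
import hashlib
import os

STAGING = "/cbio/shared/overflow"
MANIFEST = os.path.join(STAGING, "checksums.md5")   # hypothetical manifest


def md5_of(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


with open(MANIFEST) as fh:
    for line in fh:
        expected, rel = line.strip().split(None, 1)
        path = os.path.join(STAGING, rel)
        if not os.path.exists(path):
            print("MISSING  %s" % rel)
        elif md5_of(path) != expected:
            print("BAD      %s" % rel)
```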


tatarsky commented 9 years ago

We believe the data rsync is complete, but validation needs to be done before it is moved back into place. I am noting this here to record the status I was told yesterday.

rj67 commented 9 years ago

Just wondering if there is any update on the status of the restoration process. It seems to me that most of the cancer exomes in the shared_bam folder are back in place, except for OV and UCEC. I am wondering whether those files are yet to be restored or were permanently lost to disk failure.

tatarsky commented 9 years ago

No data that I know of was ever permanently lost, thanks to replication. I am under the impression that @ratsch continues to rsync another collection of items from another location. I see directories named "OV" and "UCEC" in the area I am monitoring, but I will defer to him on the state of the rsync.

rj67 commented 9 years ago

Thanks for the clarification. The cancer exome files are in the "WXS" directory under each cancer study. The "OV" and "UCEC" studies currently don't have the "WXS" directory.
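
For anyone tracking the same thing, a small sketch that lists studies still missing a WXS subdirectory is below; the shared_bam root path is a hypothetical placeholder.

```python
#!/usr/bin/env python
"""Sketch: list cancer studies whose WXS (exome) subdirectory is not back yet.

SHARED_BAM is a hypothetical placeholder for the shared_bam root mentioned above.
"""
import os

SHARED_BAM = "/cbio/shared/shared_bam"   # hypothetical path

for study in sorted(os.listdir(SHARED_BAM)):
    study_dir = os.path.join(SHARED_BAM, study)
    if os.path.isdir(study_dir) and not os.path.isdir(os.path.join(study_dir, "WXS")):
        print("missing WXS:", study)
```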

tatarsky commented 9 years ago

I assume that's because the rsync is not done. And since I am not the one doing the rsync, we'll have to see if an update is provided.

ratsch commented 9 years ago

The data is currently being transferred to the directory /cbio/shared/overflow/plfah2. The transfers have repeatedly stalled, which has delayed getting the data back onto the system. About 55 TB are still on the other system (space graciously provided by the Chodera lab), and it will likely take another week to get the data back.


rj67 commented 9 years ago

Thanks for the update

ereznik commented 9 years ago

It seems like the UCEC and OV WXS files are back, but the OV data's permissions are under the "raetsch" group. Any chance someone could kindly modify the permissions so that the data is readable by the TCGA group?
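
For reference, the change being requested here amounts to something like the following sketch, using standard chgrp/chmod. The OV path is a hypothetical placeholder and the group name is assumed from the comment above; in practice only the data owner or an admin can apply it.

```python
#!/usr/bin/env python
"""Sketch of the requested permission change using standard chgrp/chmod.

OV_PATH and GROUP are assumptions; only the data owner or an admin can
actually apply this change.
"""
import subprocess

OV_PATH = "/cbio/shared/shared_bam/OV/WXS"   # hypothetical location of the OV exomes
GROUP = "tcga"                               # assumed name of the TCGA group

# Reassign group ownership, then grant the group read access
# (and execute on directories so they can be traversed).
subprocess.run(["chgrp", "-R", GROUP, OV_PATH], check=True)
subprocess.run(["chmod", "-R", "g+rX", OV_PATH], check=True)
```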

tatarsky commented 9 years ago

@ratsch, please confirm that you are indeed done and ready for that modification, and I can get it done.

tatarsky commented 9 years ago

Corrected permission per conversation in email. Advise if you need further changes!

tatarsky commented 9 years ago

I believe this is completed. Can you confirm, @ratsch? It would be pleasant to end this tale of woe.

ratsch commented 9 years ago

The endless disk replacements (more than 400 disks) and the copying of >500 TB of data back and forth across different systems are now finished. No data was lost, and we are back to normal operations. Finally! Snapshots are operational, but not yet on xxlab/projects and xxlab/share.

Thanks to everybody involved, in particular to @tatarsky, @knospler, and @bubble1975, for their heroic efforts to make this happen without losing any data!!!

Happy crunching!


tatarsky commented 9 years ago

And with that, I will thank @ratsch for the thanks, say that I appreciated everyone's patience, and close this issue.