cBio / cbio-cluster

MSKCC cBio cluster documentation

Disappearance of TCGA BAMs? #346

Closed ereznik closed 8 years ago

ereznik commented 8 years ago

Hi all,

I just noticed that some of the TCGA bam files previously located in

/cbio/shared/data/tcga/seq

are no longer there, e.g. all the BLCA sequencing data. Does anyone know what happened?

tatarsky commented 8 years ago

I would not know if the act was done by someone on purpose but can check snapshots if we believe there has been an accidental deletion. I will wait however in case somebody has intentionally re-organized or something.

tatarsky commented 8 years ago

BLCA data appears in snapshots preceding today. It appears to have been removed a few days ago. Snapshot rotation stopped.

I am holding here for comment if this was intentional.

akahles commented 8 years ago

I am not aware of any larger intended deletion. However, let me investigate a little more - half of our lab is currently traveling. @tatarsky: Do you have a more specific date on when the data stopped appearing in the snapshots?

tatarsky commented 8 years ago

They are there in the 25th (Wednesday) snapshot. They are not there in the 26th (Thursday) snapshot. Snapshots fire at 4:00 AM, IIRC.

akahles commented 8 years ago

Thanks for the update. I will take that information and check on our side.

tatarsky commented 8 years ago

BTW, this is 20TB of data. The two snapshot copies I have appear to contain the same contents. To be safe, we would likely bring it back with an rsync checksum copy and then compare it again before I expire the snapshots. During that time we'd be a bit lower than I'd like on free space, so knowing whether this was moved or deleted would be very useful before we proceed. I have no validation options beyond that, so I leave that to the folks who downloaded it originally.

tatarsky commented 8 years ago

Restoring it per the above; we can't locate a deliberate reason for the deletion. I'm going to suggest that the group write bits be removed from this tree. I'm not sure why they would be needed, and removing them would reduce the damage if this was accidental or scripted.
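Stripping the group write bits as suggested would be a recursive chmod along these lines (demonstrated on a temp tree standing in for the real data directory; the path and layout are illustrative only):

```shell
# Temp tree standing in for the shared TCGA directory.
TREE=$(mktemp -d)
mkdir -p "$TREE/BLCA"
printf 'x\n' > "$TREE/BLCA/reads.bam"
chmod -R g+w "$TREE"        # starting state: group-writable

# Remove group and other write bits across the whole tree, so a stray
# recursive rm by a non-owner fails instead of deleting data.
chmod -R g-w,o-w "$TREE"
```

The owner keeps write access, so maintenance by the data owner is unaffected.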

akahles commented 8 years ago

I agree with @tatarsky. We can have a smaller, more controlled tcgawrite group.

tatarsky commented 8 years ago

My initial analysis suggests a mistyped "rm", probably involving "B", a pattern, and recursion. This is based on filename diffs between the snapshots and the current tree. Besides BLCA, I show that parts (not all) of BRCA are gone from the current tree.

tatarsky commented 8 years ago

rsync -i analysis shows the same. Restoring BLCA first, but with "-restored" appended to the name. Please do not alter it.

tatarsky commented 8 years ago

BLCA is almost back. Then I will do another comparison before moving it back to its actual directory name.

Then I will start the BRCA merge.

tatarsky commented 8 years ago

The BLCA rsync is done. Now doing a validation pass, which will take a while. If you are being held up severely, I can do that with the files in their final location.

tatarsky commented 8 years ago

Is anyone actively in the BRCA directory?

tatarsky commented 8 years ago

The missing BRCA items are being brought back into place. No updates seen in the active area. A log is being kept of the items brought back. It will be a while.

The checksum pass on BLCA continues and looks good so far. That directory is still set aside as BLCA-restored.

tatarsky commented 8 years ago

I have actually not run the above for BRCA, upon further review of a non-executing (dry) run. I would like to verify with the owner of this data whether the list of items it would restore represents legitimate cleanup or a continuation of an accidental rm.

I cannot tell, and I do not know this data. I can work with somebody more familiar with it next week; respond on this issue if you use the BRCA sequences.

BLCA is going to checksum for a while longer and then will be returned.

tatarsky commented 8 years ago

BLCA has been returned to its location. Advise if it seems correct.

I could still use somebody who uses BRCA for some review.

akahles commented 8 years ago

Currently traveling with limited internet access. I can have a look when I am back mid next week.

tatarsky commented 8 years ago

Sounds good @akahles. I will prepare the file of rsync output that shows the data files being returned. When I looked at it more closely, it was very hard to tell whether (unlike BLCA) it was the result of an "oops" or a deliberate cleanup.

tatarsky commented 8 years ago

@ereznik I just want to confirm you are back in business with BLCA data...

ereznik commented 8 years ago

Thanks @tatarsky!

akahles commented 8 years ago

@tatarsky I am back now. Just send me the rsync output and I'll have a look.

tatarsky commented 8 years ago

Welcome back. I have placed an rsync 'itemize format' file at /cbio/shared/data/tcga/seq/brca.rsync.20151127, which contains the files that would be restored if I brought the 11/27 snapshot back into the still-existing BRCA directory.

Unlike BLCA (which was gone completely) this one has a different "feel" when you look at the log.

It's a fairly large file. The flags at the start of each line are documented in man rsync under --itemize-changes.

tatarsky commented 8 years ago

The catch, BTW, is that restoring the data will drop us down pretty low on space, and I can't free the snapshots I need to free to get some of it back until the restore is done....

akahles commented 8 years ago

Finally went through the list of deleted directories and did not find a reason not to restore them. All of the associated IDs are live on CGHub (not redacted) and thus should be on our system. Sorry that this took so long ...

tatarsky commented 8 years ago

No problem. I will start the rsync, but I believe space is going to get a bit low: I can't release the snapshot until it's done, and since the files were deleted there is no "existing" inode to share (if I understand the way it works correctly).

So any help anyone can give clearing other space right now would be appreciated, as we are a bit low.

tatarsky commented 8 years ago

And in the end, some improved permissions are needed. If this was an accidental rm, it was a big one.

tatarsky commented 8 years ago

This initial pass, BTW, will run with the rsync "u" flag, preventing any newer file from being rolled back. I am reviewing non-executing (dry-run) output to see if that's really a concern.
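The behavior of the "u" flag can be shown with a small self-contained example (temp files standing in for the snapshot and live copies; names are illustrative):

```shell
SNAP=$(mktemp -d); LIVE=$(mktemp -d)
printf 'old\n' > "$SNAP/a.bam"
printf 'new\n' > "$LIVE/a.bam"
touch -t 202001010000 "$SNAP/a.bam"   # make the snapshot copy older

# -u ("update") skips any file that is newer on the receiving side,
# so the restore cannot clobber something modified after the deletion.
rsync -au "$SNAP/" "$LIVE/"
```

After this runs, the live copy still contains "new": the older snapshot version was not rolled over it.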

tatarsky commented 8 years ago

Underway with no update clobbering. It will be quite some time.

tatarsky commented 8 years ago

The rsync is done, and one more validation run is done. If I were to allow it to roll back a modified file, it would be just this one, which I show was modified on the 4th:

RNA-Seq/bam/TCGA-A2-A04Y-01A-21R-A034-07.7d01a8b7-29fb41ae-b605-16a0bda8d4ee.v0.cghub.bam

I will likely save the old copy of that with a suffix to compare.

I badly need to release the snapshots now holding the old inode copies to get back some space, so advise when you think that would be possible. I have no way other than the rsync to validate the contents.

I would like to then strongly consider changing the permissions on most of this tree.

akahles commented 8 years ago

Thanks for taking care of this @tatarsky. For the file above, please use the version that is in the snapshots. I have three reasons to suggest this:

1. The file modified on the 4th is only 4GB (vs. 9.8GB in the snapshots) and seems truncated.
2. The checksum of the snapshot version (3403631a2976146e8fa63ae4758d8fb6) matches the one associated with the original file on CGHub (check with cgquery "analysis_id=7d01a8b7-29fb41ae-b605-16a0bda8d4ee"), and the newer one does not.
3. Nobody should actually modify these files anyway ...
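The checksum comparison in point 2 boils down to hashing the local copy and comparing it with the expected value. A generic sketch using stand-in files and stand-in content (the real expected hash would come from the CGHub metadata, and `md5sum` is assumed to be the GNU coreutils tool):

```shell
# Stand-ins: one file plays the snapshot copy, one the live copy.
SNAPF=$(mktemp); LIVEF=$(mktemp)
printf 'bam-data\n' > "$SNAPF"
cp "$SNAPF" "$LIVEF"

# Hash both and compare; in the real check, one side would instead be
# the md5 reported for the analysis_id by cgquery.
SNAP_MD5=$(md5sum "$SNAPF" | awk '{print $1}')
LIVE_MD5=$(md5sum "$LIVEF" | awk '{print $1}')

if [ "$SNAP_MD5" = "$LIVE_MD5" ]; then
    echo "checksum OK"
else
    echo "checksum MISMATCH"
fi
```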

tatarsky commented 8 years ago

Sounds good. Will do that in a moment.

tatarsky commented 8 years ago

Restored. The old file has an "asof201512141000" suffix, which I can delete if desired.

Please recheck, and then just give me a sign that I can release the snapshot. Once released, however, there is no other archive of this data known to me; be clear on that, but I also need the space back.

akahles commented 8 years ago

Just to confirm: the only scenario for loss of any data would be that the rsync from the snapshots over to the current system was incomplete? If so, I am fine with releasing the snapshots. In the worst case, all the data lives on CGHub and can be re-downloaded. We can wait a little longer for others to voice their opinion, but from my side it should be fine.

tatarsky commented 8 years ago

To the best of my knowledge, your comment on the rsync is correct, bearing in mind that I've run several validation passes.

I'll wait until the end of the day. We've got 60TB free, but we try to keep free space a bit above that to avoid excitement.

Thanks for the help @akahles !

tatarsky commented 8 years ago

Dropping the older snapshots in an hour. (If I was unclear about what I meant by "end of day", I can hold; I usually mean end of business day and should have said so.)

tatarsky commented 8 years ago

Snapshots deleted. Back to "regular" free space. Closing this for now. If you need help with permission improvements, raise another issue.