CCRGeneticsBranch / khanlab_ngs_pipeline


Potential Data Inconsistencies on One Biowulf Filesystem #12

Closed slsevilla closed 1 year ago

slsevilla commented 2 years ago

Email sent from Biowulf staff on 11/9:

Hello,

You are receiving this e-mail because you have access to one or more data directories on the newest HPC all-flash storage system. The HPC staff has recently identified an issue with this filesystem in which files that are written to disk and then immediately re-read may be corrupted. The issue arose in an upgrade to the filesystem code that was performed on September 15; only files created on or after this date could potentially be affected.

To determine which data directory or directories you have access to, run:
checkquota -a
and look for directories that begin with "/vf"; these are the ones that are potentially affected.

Please check any data and results in these directories that might have been read immediately after being written, e.g. if data is produced in a script and then immediately read by another process such as a sort. From what we have seen, data corruption can take the form of missing new lines or binary data in text files.

If you identify any corrupt data, please notify the HPC staff via e-mail at [staff@hpc.nih.gov](mailto:staff@hpc.nih.gov) and go ahead and re-run the job or pipeline from the last known-good step.
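
The two symptoms named in the email (binary data in text files, missing new lines) can be screened for heuristically. The following is only a minimal sketch, assuming the files being checked are expected to be plain text (logs, TSVs and similar). The function name `looks_corrupt`, the chunk size, and the line-length threshold are illustrative choices and do not come from the HPC staff; a clean result does not prove a file is intact.

```python
def looks_corrupt(path, max_line_len=1_000_000, chunk_size=1 << 20):
    """Heuristic screen for a file that is expected to be plain text.

    Flags two symptoms: NUL bytes (a sign of binary data) and an
    implausibly long stretch without a newline (possible missing newlines).
    Returns a list of reason strings; an empty list means nothing was flagged.
    """
    saw_nul = False
    run = 0      # bytes seen since the last newline
    longest = 0  # longest line seen so far, in bytes

    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            if b"\x00" in chunk:
                saw_nul = True
            parts = chunk.split(b"\n")
            # The first piece continues the line carried over from the last chunk.
            run += len(parts[0])
            # Every later piece starts a new line.
            for part in parts[1:]:
                longest = max(longest, run)
                run = len(part)
    longest = max(longest, run)

    reasons = []
    if saw_nul:
        reasons.append("contains NUL bytes")
    if longest > max_line_len:
        reasons.append(f"line longer than {max_line_len} bytes")
    return reasons
```

The threshold needs tuning per file type: a value that is sensible for a log file would wrongly flag a single-line FASTA or JSON file. Binary formats such as BAM would need a format-specific integrity check instead.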

A significant number of samples (~2000) were run on the /vf/ filesystem during the affected period (on or after September 15).

Solution: A sample list will be created to determine which samples were affected. All samples on that list will be re-run, either on a filesystem other than /vf/ or after patching is completed on Biowulf. Patching status is tracked here: https://hpc.nih.gov/docs/vf_corruption_2022.html
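
One way such a sample list could be assembled is sketched below. It is not the method actually used for this issue. It assumes a hypothetical layout with one subdirectory per sample under a results root, and it uses file modification time as a stand-in for creation time. The year 2022 is inferred from the tracking URL, and the names `affected_samples` and `UPGRADE_DATE` are made up for illustration.

```python
import os
from datetime import datetime

# Filesystem upgrade date from the HPC email (local time); year inferred from the URL.
UPGRADE_DATE = datetime(2022, 9, 15)


def _has_recent_file(sample_dir, cutoff_ts, prefix):
    """True if any file under sample_dir resolves under prefix and has mtime >= cutoff."""
    for dirpath, _dirs, files in os.walk(sample_dir):
        # Resolve symlinks so that directories linked onto /vf are recognised.
        real_dir = os.path.realpath(dirpath) + os.sep
        if not real_dir.startswith(prefix):
            continue
        for name in files:
            try:
                if os.stat(os.path.join(dirpath, name)).st_mtime >= cutoff_ts:
                    return True
            except OSError:
                continue  # unreadable or vanished file; skip it
    return False


def affected_samples(results_root, cutoff=UPGRADE_DATE, prefix="/vf/"):
    """Return sorted names of sample directories that may be affected."""
    cutoff_ts = cutoff.timestamp()
    with os.scandir(results_root) as it:
        entries = sorted(it, key=lambda e: e.name)
    return [
        e.name
        for e in entries
        if e.is_dir() and _has_recent_file(e.path, cutoff_ts, prefix)
    ]
```

For example, `affected_samples("/path/to/results")` would return the candidate sample names to re-run. The result is deliberately an over-estimate: it lists every sample with files written to /vf in the window, not only those with confirmed corruption.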

slsevilla commented 1 year ago

A list was generated and samples are currently being re-run by Xinyu.

As of Monday 11/28, all systems have been patched, so this is no longer a concern.

See email below:

Hello,

Summary: The software which could result in file inconsistencies on one Biowulf filesystem (/vf) has been patched system wide.

Details: On November 9, you received an e-mail from the HPC staff regarding potential data inconsistencies on one Biowulf filesystem. Over the last two weeks, the HPC staff has been working diligently to patch the problematic system software that caused the issue. We are pleased to report that this work is now complete; all nodes that are running user jobs have received the patch. The /remediated_vf file will be removed from compute nodes during the next HPC downtime, which is scheduled for December 8-11.

The HPC staff recognizes how valuable your data is to your work and sincerely apologizes for the disruption to your research. We have been working closely with the filesystem vendor to convey how serious this problem was and determine a strategy to prevent any recurrence.  

Users who are concerned that they may have been impacted by this problem are encouraged to consult https://hpc.nih.gov/docs/vf_corruption_2022.html for information on how to identify the problem and to contact [staff@hpc.nih.gov](mailto:staff@hpc.nih.gov) with any questions.