NuGet / Home

Repo for NuGet Client issues

Improve package extraction robustness #5438

Open emgarten opened 7 years ago

emgarten commented 7 years ago

Restore writes out package files in a specific order: the final hash file marks that the package has been 100% extracted and written to disk, and restore puts that hash file in place with a final move.

This can go wrong when the machine crashes while restore is running. The result is files of the correct size that contain all nulls: restore wrote the file, but it was never fully flushed to the physical disk. (Found by the Roslyn team.)

To handle this better, restore should investigate ways to flush files to disk before writing the final hash file. This would ensure that if a crash happens, the marker is not in place and the next restore will overwrite the affected files.

This problem shows up as `'.', hexadecimal value 0x00, is an invalid character. Line 1, position 1.`, which is caused by the nuspec file containing all null characters.
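The ordering described above can be sketched as follows. This is a hypothetical illustration in Python, not NuGet's actual implementation; the function name, the `package.sha512` marker filename, and the `files` parameter are assumptions for the example. The key points are the `fsync` on each content file before the marker exists, and the rename that publishes the marker atomically:

```python
import os
import tempfile

def extract_package(files: dict[str, bytes], install_dir: str, hash_value: str) -> None:
    """Illustrative sketch: durably write package files, then publish a marker."""
    os.makedirs(install_dir, exist_ok=True)
    for rel_path, content in files.items():
        path = os.path.join(install_dir, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # force content to the physical disk, not just the OS cache
    # Write the marker to a temp file, fsync it, then rename it into place:
    # the rename is the atomic "extraction complete" signal.
    fd, tmp = tempfile.mkstemp(dir=install_dir)
    with os.fdopen(fd, "w") as f:
        f.write(hash_value)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, os.path.join(install_dir, "package.sha512"))
```

Without the `fsync` calls, the OS is free to flush the marker before the content files, which is exactly the crash window this issue describes.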

emgarten commented 6 years ago

I'm not sure that there is a good way to fix this without degrading performance.

tmat commented 4 years ago

Have you considered an alternative approach - to detect the corruption and recover from it (e.g. delete the file and recreate the cache)? That should not cause any perf regressions during normal operation.

tmat commented 4 years ago

We are hitting this issue again, btw.

zkat commented 4 years ago

Hey @tmat -- thanks for resurfacing this issue. I actually really agree with what you're saying, and that's exactly what the NPM CLI ended up doing -- detecting cache corruption and re-fetching packages automatically. I'll be picking up this task as soon as I'm done with the current thing this sprint, but we can start talking about it already if you have any other thoughts!

tmat commented 4 years ago

@zkat Thanks! No other thoughts right now beyond the above comment.

nkolev92 commented 4 years ago

We have usually stayed away from detecting global packages folder inconsistencies due to the perf impact of reading the zip (or parts of it). Assets file creation even reuses the file list when rebuilding the assets file to avoid the extra IO.

I think that's why the original proposal is to tackle it at package installation time into the global packages folder.

Ofc we can profile all of this again :)

tmat commented 4 years ago

You are now throwing an exception when reading the file. The suggestion is to perform validation/cache cleanup only when you hit such an unexpected file content format error.

nkolev92 commented 4 years ago

> You are now throwing an exception when reading the file.

That's only if the nuspec is the corrupt file. Right now, NuGet will not try to read any other package content. I'd imagine we'd need to validate all the files.

Then again, maybe addressing the specific repro, and just fixing up the installation directory if the nuspec is not correctly extracted, is enough.
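That narrower recovery could look roughly like this. A hypothetical Python sketch, not NuGet code; the function names and the all-nulls heuristic are assumptions based on the symptom reported above (a nuspec full of `0x00` bytes):

```python
import os
import shutil

def nuspec_is_corrupt(nuspec_path: str) -> bool:
    """Heuristic: a crash-corrupted nuspec is missing, empty, or all null bytes."""
    try:
        with open(nuspec_path, "rb") as f:
            head = f.read(16)
    except OSError:
        return True
    # A valid nuspec starts with XML text; strip nulls and see if anything remains.
    return head.lstrip(b"\x00") == b""

def repair_if_corrupt(package_dir: str, nuspec_name: str) -> bool:
    """Delete the package folder if its nuspec looks corrupt.

    Returns True if the folder was removed and the package needs re-extraction.
    """
    if nuspec_is_corrupt(os.path.join(package_dir, nuspec_name)):
        shutil.rmtree(package_dir, ignore_errors=True)
        return True
    return False
```

This only catches the nuspec case; as noted above, other package files could be corrupt in the same way without being detected.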

tmat commented 4 years ago

Another option might be to extract the files to a directory with a _tmp suffix and then rename the directory after flushing the file content.

nkolev92 commented 4 years ago

> Another option might be to extract the files to a directory with a _tmp suffix and then rename the directory after flushing the file content.

That might do it. We already do that on a file level but for a different reason (assets files, and the rest of the NuGet restore outputs).
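The temp-directory approach could be sketched like this. Again a hypothetical Python illustration, not NuGet's implementation; the `_tmp` suffix comes from the suggestion above, while the function name and racing-restore handling are assumptions:

```python
import os
import shutil

def publish_package(files: dict[str, bytes], final_dir: str) -> None:
    """Extract into '<final_dir>_tmp', fsync, then atomically rename into place."""
    tmp_dir = final_dir + "_tmp"
    os.makedirs(tmp_dir, exist_ok=True)
    for rel_path, content in files.items():
        path = os.path.join(tmp_dir, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # make the content durable before publishing
    try:
        # Atomic on the same volume: other processes see either no package
        # or a complete one, never a partially extracted folder.
        os.rename(tmp_dir, final_dir)
    except OSError:
        # A concurrent restore already published the package; discard our copy.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

A crash before the rename leaves only the `_tmp` folder behind, which the next restore can ignore or clean up, so the incomplete state is never mistaken for an installed package.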

jasonmalinowski commented 4 years ago

Wasn't the original concern that flushing the files had a perf impact? Otherwise there was an easier fix if I recall?

tmat commented 4 years ago

Not flushing the files is faster but incorrect.

jasonmalinowski commented 4 years ago

Sorry I should have been clearer: if you're going to flush, I don't think you need a temp folder per @emgarten's original description unless things have changed since then.

tmat commented 4 years ago

I think you still need the temp dir. The process might crash after Windows starts writing the file on disk but before it's flushed. The purpose of temp dir is to have an atomic "publish" operation that makes the content available for other processes on the machine.

jasonmalinowski commented 4 years ago

Per @emgarten's description, that's what the .sha512 file was supposed to be, but failed since NTFS can flush out of order if they weren't explicit about it.

ghost commented 8 months ago

We have also encountered the same issue and would like to see more robust behavior ensuring the consistency of package contents after unpacking. Corruption left behind by an aborted unpacking must be detected and corrected during the next restore.