dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Tar: archive creation should detect hard links to same file #74404

Open tmds opened 2 years ago

tmds commented 2 years ago

Currently hard links to the same file get duplicated in the archive. Instead, when additional hard links to the same file are encountered, they should be stored as hard links to the first entry.
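
To make the requested outcome concrete, here is a minimal sketch using Python's standard `tarfile` module rather than the .NET API this issue is about. It assumes a file system that supports hard links; the helper name `archive_with_links` and the file names are made up for illustration. It writes two hard links to the same file into an archive and reports how each member was stored.

```python
import os
import tarfile
import tempfile

def archive_with_links(directory, names, archive_path):
    """Archive the named files, then return (name, kind, linkname) per member."""
    with tarfile.open(archive_path, "w") as tar:
        for name in names:
            tar.add(os.path.join(directory, name), arcname=name)
    with tarfile.open(archive_path) as tar:
        return [
            (m.name, "hardlink" if m.islnk() else "file", m.linkname)
            for m in tar.getmembers()
        ]

with tempfile.TemporaryDirectory() as tmp:
    # One file with real content, plus a second hard link to it.
    with open(os.path.join(tmp, "file"), "w") as f:
        f.write("payload")
    os.link(os.path.join(tmp, "file"), os.path.join(tmp, "file2"))

    members = archive_with_links(tmp, ["file", "file2"], os.path.join(tmp, "out.tar"))
    for member in members:
        print(member)
```

The first path is stored as a regular file with its content, and the second is stored as a hard link entry whose link name points at the first entry, so the content is written only once.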

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-io See info in area-owners.md if you want to be subscribed.

Issue Details
Author: tmds
Assignees: -
Labels: `area-System.IO`
Milestone: Future

tmds commented 2 years ago

cc @carlossanlop

carlossanlop commented 2 years ago

How do you differentiate a regular file from a hard link?

tmds commented 2 years ago

> How do you differentiate a regular file from a hard link?

Once the hard link is created the resulting path is no different from the path it was created from. Both paths now have a strong reference to the file.

When you stat a path, st_nlink contains the number of hard links to the file. When there are multiple hard links, it is higher than 1. Paths to the same file have the same st_ino.

tmds commented 2 years ago

For example:

Create a file:

touch file

Create a hard link:

ln file file2

Both of these are valid paths for the file, and both register as regular files. Notice the 2 in the second column of the ls output, which is the number of hard links.

$ ls -lah
total 0
drwxr-xr-x.  2 tmds tmds   80 Aug 23 13:36 .
drwxrwxrwt. 43 root root 1.4K Aug 23 13:22 ..
-rw-r--r--.  2 tmds tmds    0 Aug 23 13:36 file
-rw-r--r--.  2 tmds tmds    0 Aug 23 13:36 file2

They have the same inode number:

$ stat file file2
  File: file
  Size: 0           Blocks: 0          IO Block: 4096   regular empty file
Device: 0,41    Inode: 13530       Links: 2
Access: (0644/-rw-r--r--)  Uid: ( 1000/    tmds)   Gid: ( 1000/    tmds)
Context: unconfined_u:object_r:user_tmp_t:s0
Access: 2022-08-23 13:36:05.514551062 +0200
Modify: 2022-08-23 13:36:05.514551062 +0200
Change: 2022-08-23 13:36:08.184538995 +0200
 Birth: 2022-08-23 13:36:05.514551062 +0200
  File: file2
  Size: 0           Blocks: 0          IO Block: 4096   regular empty file
Device: 0,41    Inode: 13530       Links: 2
Access: (0644/-rw-r--r--)  Uid: ( 1000/    tmds)   Gid: ( 1000/    tmds)
Context: unconfined_u:object_r:user_tmp_t:s0
Access: 2022-08-23 13:36:05.514551062 +0200
Modify: 2022-08-23 13:36:05.514551062 +0200
Change: 2022-08-23 13:36:08.184538995 +0200
 Birth: 2022-08-23 13:36:05.514551062 +0200

KalleOlaviNiemitalo commented 2 years ago

> Paths to the same file have the same st_ino.

And the same st_dev.
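
Putting st_nlink, st_ino, and st_dev together, the detection step could look roughly like the sketch below. It is written in Python for brevity and is not the System.Formats.Tar implementation; `plan_entries` and the tuple layout are hypothetical names chosen for this example. Each regular file with more than one link is keyed by (st_dev, st_ino), and any later path with the same key is planned as a hard link entry to the first path.

```python
import os
import stat
import tempfile

def plan_entries(paths):
    """Return ("file", path, None) or ("hardlink", path, first_path) per path."""
    first_seen = {}  # (st_dev, st_ino) -> first path planned for that file
    plan = []
    for path in paths:
        st = os.lstat(path)
        key = (st.st_dev, st.st_ino)
        # Only regular files with more than one link need to be tracked.
        tracked = stat.S_ISREG(st.st_mode) and st.st_nlink > 1
        if tracked and key in first_seen:
            plan.append(("hardlink", path, first_seen[key]))
            continue
        if tracked:
            first_seen[key] = path
        plan.append(("file", path, None))
    return plan

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "file")
    b = os.path.join(tmp, "file2")
    c = os.path.join(tmp, "other")
    for p in (a, c):
        with open(p, "w") as f:
            f.write("data")
    os.link(a, b)  # file2 is a second hard link to file

    plan = plan_entries([a, b, c])
    kinds = [kind for kind, _, _ in plan]
    targets = [os.path.basename(t) if t else None for _, _, t in plan]
    print(kinds, targets)
```

Files with a single link never enter the dictionary, so the extra bookkeeping is limited to files that actually have additional hard links.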

For Win32, there is FindFirstFileNameW, but I don't know whether it works with all remote file systems (SMB, NFS, WSL2), and the results might be difficult to use if symbolic links to directories are involved. There is also DWORD NumberOfLinks in FILE_STANDARD_INFO, LARGE_INTEGER FileId in FILE_ID_BOTH_DIR_INFO, and FILE_ID_128 FileId in FILE_ID_INFO or FILE_ID_EXTD_DIR_INFO. Of these, FILE_ID_128 appears to be supported on Windows Server only. On Windows client operating systems, you'd have to use some other way to check whether the files are on the same volume, but I don't know how to do that efficiently. Perhaps the volume check doesn't have to be efficient if you do it only when NumberOfLinks is greater than one and the LARGE_INTEGER FileId already matches.
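
The ordering suggested in the last sentence, cheap checks first and the volume check last, can be sketched as follows. This is a platform-neutral Python illustration that makes no Win32 calls: `LinkInfo`, `find_first_link`, and the `same_volume` callback are hypothetical, and the drive-letter comparison only stands in for whatever real volume check would be used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkInfo:
    path: str
    number_of_links: int  # stands in for FILE_STANDARD_INFO.NumberOfLinks
    file_id: int          # stands in for the 64-bit FileId

def find_first_link(info, seen, same_volume):
    """Return the earlier path that `info` is a hard link to, or None."""
    if info.number_of_links <= 1:
        return None  # cheap exit: no other links can exist
    for earlier in seen.get(info.file_id, []):
        # The expensive check runs only after the file ID already matched.
        if same_volume(earlier.path, info.path):
            return earlier.path
    seen.setdefault(info.file_id, []).append(info)
    return None

volume_checks = 0

def same_volume(path_a, path_b):
    """Placeholder volume check: compares drive letters and counts calls."""
    global volume_checks
    volume_checks += 1
    return path_a[0] == path_b[0]

entries = [
    LinkInfo("C:\\a.txt", 2, 7),
    LinkInfo("C:\\b.txt", 2, 7),  # same ID and volume as a.txt
    LinkInfo("D:\\c.txt", 2, 7),  # same ID, different volume
    LinkInfo("C:\\d.txt", 1, 9),  # single link, never checked
]
seen = {}
results = [find_first_link(e, seen, same_volume) for e in entries]
print(results, volume_checks)
```

In this made-up input the volume check runs twice for four files, which is the point of deferring it until both cheaper conditions hold.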

carlossanlop commented 2 years ago

Thanks @tmds. When I was implementing hardlinks, I didn't find information explaining st_nlink. I should've asked you directly.

Would you consider hardlinks a common enough scenario that we would have to fix this in 7? Or can this wait to be fixed in 8?

adamsitnik commented 2 years ago

I've moved it to 8.0 as I don't believe such scenarios are common. Moreover, it sounds like we will need to do some extra work to get it working, which might cause a minor perf regression.

tmds commented 2 years ago

Yes, 8 is fine.