DagsHub / fds

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc
http://fastds.io
MIT License

"fds forget" feature proposal #65

Open guysmoilov opened 3 years ago

guysmoilov commented 3 years ago

Scenario: You accidentally git add'ed or dvc add'ed a path that you didn't intend to.

It's a commonly googled question: https://stackoverflow.com/questions/1274057/how-to-make-git-forget-about-a-file-that-was-tracked-but-is-now-in-gitignore

What fds forget can add:

  1. Easier naming - no more googling required
  2. Automatically detect whether the file is tracked by git or DVC
  3. Remove the file from DVC cache if it is tracked by DVC (after confirmation from the user)
  4. Remove the relevant .dvc file if it exists, and also make git forget about that file
  5. More?
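Point 2 above (detecting which tool tracks a path) could be sketched roughly as follows. This is only an illustration, not fds code: the function names are hypothetical, and it assumes that a path added with `dvc add` has a sibling `<path>.dvc` file and that `git ls-files --error-unmatch` exits non-zero for paths missing from the index. Outputs declared in a `dvc.yaml` pipeline are not covered.

```python
import subprocess
from pathlib import Path


def tracked_by_dvc(path):
    """True if a sibling `<path>.dvc` file exists (what `dvc add` creates)."""
    return Path(str(path) + ".dvc").is_file()


def tracked_by_git(path, run=subprocess.run):
    """True if `path` is in the git index (staged or already committed)."""
    # `run` is injectable so the check can be exercised without a real repo.
    result = run(
        ["git", "ls-files", "--error-unmatch", "--", str(path)],
        capture_output=True,
    )
    return result.returncode == 0


def detect_tracker(path, run=subprocess.run):
    """Return "dvc", "git", or None for an untracked path (hypothetical helper)."""
    if tracked_by_dvc(path):
        return "dvc"
    if tracked_by_git(path, run=run):
        return "git"
    return None
```

DVC is checked first because a DVC-tracked data file is normally git-ignored, so only its `.dvc` file appears in the git index.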

indweller commented 3 years ago

Hi @guysmoilov, I looked at the git part of this problem. There are two cases:
a) If you have not committed the file yet, a simple git restore --staged <file> will do.
b) If you want to untrack a file that has already been committed, it's trickier. git rm --cached (plus listing the file in .gitignore) stops tracking it, but the file will then be deleted from other people's working copies when they do a git pull. git update-index --assume-unchanged only stops the file from showing up in unstaged changes; I think the file itself stays in the repo.
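The two git cases could map to commands along these lines. This is a rough sketch: `git_forget_commands` and its `in_head` parameter are hypothetical names, and the caller is assumed to have already checked whether the path exists in the last commit (for example with `git cat-file -e HEAD:<path>`).

```python
from typing import List


def git_forget_commands(path: str, in_head: bool) -> List[List[str]]:
    """Return the git commands that would make git forget `path`.

    in_head: whether the path already exists in the last commit.
    """
    if not in_head:
        # Case a: only staged, never committed. Unstage it; the file on disk
        # is left untouched.
        return [["git", "restore", "--staged", "--", path]]
    # Case b: already committed. Drop it from the index but keep it on disk.
    # The path still needs a .gitignore entry, and collaborators will lose
    # their local copy of the file on their next pull.
    return [["git", "rm", "--cached", "--", path]]
```

For example, `git_forget_commands("notes.txt", in_head=False)` yields a single `git restore --staged` command. Note that `git restore` requires git 2.23 or newer.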

guysmoilov commented 3 years ago

@indweller Thanks for the research!
Yes, making git forget a committed file is next to impossible for a distributed repo.
As the first line in the issue suggests, I think we should focus on undoing git add and dvc add. fds forget is IMO much easier to remember than git restore --staged <file>, and it should also handle removing the file from DVC tracking.

indweller commented 3 years ago

OK, so for the git part it can run git restore --staged, and for the DVC part it can run dvc remove (https://dvc.org/doc/user-guide/how-to/stop-tracking-data). Can I work on this issue?

guysmoilov commented 3 years ago

@indweller I think you also need to run some form of dvc gc after dvc remove. And sure, thank you!
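The DVC half discussed above might look something like this. It is a sketch under assumptions: `dvc_forget_commands` is a hypothetical name, the path is assumed to have been added with `dvc add` (so its metadata lives in `<path>.dvc`), and `dvc gc --workspace` is just one possible "form of dvc gc".

```python
from typing import List


def dvc_forget_commands(path: str) -> List[List[str]]:
    """Return the dvc commands that would stop tracking `path` and clean the cache."""
    return [
        # Deletes the .dvc file and the matching .gitignore entry; the data
        # file itself stays in the workspace.
        ["dvc", "remove", path + ".dvc"],
        # Drops cache entries no longer referenced by the workspace. No
        # --force here, so dvc asks the user to confirm before deleting.
        ["dvc", "gc", "--workspace"],
    ]
```

One caveat: `dvc gc --workspace` cleans every cache entry the current workspace no longer references, not only the forgotten file, so the confirmation prompt matters.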

guysmoilov commented 3 years ago

Interesting and potentially relevant project: https://rtyley.github.io/bfg-repo-cleaner/