dolthub / dolt

Dolt – Git for Data

Support `dolt add -p` #2465

Open addisonklinke opened 2 years ago

addisonklinke commented 2 years ago

In git, you can stage chunks of a file for commit via the `-p` flag. In dolt, I would expect the equivalent to be staging subsets of rows/columns, but currently it appears only whole-table adds are supported.

To get the same behavior as git, you'd have to remove all changes in your working tree and then add them back one by one with a commit following each (which would get very tedious for many changes).

andy-wm-arthur commented 2 years ago

@addisonklinke thanks for submitting this issue. It may be possible to support staging subsets in Dolt, but it would take some careful design to get it right.

Dolt's storage layer is row-major, so any partial-staging logic would also need to work row-wise, doing an O(n) table scan while performing a stage operation. A possible interface for this would be to stage rows using a SQL query, where we stage all rows that fall within a range of primary-key values.

Do you have any more thoughts on this, or any more info about your use case?
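To make the row-wise idea above concrete, here is a minimal Python sketch of staging by primary-key range. This is purely illustrative: the function and data structures are invented for this sketch and are not Dolt internals. It models HEAD and the working set as `{pk: row}` maps, scans all rows once (the O(n) scan mentioned above), and builds a staged row set that takes working-set versions only inside the requested key range.

```python
def stage_pk_range(head_rows, working_rows, lo, hi):
    """Illustrative partial staging by primary-key range (not Dolt internals).

    head_rows / working_rows: {primary_key: row_tuple} snapshots.
    Returns the staged row set: working-set versions for keys in [lo, hi],
    HEAD versions everywhere else.
    """
    staged = dict(head_rows)  # start from HEAD: nothing staged yet
    for pk in sorted(set(head_rows) | set(working_rows)):  # O(n) scan
        if not (lo <= pk <= hi):
            continue  # outside the requested range: leave the HEAD version
        if pk in working_rows:
            staged[pk] = working_rows[pk]  # stage an update or insert
        else:
            staged.pop(pk, None)  # row deleted in working set: stage the delete
    return staged
```

Under this model, staging the cat bbox change from the example below would be `stage_pk_range(head, working, 1, 1)`: row 1 picks up the working-set version while rows 2 and 3 stay at HEAD.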

addisonklinke commented 2 years ago

@andy-wm-arthur Sorry for the delay here. I think this would be most useful for interactively staging on the command line. At times, I could envision having made multiple categories of local changes without committing in between (not the greatest workflow, but it happens). In that case, I'd like to be able to divide these logical changes into two different commits so the intention is clearer when reading the commit log.

For example, say I have an object detection dataset

$ dolt init
$ dolt sql

test> CREATE TABLE objects (id INT NOT NULL AUTO_INCREMENT, label TEXT, bbox TEXT, PRIMARY KEY (id));
test> INSERT INTO objects (label, bbox) VALUES ("cat", "[1, 2, 3, 4]"); 
test> INSERT INTO objects (label, bbox) VALUES ("dog", "[10, 20, 30, 40]"); 
test> INSERT INTO objects (label, bbox) VALUES ("dog", "[5, 6, 7, 8]"); 
test> select * from objects;
+----+-------+------------------+
| id | label | bbox             |
+----+-------+------------------+
| 1  | cat   | [1, 2, 3, 4]     |
| 2  | dog   | [10, 20, 30, 40] |
| 3  | dog   | [5, 6, 7, 8]     |
+----+-------+------------------+

$ dolt add objects
$ dolt commit -m "initial data"

Then I decide the cat bounding box is wrong, and that dogs would be better split up by breed:

test> update objects set bbox="[3, 4, 5, 6]" where id=1;
test> update objects set label="poodle" where id=2;
test> update objects set label="bulldog" where id=3;

$ dolt diff
diff --dolt a/objects b/objects
--- a/objects @ 73hiqmiduef0sqtecba4fav7vuuvdk2l
+++ b/objects @ 1hq161cev9kkt6eukvap0jmrfeedvt9j
+-----+----+---------+------------------+
|     | id | label   | bbox             |
+-----+----+---------+------------------+
|  <  | 1  | cat     | [1, 2, 3, 4]     |
|  >  | 1  | cat     | [3, 4, 5, 6]     |
|  <  | 2  | dog     | [10, 20, 30, 40] |
|  >  | 2  | poodle  | [10, 20, 30, 40] |
|  <  | 3  | dog     | [5, 6, 7, 8]     |
|  >  | 3  | bulldog | [5, 6, 7, 8]     |
+-----+----+---------+------------------+

My request is to have an interface like the one below, which mirrors git's:

$ dolt add -p

diff --dolt a/objects b/objects
--- a/objects @ 73hiqmiduef0sqtecba4fav7vuuvdk2l
+++ b/objects @ 1hq161cev9kkt6eukvap0jmrfeedvt9j
@@ row 1,3
+-----+----+---------+------------------+
|     | id | label   | bbox             |
+-----+----+---------+------------------+
|  <  | 1  | cat     | [1, 2, 3, 4]     |
|  >  | 1  | cat     | [3, 4, 5, 6]     |
|  <  | 2  | dog     | [10, 20, 30, 40] |
|  >  | 2  | poodle  | [10, 20, 30, 40] |
|  <  | 3  | dog     | [5, 6, 7, 8]     |
|  >  | 3  | bulldog | [5, 6, 7, 8]     |
+-----+----+---------+------------------+
(1/1) Stage these row(s) [y,n,q,a,d,e,?]? s  # request to split rows

@@ row 1,1  
+-----+----+---------+------------------+
|     | id | label   | bbox             |
+-----+----+---------+------------------+
|  <  | 1  | cat     | [1, 2, 3, 4]     |
|  >  | 1  | cat     | [3, 4, 5, 6]     |
+-----+----+---------+------------------+
(1/3) Stage these row(s) [y,n,q,a,d,e,?]? y  # Stage the cat bbox change

@@ row 2,2  
+-----+----+---------+------------------+
|     | id | label   | bbox             |
+-----+----+---------+------------------+
|  <  | 2  | dog     | [10, 20, 30, 40] |
|  >  | 2  | poodle  | [10, 20, 30, 40] |
+-----+----+---------+------------------+
(2/3) Stage these row(s) [y,n,q,a,d,e,?]? d  # Don't stage row 2 and skip the remaining hunks (row 3)

There are a couple of differences in my example compared to git's.

Afterwards you'd have

$ dolt status
On branch main

Changes to be committed:
  (use "dolt reset <table>..." to unstage)
    modified:       objects

Changes not staged for commit:
  (use "dolt add <table>" to update what will be committed)
  (use "dolt checkout <table>" to discard changes in working directory)
    modified:       objects

timsehn commented 2 years ago

It would be even cooler to be able to do it with sql:

dolt add -sql "select * from table where"

addisonklinke commented 2 years ago

I think both options would be nice to have on a multi-disciplinary team, to cater to each member's strengths. For instance, in ML, the data engineers I work with would probably prefer to use SQL in this scenario. However, I would guess that most ML engineers like myself would be more comfortable with the git syntax. I can get by in SQL, but it's definitely not my native language, so it feels rather slow and cumbersome.

For ML training, my preference is to load the data out of SQL and into Pandas for Python processing as soon as possible. In that regard, my main usage of Dolt is the versioning capabilities and schema enforcement. Of course, I'm sure I could get additional value out of that process if I was more familiar with SQL, but unfortunately there's only time to master so many tools/languages 🙂

Hopefully that gives a better sense of where I'm coming from

timsehn commented 2 years ago

Understood :-)

My point is that we should implement both.

addisonklinke commented 2 years ago

Regarding my link to the unidiff library above, it's unclear to me whether the incompatibility would be best solved by

  1. Updating their PatchSet parser
  2. Adding additional standardized format components into Dolt's output until it's similar enough to git's to be parsed. I wonder in particular whether chunk headers (like the @@ row 1,3 @@ I used above) would help?
  3. Using an existing Dolt command which can already produce the desired output

P.S. for context I used unidiff.PatchSet in my MLOps pipelines to build and log artifacts containing a change list from git diffs, and I would like to do the same for Dolt. For example

# Git change list
changes = [
    {
        'path': 'my_file.py',
        'additions': 4,
        'deletions': 0
    }
]

# Desired Dolt change list
changes = [
    {
        'table': 'objects',
        'additions': 3,
        'deletions': 3
    }
]

christopherreay commented 1 month ago

Our use case would be one where table data is being updated by several processes/roles. It might be possible to find a workaround by spawning tables more or less per user when people log in. Even so, it could become quite complex.

Or I suppose we could use separate database connections to create separate working areas, right?

macneale4 commented 1 week ago

The first steps of this change are in place: https://docs.dolthub.com/sql-reference/version-control/dolt-system-tables#dolt_workspace_usdtablename

Adding this to the CLI will be next.
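Based on the linked docs, the workspace tables already allow row-level staging from SQL today. A hedged sketch against the `objects` example from earlier in this thread (column details follow the linked `dolt_workspace_$tablename` documentation; exact column names may differ by version):

```sql
-- Inspect pending row changes for the objects table
SELECT * FROM dolt_workspace_objects;

-- Stage one row change (id here is the workspace-table row id,
-- not the objects primary key)
UPDATE dolt_workspace_objects SET staged = 1 WHERE id = 0;

-- Unstage it again
UPDATE dolt_workspace_objects SET staged = 0 WHERE id = 0;
```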

macneale4 commented 1 week ago

@christopherreay - Really the only way this can work safely with multiple people is if everyone uses their own branch, which is generally what we recommend. Uncommitted changes on a shared group branch don't play to Dolt's strengths as a database.

It's possible to use our new feature, `dolt_workspace_$tablename` tables, to stage and unstage changes on a shared branch. But it's always recommended to double-check what is staged before you commit. Even then, you need to ensure that no one else is mucking with things. Doing this on a shared branch has all the same concurrency problems as any other situation where multiple people work on the same branch.