doableware / djongo

Django and MongoDB database connector
https://www.djongomapper.com
GNU Affero General Public License v3.0

Bulk Create doesn't update instance references in runtime #306

Open Vichoko opened 5 years ago

Vichoko commented 5 years ago

One line description of the issue

When using the bulk_create method, the returned objects don't have an id / pk. But the insertion is made at the DB level.

Python script

```python
class Text(models.Model):
    text = models.TextField()
```

```python
> Text.objects.count()
0
> texts = [Text(text='foo'), Text(text='bar')]
> texts[0].id
None
> new_texts = Text.objects.bulk_create(texts)
> new_texts[0].id
None
> texts[0].id
None
> Text.objects.count()
2
> Text.objects.all()
<QuerySet [<Text: Text object (1)>, <Text: Text object (2)>]>
> Text.objects.all()[0].id
1
> new_texts[0] == texts[0]
True
> new_texts[0] == Text.objects.all()[0]
False
```

Expected behaviour

The instances passed to bulk_create and the list it returns should have ids after the method is called, and should reference the newly inserted DB rows.
SomeoneInParticular commented 5 years ago

From the Django bulk_create() documentation:

If the model’s primary key is an AutoField it does not retrieve and set the primary key attribute, as save() does, unless the database backend supports it (currently PostgreSQL).

As the default primary key for models is still an AutoField in Djongo, this caveat applies to your use case. If you want the primary keys to be populated by bulk_create, you need to specify the primary key field explicitly instead (given you're using Djongo, I would recommend the ObjectIdField for this):

class Text(models.Model):
    _id = models.ObjectIdField()
    text = models.TextField()

This should fix the issue completely

Vichoko commented 5 years ago

Thank you very much for your answer.

Actually, my problem is that I'm creating a large batch of these instances and it's taking a long time. I'm wondering whether doing this would speed up the operation?

In this case, how do I assign an ObjectId explicitly? Any clues?
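One way this could look is to pre-assign the ObjectIds client-side, so the instances already carry their primary keys before the insert. A minimal sketch, assuming the _id = models.ObjectIdField(primary_key=True) field suggested above:

```python
from bson import ObjectId  # bson ships with pymongo, which djongo uses

# pre-assign the primary keys before bulk_create, assuming the model
# declares _id = models.ObjectIdField(primary_key=True)
texts = [Text(_id=ObjectId(), text='foo'), Text(_id=ObjectId(), text='bar')]
Text.objects.bulk_create(texts)
print(texts[0]._id)  # already set, regardless of what bulk_create returns
```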

bsdis commented 4 years ago

Hi. I also have issues with this. I have actually tried setting both _id = models.ObjectIdField() and _id = models.ObjectIdField(primary_key=True)

But the _id attribute is not updated after bulk_create :( For a normal .save() it works fine. This is quite a big issue.

from django.db import models
from django.utils.translation import gettext_lazy as _

class User(models.Model):
    _id = models.ObjectIdField(primary_key=True)
    emailAddress = models.CharField(_("emailAddress"), max_length=255, blank=False,default="")
    password = models.CharField(_("password"), max_length=255, blank=False,default="")
    attendeeCode = models.CharField(_("attendeeCode"), max_length=255, blank=False,default="")
    class Meta:
        db_table = "users"
    def __str__(self):
        return f'{self.pk}:{self.attendeeCode}'

from apps.users.models import User
u2 = User(attendeeCode="foobar")
User.objects.bulk_create([u2])
>> [<User: None:abe3>]
print(u2._id)
>>

Am I doing something wrong, is this a bug, or is it just not supported? (In the latter case, when is support for this planned?)

allenling commented 3 months ago

let's assume that you have a Record table, and I guess one possible solution would be as follows:

    from django.db import transaction

    objs_list = []
    with transaction.atomic():
        for _ in range(10):
            objs_list.append(Record.objects.create())

why? emmm... let's first take a look at how MySQL handles auto_increment, and then at how Django loses the ids after bulk_create.

types of inserts in MySQL

https://dev.mysql.com/doc/refman/8.0/en/innodb-auto-increment-handling.html

in MySQL, there are a few types of inserts, but two main ones: simple inserts and bulk inserts.

please do not confuse these with the bulk_create operation in Django.

in MySQL, simple inserts are

Statements for which the number of rows to be inserted can be determined in advance.

and

This includes single-row and multiple-row INSERT statements.

and bulk inserts are

Statements for which the number of rows to be inserted (and the number of required auto-increment values) is not known in advance. This includes INSERT ... SELECT, REPLACE ... SELECT, and LOAD DATA statements, but not plain INSERT.

and clearly, calling save() and bulk_create() will generate simple inserts in MySQL's terminology.
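a quick way to see the SQL Django actually emits for both calls (just a sketch; it assumes DEBUG=True so queries are recorded, and reuses the Record model from above):

```python
from django.db import connection, reset_queries

reset_queries()
Record.objects.create()                            # one single-row INSERT
Record.objects.bulk_create([Record(), Record()])   # one multi-row INSERT
for q in connection.queries:
    print(q["sql"])  # both are "simple inserts" in MySQL's terminology
```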

how does MySQL handle simple INSERT statements?

InnoDB assigns new values for the AUTO_INCREMENT column one at a time as each row is processed.

but really how?

there's a lock for increasing the auto_increment column, and there are several lock modes for it.

the AUTO_INCREMENT lock modes used to generate auto-increment values

The auto-increment lock mode is configured at startup using the innodb_autoinc_lock_mode variable.

There are three possible settings for the innodb_autoinc_lock_mode variable. The settings are 0, 1, or 2, for “traditional”, “consecutive”, or “interleaved” lock mode, respectively.

depending on your MySQL version, you will have consecutive mode (the default in 5.7) or interleaved mode (the default in 8.0).
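you can check which mode your server is running with a quick query through Django's connection (a small sketch; the variable name is the one from the MySQL docs linked above):

```python
from django.db import connection

# 0 = traditional, 1 = consecutive, 2 = interleaved
with connection.cursor() as cursor:
    cursor.execute("SELECT @@innodb_autoinc_lock_mode")
    print(cursor.fetchone()[0])
```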

consecutive mode

under the consecutive mode, bulk inserts will lock the table!

In this mode, “bulk inserts” use the special AUTO-INC table-level lock and hold it until the end of the statement.

but for simple inserts, no!

“Simple inserts” (for which the number of rows to be inserted is known in advance) avoid table-level AUTO-INC locks by obtaining the required number of auto-increment values under the control of a mutex (a light-weight lock) that is only held for the duration of the allocation process, not until the statement completes.

as we can see, consecutive mode forces the ids of a bulk insert to be allocated in sequence via the table lock.

if you want your bulk inserts to have consecutive ids, activate consecutive mode.

that's because in some cases, we have to wait for some subqueries to be executed in order to determine the number of rows to be inserted, so the table gets locked.


    bulk insert -> lock the table
                -> execute some queries to calculate the number of rows; suppose there are 10 rows
                -> insert the 10 rows, making sure their ids are consecutive
                -> release the lock

but for simple inserts, MySQL just acquires a lightweight lock to allocate a series of consecutive ids for the rows, and that's it.

so the ids within the same simple insert will be consecutive.

interleaved mode

under interleaved mode, bulk inserts may end up with non-consecutive ids, but not simple inserts.

In this lock mode, no “INSERT-like” statements use the table-level AUTO-INC lock, and multiple statements can execute at the same time.

why? as the name suggests, ids from concurrent insert statements are allowed to interleave, but within one plain insert statement, the ids are still consecutive.

In this lock mode, auto-increment values are guaranteed to be unique and monotonically increasing across all concurrently executing “INSERT-like” statements.

conclusion.

so simple inserts can be viewed as "atomic", and the ids are always consecutive within one insert statement.

turning on the consecutive mode or not really just depends on what you want to do with bulk inserts.

how does Django lose the ids after calling bulk_create?

let's see how the save method retrieves the id for the object.

in short, after sending the insert statement, Django calls last_insert_id to fetch the id.


    # django/db/models/sql/compiler.py.SQLInsertCompiler#L1813
    def execute_sql(self, returning_fields=None):
        assert not (
            returning_fields
            and len(self.query.objs) != 1
            and not self.connection.features.can_return_rows_from_bulk_insert
        )
        opts = self.query.get_meta()
        self.returning_fields = returning_fields
        cols = []
        with self.connection.cursor() as cursor:
            for sql, params in self.as_sql():
                cursor.execute(sql, params)
            if not self.returning_fields:
                return []
            if (
                self.connection.features.can_return_rows_from_bulk_insert
                and len(self.query.objs) > 1
            ):
                rows = self.connection.ops.fetch_returned_insert_rows(cursor)
                cols = [field.get_col(opts.db_table) for field in self.returning_fields]
            elif self.connection.features.can_return_columns_from_insert:
                assert len(self.query.objs) == 1
                rows = [
                    self.connection.ops.fetch_returned_insert_columns(
                        cursor,
                        self.returning_params,
                    )
                ]
                cols = [field.get_col(opts.db_table) for field in self.returning_fields]
            else:
                cols = [opts.pk.get_col(opts.db_table)]
                rows = [
                    (
                        self.connection.ops.last_insert_id(
                            cursor,
                            opts.db_table,
                            opts.pk.column,
                        ),
                    )
                ]

in this execute_sql, when called from save(), the id field is passed as returning_fields, so on MySQL we fall into the last else branch, which calls self.connection.ops.last_insert_id.

that's the first id among the ones newly inserted.

say last_insert_id is x and you are inserting 10 rows, then their ids will be x, x+1, x+2, ...
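for reference, Django's MySQL backend implements last_insert_id as little more than the driver's lastrowid (paraphrased from django/db/backends/mysql/operations.py; the exact code may vary between versions):

```python
# django/db/backends/mysql/operations.py (paraphrased)
def last_insert_id(self, cursor, table_name, pk_name):
    # MySQL reports the id of the *first* row inserted by the most
    # recent INSERT on this connection.
    return cursor.lastrowid
```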

however, when calling the bulk_create method, the returning_fields parameter will be None.

so we just return without calling last_insert_id.

why?

it turns out that Django leaves returning_fields as None if the database cannot return rows from a bulk insert.


    # db/models/query.py.QuerySet:L1852
    def _batched_insert(
        self,
        objs,
        fields,
        batch_size,
        on_conflict=None,
        update_fields=None,
        unique_fields=None,
    ):
        inserted_rows = []  # restored from the surrounding source for readability
        bulk_return = connection.features.can_return_rows_from_bulk_insert
        for item in [objs[i : i + batch_size] for i in range(0, len(objs), batch_size)]:
            if bulk_return and (
                on_conflict is None or on_conflict == OnConflict.UPDATE
            ):
                inserted_rows.extend(
                    self._insert(
                        item,
                        fields=fields,
                        using=self.db,
                        on_conflict=on_conflict,
                        update_fields=update_fields,
                        unique_fields=unique_fields,
                        returning_fields=self.model._meta.db_returning_fields,
                    )
                )
            else:
                self._insert(
                    item,
                    fields=fields,
                    using=self.db,
                    on_conflict=on_conflict,
                    update_fields=update_fields,
                    unique_fields=unique_fields,
                )
        return inserted_rows

if you are using MySQL, connection.features.can_return_rows_from_bulk_insert will be False, so returning_fields is never passed and defaults to None.
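you can verify the flag on your own connection (tiny sketch):

```python
from django.db import connection

# False on MySQL, True on PostgreSQL, hence the missing ids after bulk_create
print(connection.features.can_return_rows_from_bulk_insert)
```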

solution.

as we mentioned above, the ids of a simple insert are always consecutive, so in theory, we can retrieve the ids for the objects created by bulk_create.
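a sketch of what that could look like (hedged: the helper name is mine, and it assumes a MySQL backend, an AutoField pk, auto_increment_increment = 1, and that bulk_create sent everything in one INSERT statement, i.e. no batch_size splitting):

```python
from django.db import connection, transaction

def bulk_create_with_ids(model, objs):
    # relies on the "consecutive within one simple insert" behaviour above
    with transaction.atomic():
        model.objects.bulk_create(objs)
        with connection.cursor() as cursor:
            cursor.execute("SELECT LAST_INSERT_ID()")  # first id of the last INSERT
            first_id = cursor.fetchone()[0]
    for offset, obj in enumerate(objs):
        obj.pk = first_id + offset
    return objs
```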

another solution would be to open a transaction and insert the rows one at a time; each insert shifts the auto_increment counter one step forward, which looks like you are reserving a spot with that lightweight lock, and then you commit all of your inserts at once.

why?

because until you commit, the insert statements just bump the auto_increment counter, and nothing is visible to other sessions.

you will not see the records in the table in MySQL Workbench, and if you SELECT from the table, you will not get those uncommitted rows.

    objs_list = []
    with transaction.atomic():
        for _ in range(10):
            objs_list.append(Record.objects.create())
        # if someone selects from the table here at the same time,
        # they will not get those newly inserted Records.

the main difference is how many times we are going to take that lightweight lock to increment the auto_increment counter.