asfernandes / node-firebird-drivers

Node.js Firebird Drivers
MIT License

Memory corruption with async methods #15

Closed sywka closed 5 years ago

sywka commented 5 years ago

Description

"Parallel" access to a large number of blobs on a single connection leads to a hang. I have this happens when the number of iterations in 1E4. However, 1E3 takes a little time, about a few seconds. If the number of iterations <= 1E3, then test is done successfully

On a Mac, in this case, getting a blob's length causes an error: Unrecognized response from Blob::getInfo. Again, if the number of iterations is <= 1E3, the test completes successfully.

Example

describe("node-firebird-driver-native", () => {

    const TIMEOUT = 8.64E7;
    const count = 1E3;

    it("Unrecognized response from Blob::getInfo.", async () => {
        if (existsSync(dbOptions.path)) {
            unlinkSync(dbOptions.path);
        }

        const client = createNativeClient(getDefaultLibraryFilename());
        const attachment = await client.createDatabase(dbOptions.path, {username: "SYSDBA", password: "masterkey"});
        const transaction = await attachment.startTransaction();

        const promises = [];
        for (let i = 0; i < count; i++) {
            promises.push(
                (async () => {
                    const resultSet = await attachment.executeQuery(transaction, "SELECT * FROM RDB$RELATIONS");
                    const result = await resultSet.fetch();
                    for (let j = 0; j < result.length; j++) {
                        const row = result[j];
                        const blobs = row.filter((r) => r instanceof BBlob);
                        for (let k = 0; k < blobs.length; k++) {
                            const blob = blobs[k];
                            const blobStream = await attachment.openBlob(transaction, blob);
                            try {
                                await blobStream.length;
                            } finally {
                                await blobStream.close();
                            }
                        }
                    }
                    await resultSet.close();
                })()
            );
        }
        await Promise.all(promises);

        await transaction.commit();
        await attachment.dropDatabase();
        await client.dispose();

    }, TIMEOUT);
});

Versions

OS: Windows 10
Firebird: 3.0.4
Node: 11.6.0
Yarn: 1.13.0
node-firebird-driver-native: 0.1.1

asfernandes commented 5 years ago

I have tried your test with 1E4 iterations and it works for me taking 113.415ms.

The only way I can see to get Unrecognized response from Blob::getInfo would be reading a blob larger than 2 GB.

OS: Windows 10
Firebird: 3.0.2
Node: 8.11.3
node-firebird-driver-native: 0.1.1

sywka commented 5 years ago

If it's not too much trouble, try running with a larger number of iterations, for example 5E4 or 1E5 (you may have to allocate more memory for node: SET NODE_OPTIONS=--max-old-space-size=XXX && jest). I waited for quite a long time (about an hour); during that time neither the node process nor the Firebird server process used any resources at all.

The only way I can see to get Unrecognized response from Blob::getInfo would be reading a blob larger than 2 GB.

Then why does it work with fewer iterations? The blobs did not change size, and their lengths were read successfully on the first iterations.

asfernandes commented 5 years ago

Please try to use a non-embedded Firebird connection and monitor the amount of memory used by the node and firebird processes.

sywka commented 5 years ago

With a non-embedded Firebird on Windows, I get the same error as on the Mac: Unrecognized response from Blob::getInfo. Iteration count: 1E4.

sywka commented 5 years ago

If the error is just a server limit being hit under parallel execution, then this is not a problem; everything will still work. But when the process completely freezes in embedded server mode, that is very bad.

asfernandes commented 5 years ago

What is freezing, the Firebird server or the node process?

sywka commented 5 years ago

It looks like the Firebird process, but I'm not sure. Perhaps node simply does not receive a response from the Firebird client and the Firebird process itself is not frozen; I don't know.

Have you reproduced the problem?

asfernandes commented 5 years ago

I can't reproduce the problem. I tried on Windows and Linux.

Note that FB is probably not opening lots of blobs in parallel. Node's libuv thread pool has 4 threads by default.

But you're making FB open lots of blobs in a single transaction.

Could you change the test so that each iteration opens and closes its own transaction and see if that makes a difference?

sywka commented 5 years ago

So now I get this error: Error: invalid format for transaction parameter block. -wrong version of transaction parameter block

asfernandes commented 5 years ago

Could you debug it and see what TPB was passed to Firebird when you got this error?

If the TPB is correct, the FB remote protocol has some problem with concurrency under the same attachment.

If the TPB is passed incorrectly, then we have some memory problem in the driver.

You could also try to create not only a transaction per promise, but also an attachment per promise, to see what happens.

sywka commented 5 years ago

You could also try to create not only a transaction per promise, but also an attachment per promise, to see what happens.

Error: Invalid clumplet buffer structure: Unknown tag value - missing in the list of possible

So it looks like a memory problem, or am I mistaken? How can I give you more information?

asfernandes commented 5 years ago

Please post the code of the updated test here.

sywka commented 5 years ago

// Imports assumed by this updated snippet
import { existsSync, unlinkSync } from "fs";
import { resolve } from "path";
import { Blob } from "node-firebird-driver";
import { createNativeClient, getDefaultLibraryFilename } from "node-firebird-driver-native";

describe.only("node-firebird-driver-native", () => {

    const TIMEOUT = 8.64E7;
    const count = 1E4;
    const path = resolve("./TEST.FDB");

    it("Error code: 3221225477", async () => {
        if (existsSync(path)) {
            unlinkSync(path);
        }

        const client = createNativeClient(getDefaultLibraryFilename());

        const db = await client.createDatabase("localhost/3050:" + path, {username: "SYSDBA", password: "masterkey"});

        const promises = [];
        for (let i = 0; i < count; i++) {
            promises.push(
                (async () => {
                    const attachment = await client.connect("localhost/3050:" + path, {
                        username: "SYSDBA",
                        password: "masterkey"
                    });
                    const transaction = await attachment.startTransaction();
                    try {
                        const resultSet = await attachment.executeQuery(transaction, "SELECT * FROM RDB$RELATIONS");
                        const result = await resultSet.fetch();
                        for (let j = 0; j < result.length; j++) {
                            const row = result[j];
                            const blobs = row.filter((col) => col instanceof Blob);
                            for (let k = 0; k < blobs.length; k++) {
                                const blob = blobs[k];
                                const blobStream = await attachment.openBlob(transaction, blob);
                                try {
                                    await blobStream.length;
                                } finally {
                                    await blobStream.close();
                                }
                            }
                        }
                        await resultSet.close();
                    } finally {
                        await transaction.commit();
                        await attachment.disconnect();
                    }
                })()
            );
        }

        await Promise.all(promises);

        await db.dropDatabase();
        await client.dispose();
    }, TIMEOUT);
});

asfernandes commented 5 years ago

I have probably identified the root cause of the problem. XPBs (DPB, TPB, BPB) are passed as arrays and are read as pointers in the native code. But with async routines, the stored pointer is only dereferenced when the background thread starts. Since node has a limited thread pool and you run lots of promises, the pointer is obtained but takes a long time to actually be read. As soon as the routine that created the promise continues its work, the XPB array goes out of scope and at some future moment is garbage collected before the pointer is really read.
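
To illustrate the kind of hazard described above, here is a minimal Nan-style sketch (hypothetical names, not the driver's actual code, and assuming the XPB arrives as a typed array): the address of the V8-owned buffer is captured eagerly, but nothing keeps the array alive until the libuv worker thread finally dereferences it.

#include <nan.h>

// Hypothetical worker showing the unsafe pattern: it remembers only a raw
// pointer into a V8-owned typed array and dereferences it later, on a libuv
// thread, by which time the array may already have been collected.
class StartTransactionWorker : public Nan::AsyncWorker {
public:
    StartTransactionWorker(Nan::Callback* callback, v8::Local<v8::Value> tpb)
        : Nan::AsyncWorker(callback)
    {
        Nan::TypedArrayContents<unsigned char> contents(tpb);
        tpbPtr = *contents;            // address captured now...
        tpbLength = contents.length();
        // ...but no handle is kept, so nothing stops V8 from reclaiming the
        // array once the calling JS code drops its last reference to it.
    }

    void Execute() override
    {
        // Runs much later, when the thread pool gets to it; tpbPtr may be
        // dangling by now, so the TPB bytes read here can be garbage.
        // e.g. attachment->startTransaction(&status, tpbLength, tpbPtr);
    }

private:
    const unsigned char* tpbPtr;
    size_t tpbLength;
};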

As soon as I have a fix, I will ask you to test it.

sywka commented 5 years ago

Ok. Thanks

gsbelarus commented 5 years ago

Should it be using Buffers instead of arrays? I think this article:

https://community.risingstack.com/using-buffers-node-js-c-plus-plus/

addresses the problems of sharing data between node's main loop and the asynchronous native code of a C++ addon.
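
As a rough sketch of that Buffer-based idea (again hypothetical, not the driver's code): if the bytes are copied out of the Node.js Buffer while still on the V8 thread, the worker owns its own memory and the GC can no longer invalidate it.

#include <nan.h>
#include <node_buffer.h>
#include <vector>

// Hypothetical variant that copies the TPB bytes into worker-owned memory
// in the constructor, which runs on the V8/JS thread.
class StartTransactionWorker : public Nan::AsyncWorker {
public:
    StartTransactionWorker(Nan::Callback* callback, v8::Local<v8::Object> tpbBuffer)
        : Nan::AsyncWorker(callback),
          tpb(node::Buffer::Data(tpbBuffer),
              node::Buffer::Data(tpbBuffer) + node::Buffer::Length(tpbBuffer))
    {
    }

    void Execute() override
    {
        // tpb.data() / tpb.size() are plain C++ memory owned by the worker,
        // safe to read from any thread regardless of what the GC does.
    }

private:
    std::vector<unsigned char> tpb;  // worker-owned copy of the TPB bytes
};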

asfernandes commented 5 years ago

@sywka could you test with this branch https://github.com/asfernandes/node-firebird-drivers/tree/work/memory-corruption-async ?

It's not a complete fix for all methods, but it covers the ones used in your test.

gsbelarus commented 5 years ago

the XPB array goes out of scope and at some future moment is garbage collected before the pointer is really read.

I think that even if it remains in scope, the GC could move it around as part of the memory defragmentation process, or move it from the young-generation space to the old one. In either case the pointer would become invalid.

asfernandes commented 5 years ago

@gsbelarus could it be moved even when stored in Nan::Persistent?

gsbelarus commented 5 years ago

I don't know. I need to investigate.

asfernandes commented 5 years ago

Well, the code above does that. Initially I tried to just store the value in a Nan::Persistent and get the pointer inside the worker thread, but that seems to cause problems, as that code then runs outside of a V8 thread.

So I need @sywka to test it, as I have never been able to reproduce the problem.
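
To make that constraint concrete, here is a hedged sketch of the approach described above (hypothetical names again, not the actual driver code): the worker keeps a persistent handle to the typed array so the GC keeps it alive for the worker's lifetime, and it resolves the data pointer up front while still on the V8 thread, so Execute() never has to touch V8 from the libuv thread.

#include <nan.h>

// Hypothetical safer variant: pin the typed array via the worker's
// persistent storage and capture its address on the V8 thread, before the
// job is queued to the libuv thread pool.
class StartTransactionWorker : public Nan::AsyncWorker {
public:
    StartTransactionWorker(Nan::Callback* callback, v8::Local<v8::Value> tpb)
        : Nan::AsyncWorker(callback)
    {
        // Held by the AsyncWorker until it is destroyed, so the array stays
        // reachable (and is not collected) while Execute() runs.
        SaveToPersistent("tpb", tpb);

        // Pointer resolved here, on the V8/JS thread; doing this inside
        // Execute() would mean touching V8 from a non-V8 thread.
        Nan::TypedArrayContents<unsigned char> contents(tpb);
        tpbPtr = *contents;
        tpbLength = contents.length();
    }

    void Execute() override
    {
        // The backing array is kept alive by the persistent handle for the
        // whole lifetime of the worker, so tpbPtr can be read safely here.
        // e.g. attachment->startTransaction(&status, tpbLength, tpbPtr);
    }

private:
    const unsigned char* tpbPtr;
    size_t tpbLength;
};

Whether pinning alone is always enough also depends on how V8 stores the backing bytes; copying them into worker-owned memory in the constructor, as in the Buffer sketch above, is the more conservative option.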

asfernandes commented 5 years ago

@sywka if you prefer, you could also download the addon.node binary artifact from https://ci.appveyor.com/project/asfernandes/node-firebird-drivers/builds/21894266 and update your installation.

sywka commented 5 years ago

On the Mac, parallel opening of multiple attachments (>1E3) now gives an error: Error: connection rejected by remote interface. I can check Windows on Monday.

asfernandes commented 5 years ago

Is there anything in firebird.log?

sywka commented 5 years ago

On the Mac, parallel opening of m...

Forget it, it seems I forgot to rebuild. :)

In general, it seems to work. The first test from this issue was interrupted with an error, Error: too many open handles to database (as I understand it, that's a database limitation). It would be great to get a release with the fixes everywhere and check it out.

asfernandes commented 5 years ago

(as I understand it, that's a database limitation)

It seems the FB limit is 64K connections.

It would be great to get a release with the fixes everywhere and check it out.

Ok, I will do it.

gsbelarus commented 5 years ago

So, every place where a memory block is formed inside Node and a pointer to it is passed to the native C++ driver should be carefully checked, in order to prevent situations where the memory block is destroyed or moved around by the GC before the C++ code gets to it for further processing.

asfernandes commented 5 years ago

That's a job for node-cloop-gen to do. AFAIK, the problem is only in the getAddress calls.

gsbelarus commented 5 years ago

It seems the FB limit is 64K connections.

As I was told, FB SuperServer has a limit of 2048 simultaneous connections through the inet protocol. Beyond this number there will be errors at the network-stack level, like "connection refused" or similar.

asfernandes commented 5 years ago

@sywka I have updated the same branch with fixes for all methods. Please test and report back so that I can release a new version.

sywka commented 5 years ago

In general, stability seems to have improved (I can't even catch the error from #13), but creating a large number of parallel connections (>2E3), for example, leads to a hang on Windows (the Firebird process takes about 4 GB and does not use the CPU at all; the node process takes about 200 MB and also does not use the CPU at all):

...
const db = await client.createDatabase('localhost/3050:' + path, options);
const promises = [];
for (let i = 0; i < count; i++) {
    promises.push(
        (async () => {
            const attachment = await client.connect('localhost/3050:' + path, {
                username: 'SYSDBA',
                password: 'masterkey'
            });
            await attachment.disconnect();
        })()
    );
}
await Promise.all(promises);
await db.dropDatabase();
...

Could it be that getting data from the fb client into node requires a similar correction, just like passing data from node to the fb client? In that case node would simply never receive the result of the call and would hang waiting for it.

asfernandes commented 5 years ago

Could it be that getting data from the fb client into node requires a similar correction, just like passing data from node to the fb client? In that case node would simply never receive the result of the call and would hang waiting for it.

The change "locks" the output node values too, so they are not garbage collected.

Maybe that is FB instability with such a huge number of connections. Anyway, your test does not seem like a realistic scenario.

sywka commented 5 years ago

Well, I think you can close this issue. In any case, the reason it was opened has already been fixed.

gsbelarus commented 5 years ago

OK. I believe the changes can be merged into the main branch now and a new version of the library published.

asfernandes commented 5 years ago

Version 0.1.2 has just been released.

gsbelarus commented 5 years ago

Thanks!