Add activity count to execute results

capitalone / giraffez

User-friendly Teradata client for Python

https://capitalone.github.io/giraffez

Apache License 2.0

108 stars 31 forks source link

Add activity count to execute results #55

Closed theandrew168 closed 4 years ago

theandrew168 commented 6 years ago

This is my first proof of concept to resolve issue #54. Pulling the value out of the PclSUCCESS parcel was easy enough, but I saw a few different options for propagating the rowcount back up to a Cursor object. I initially looked at using the return value from teradata_execute, but that would have required a few more changes to the codebase (as well as changing the existing usage of that function).

The changes in this PR involve adding a rowcount field to the TeradataConnection struct. Then, if a PclSUCCESS is found while reading parcels, pull out its ActivityCount and throw it into that field. Lastly, a property was added to the Cursor object for reading this value.

My primary concern with this implementation is that rowcount doesn't seem to really belong in the TeradataConnection struct. Had there existed some sort of TeradataCursor struct, I would have leaned toward putting the value in there instead. However, from what I was able to discern, the passing of database state between Cmd.execute and the Cursor it returns is contained within the conn parameter. Additionally (and I could be wrong), it looks like this conn parameter is actually an instance of the Cmd object defined here. Another downside of having the field located in TeradataConnection is that it'd get overwritten by every subsequent database action.

Do you suppose it'd be more ideal to add the rowcount field to something like TeradataEncoder since its "context" is more closely related to each individual SQL command? Do you suppose that some completely different scheme that uses the return value from teradata_execute would better fit this problem?

Please let me know what you think!

ktravis commented 6 years ago

Hi Andrew,

Thanks for submitting this PR. Given the way we handle passing state currently (as you mention above), I think this is probably a reasonable way to convey the row count from the Cmd to the Cursor.

A couple of quick notes regarding what could be done before merging, in my opinion:

Looks like the debug arg was left in but doesn't do anything at the moment
You mention the DB API 2.0 specs in your original issue; they say to return -1 in the case that a rowcount is unavailable or can't be determined. This sounds like a good idea to me - it looks like conn->rowcount should probably be set to -1 on creation and whenever teradata_execute is called, does this logic sound right (to everyone)?
Is there a failure case in decoding the ActivityCount attribute from the success parcel? I haven't checked on that part of the docs recently, but I will now. Even if it is overkill it might be worth checking the length of the data anyway before decoding.
It looks like ActivityCount and hence rowcount is a uint32_t, this should be used as the field type and for the temporary variable in teradata_execute. Of course this prohibits us from using -1 directly in that field as specified above: I would maybe suggest keeping some separate state (on the conn or otherwise) which indicated to Cmd_rowcount whether we are in a valid state for a row count or not, and checking that to determine if -1 should be returned. Definitely open to suggestion here, there might be something more straightforward to do.
Docs for this change would be good to have, especially to document the "quirks" of behavior. I think the need for iterating the cursor object before rowcount will be populated warrants an example as well. If you are interested in adding this please feel free, otherwise I would be happy to update the docs.

Curious to hear thoughts from you and others, as there is a bit of nuance here. Thank you again!

theandrew168 commented 6 years ago

I updated the PR with a couple of changes. The first is that I removed the unnecessary argument checking from the Cmd_rowcount function since it doesn't accept any. Second, I updated the type of rowcount in the TeradataConnection struct and added some thoughts.

I could see reasons for using either long, ssize_t, or int64_t to hold the value since it needs to support up to UINT32_MAX. It would also be nice to allow for -1 as an indicator of an error or that the field hasn't been initialized yet. Please let me know your thoughts on this.

Additionally, I did some testing on when the PclSUCCESS arrives in the scope of a typical execute query. The pseudo-code I used for this is shown below:

import giraffez

with giraffez.Cmd(host='hostname', username='username', password='password') as cmd:

    # Assume that schema.table1 holds M records
    results = cmd.execute("select * from scheme.table1")
    print(results.rowcount)  # Prints M before any records have been fetched

    for _ in range(10):
        row = next(results)
        print(row)  # Prints the next row of info as expected
        print(results.rowcount)  # Continues to print M

    # Assume that schema.table2 holds N records
    results = cmd.execute("select * from scheme.table2")
    print(results.rowcount)  # Prints N before any records have been fetched

    for _ in range(10):
        row = next(results)
        print(row)  # Prints the next row of info as expected
        print(results.rowcount)  # Continues to print N

Given these results, it definitely appears that PclSUCCESS arrives (with the correct value) before any records have been fetched and continues to remain valid until a new query gets executed. This leads me to believe that any given call to Cmd.execute() returns only one PclSUCCESS. If this is the case, then any of the subsequent data fetching operations won't have an effect on the rowcount that was already set for a given query. There might still be some logic or situations that I'm missing but it seems to be working as expected.

ChrisRx commented 6 years ago

I was working on an experimental branch to also try and implement a TeradataCursor type (for lack of a better description) that is passed around to keep information such as the command string and the rowcount, and I will try and commit it as is for a reference in a different branch (it is working I believe but can't make a ton of guarantees on it). I made a new type instead of attaching to the TeradataConnection because while the TeradataConnection type corresponds to a Teradata session, it is possible in the future to track parcel requests for [limited] asynchronous execution within a single session, and having that already split out would help. I was also trying to improve the messy interface a bit by removing the multiple execute functions. I have experienced the same observations you have made re: PclSUCCESS so far, so I think its safe to say the behavior is as expected.

But yeah, I will try and get that committed to a branch for review, comments, thoughts, etc since it might be helpful to implementing this to have multiple eyes and opinions on it.

theandrew168 commented 6 years ago

I hope things are well! I'm curious if there is anything I can do to help this PR along? Are there any design decisions or other problems yet to be solved? If so, I'd be glad to lend a hand where possible. Thanks!

ChrisRx commented 6 years ago

Did you have a chance to look at the branch ext-cleanup? It ends up changing a lot of the teradata C extension code and adds a kind of "cursor" object to return things like rowcount, and I wanted to see if you had a chance to check it out and see if the rowcount feature works as expected. If it does then I will just go ahead and accept this PR, and rebase that one, but wanted to get your input first before proceeding.

theandrew168 commented 5 years ago

I went ahead and spent some time looking at the ext-cleanup branch this morning. I definitely like having an explicit concept of cursors. I did notice one small bug, however. The struct CliSuccessType defines ActivityCount as an array of Pclchars, which are signed. This is messing up the shift math and resulting in invalid counts. An explicit cast of each element to unsigned char fixed this issue for me, however.

diff --git a/giraffez/src/teradata.c b/giraffez/src/teradata.c
index 49a6215..87f47dc 100644
--- a/giraffez/src/teradata.c
+++ b/giraffez/src/teradata.c
@@ -329,10 +329,10 @@ PyObject* teradata_handle_parcel_status(TeradataCursor *cursor, const uint32_t p
                 success = (struct CliSuccessType*)*data;
                 *data += sizeof(success);
                 uint32_t count = 0;
-                count = success->ActivityCount[0];
-                count |= success->ActivityCount[1] << 8;
-                count |= success->ActivityCount[2] << 16;
-                count |= success->ActivityCount[3] << 24;
+                count = (unsigned char)success->ActivityCount[0];
+                count |= (unsigned char)success->ActivityCount[1] << 8;
+                count |= (unsigned char)success->ActivityCount[2] << 16;
+                count |= (unsigned char)success->ActivityCount[3] << 24;
                 cursor->rowcount = (int64_t)count;
             }
             break;

Other than that, I think the branch looks solid. I hope that the addition of the cursor concept doesn't regress any other features of the code but, then again, it isn't that big of a change. This branch also supplants my original PR so merging it seems unnecessary.

Let me know what you think!

ChrisRx commented 5 years ago

Very good catch, I had thought I casted those to unsigned char but clearly did not (also the few tests I had run were with low result numbers that were unaffected by the incorrect shift math). I'll try to merge this soon after some more tests, but if any regressions are introduced that you find just let me know in the issues and we can get them addressed quickly. Through a comedy of errors I lost access to a Teradata server to test with and my access to this repo, but will hopefully be getting access to another Teradata server to continue support of giraffez (and I finally got my repo permissions fixed).