Closed: davidskalinder closed this issue 4 years ago.
This issue follows on directly from what's been built for #58, but I'm opening a new thread for it here since the target functionality is pretty different.
So, @johnklemke, as we just discussed, I think I can produce a CSV that's pretty close to what we'll need, though right now it's only in the development deployment, so I need to deploy it to something with more data in it to be a useful test.
However, while reviewing my notes for the call, I realized that I also need to change the multiple-entry behavior so that it concatenates the multiple entries rather than creating a new column for each entry. This will take a little more time than just re-deploying.
Will you find it valuable to start playing with the file with multiple columns for multiple entries? Or is there not much point in producing a test file until the concatenation is fixed?

If it doesn't deflect you for too long, a file to ponder and play with might be good.
Okay sounds good, I'll do a dump of what I've got and let you know where to find it...
I've deployed this in testing and am getting some encoding errors, so I'll need to sort those out now (and would have had to eventually anyway!). So that may delay delivery slightly...
Sigh sigh sigh sigh sigh. Fixed the unicode bug in 9ba7a9eef after 2 hours of debugging, only to create another one in the write-to-CSV function. More delays (caused not by this request but, really, I think, by the decision to code MAI in Python 2.7)...
Okay that one was pretty easy. Fixed in 947f610e8a, merged into testing and master, and run in the testing deployment. Next I just need to take it into production and cross my fingers that it 1) doesn't break anything and 2) runs.
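(For the curious: this is only a sketch of the general shape of that kind of fix, not the actual change in 947f610e8a. In Python 2.7 the csv module can't write unicode objects directly, so the usual move is to encode every cell to UTF-8 right before handing the row to the writer:)

```python
# Sketch only (Python 2.7): encode unicode cells to UTF-8 before csv writes them.
import csv

def write_rows_utf8(path, rows):
    """rows is an iterable of lists whose cells may be unicode, bytes, or None."""
    with open(path, 'wb') as f:  # binary mode for Python 2.7's csv module
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([
                cell.encode('utf-8') if isinstance(cell, unicode) else cell
                for cell in row
            ])
```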
All right, that deployment was pretty hitchless.
So @johnklemke, check out gdelt/Skalinder/MAI_exports/by_coder_and_event_by_annotation_2020-05-14_180626.csv (the file is named after its grain).
Like we discussed, this is a file with coders, article metadata, article-level coding, and event-level coding all outer-joined together. I've used the prefix event_ for all the event-level fields and the prefix article_ for all the article-level fields (since technically each level could have a variable with the same name). Also, all variables in the original tables have both a value and a text field, so I've used the _value and _text suffixes to distinguish them. (At the moment, I think the only variables that use both are the text captures, though that might change someday.) Any column that winds up with nothing in it at all is dropped.
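(For anyone digging into this later, here's a rough sketch of the shape of that logic. The frame and column names are made up, not the actual export code:)

```python
# Sketch with hypothetical frames; the real export builds these from the database.
import pandas as pd

def build_export(coder_articles, article_coding, event_coding):
    """All three (hypothetical) frames are keyed by (coder_id, article_id);
    each coding variable already carries paired _value and _text columns."""
    article = article_coding.rename(
        columns=lambda c: c if c in ('coder_id', 'article_id') else 'article_' + c)
    event = event_coding.rename(
        columns=lambda c: c if c in ('coder_id', 'article_id') else 'event_' + c)

    out = (coder_articles
           .merge(article, on=['coder_id', 'article_id'], how='outer')
           .merge(event, on=['coder_id', 'article_id'], how='outer'))

    # Any column that winds up with nothing in it at all is dropped.
    return out.dropna(axis='columns', how='all')
```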
Also as we discussed, at the moment any variables that appear multiple times at the same grain are split into multiple columns, with the nth column getting the suffix _n right after the variable name for all n > 1. (So until I build in the concatenation, there'll be a lot of columns.)
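(Here's a rough sketch of that widening step, again with hypothetical frame and column names rather than the export code itself:)

```python
# Sketch: number repeated entries per coder-article-variable, then spread them
# into wide columns named variable, variable_2, variable_3, ...
import pandas as pd

def widen_repeats(long_df):
    """long_df (hypothetical) has columns coder_id, article_id, variable, value."""
    df = long_df.copy()
    df['n'] = df.groupby(['coder_id', 'article_id', 'variable']).cumcount() + 1
    df['colname'] = df['variable'].where(
        df['n'] == 1, df['variable'] + '_' + df['n'].astype(str))
    return (df.pivot_table(index=['coder_id', 'article_id'],
                           columns='colname', values='value', aggfunc='first')
              .reset_index())
```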
I hope the rest is as simple as possible but no simpler? But of course let me know if you run into any hurdles...
Okay @johnklemke, new version with multiple instances of fields concatenated. The file is in the same place but with a new timestamp.
I used a triple-pipe (|||) as a delimiter, so of course we'll have to hope that nobody ever types that... Also note that because nearly all of the content is actually in paired value and text columns, I've made sure to keep the same number of delimited concatenations in both columns of the pair, even if one of the columns is blank (so there might be some cells that only contain ||| or ||||||); but I think this should be rare or impossible with the fields we're using.
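(In case it helps to see the shape of it, here's a tiny sketch of that padding logic. It's not the real code, and the function name is made up:)

```python
# Sketch: collapse repeated entries into one cell per column, joining with '|||'
# and padding blanks so the paired _value / _text columns stay aligned.
DELIM = '|||'

def concat_pair(values, texts):
    """values and texts are parallel lists for one coder-article-variable;
    missing entries come through as None."""
    n = max(len(values), len(texts))
    vals = [(v if v is not None else '') for v in values] + [''] * (n - len(values))
    txts = [(t if t is not None else '') for t in texts] + [''] * (n - len(texts))
    return DELIM.join(vals), DELIM.join(txts)
```

So, for example, concat_pair(['a', None], ['x', 'y']) comes back as ('a|||', 'x|||y'), which is where those cells containing only bare delimiters would come from.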
Also note that as I mentioned in https://github.com/davidskalinder/mpeds-coder/issues/58#issuecomment-630455901, there is a bug in this process that could cause a concurrency problem if someone creates some data in between the times when any of my four queries hit the database; I think that's unlikely to happen very often, but you might want to bear that in mind when thinking about the test output (which in theory might be affected).
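(For what it's worth, one way to close that window eventually, just a sketch with placeholder table names and connection string rather than what the export does now, would be to run all four queries inside a single transaction so they read one consistent snapshot:)

```python
# Sketch assuming a SQLAlchemy engine and placeholder table names: with InnoDB's
# default REPEATABLE READ isolation, all four SELECTs inside one transaction read
# the same snapshot, so rows created mid-export can't appear in some queries but
# not others.
from sqlalchemy import create_engine, text

engine = create_engine('mysql://user:password@localhost/mpeds')  # placeholder URL

with engine.begin() as conn:
    coders = conn.execute(text('SELECT * FROM coders')).fetchall()
    articles = conn.execute(text('SELECT * FROM articles')).fetchall()
    article_coding = conn.execute(text('SELECT * FROM article_annotations')).fetchall()
    event_coding = conn.execute(text('SELECT * FROM event_annotations')).fetchall()
```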
I've been able to import into Access, working from a .csv I created out of LibreOffice Calc that does not contain any of the three-number text pointers from the *_value fields. But I do not see a pass 1 coding timestamp, which we expect to be useful.
Hmm, okay, glad you could get it that far. Were those three-number fields making the import fail?
As for the timestamps, I see now that this is my mistake. (I did all the querying and logic for them but forgot to join them in! D'oh!) There are timestamps for every single annotation, so instead of keeping them all, my plan is to keep the earliest and latest timestamp for each coder-article pair, if that sounds sensible to you? Anyway, sorry for the error; I can fix it next week before (I think?) diving into debugging the import/export settings...
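(In case it helps, here's roughly what that reduction would look like. It's a pandas sketch with hypothetical column names, not the actual query logic:)

```python
# Sketch: from one row per annotation (each with a timestamp), keep only the
# earliest and latest timestamp per coder-article pair.
import pandas as pd

def first_last_timestamps(annotations):
    """annotations (hypothetical) has columns coder_id, article_id, timestamp."""
    return (annotations
            .groupby(['coder_id', 'article_id'])['timestamp']
            .agg(first_coded='min', last_coded='max')
            .reset_index())
```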
I don't think the three-number things were causing imports to fail, but for the time being at least they're useless in Access, so leaving them out gives me fewer fields to skip over in the import specification.
There are some rows in the .csv that lack an internal article ID and a value in db_id. Does it make sense that there'd be such cases? I think they'd be useless in Pass 2, and I can readily avoid loading them into Access, so no big deal on my end of the process.
No, that sounds like a bug. I'll check it out when I crack the thing open again. You got a line number handy? No worries if not since it should be easy to find, but if I know what you're looking at then I can be sure to fix that part...
> There are timestamps for every single annotation, so instead of keeping them all my plan was to keep the earliest and latest timestamp for each coder-article pair...

Moved to #84
> There are some rows in the .csv that lack an internal article ID and a value in db_id...

Moved to #86
This issue is splitting into lots of smaller ones, so I'm going to convert it into a (sub!) epic that just contains the rest...
I think everything attached to this epic is closed, so I'm going to close this as well. I'm tempted to recommend that this closure be permanent, since problems with the subissues should be raised there and new fixes should get new issues (since this issue is just to "create" the file!). But of course don't let me tell you what to do. :)