davidskalinder / mpeds-coder

MPEDS Annotation Interface
MIT License

Create MAI-to-pass-2 handover file #83

Closed davidskalinder closed 4 years ago

davidskalinder commented 4 years ago

This issue follows on directly from what's been built for #58, but I'm opening a new thread for it here since the target functionality is pretty different.

So, @johnklemke, as we just discussed I think I can produce a CSV that's pretty close to what we'll need, though right now it's only in the development deployment so I need to deploy it to something with more data in it to be a useful test.

However, while reviewing my notes for the call I realized that I also need to change the multiple-entry behavior so that it concatenates the multiple entries rather than creating a new column for each entry. This will take a little more time than just re-deploying.

Will you find it valuable to start playing with the file with multiple columns for multiple entries? Or is there not much point in producing a test file until the concatenation is fixed?

johnklemke commented 4 years ago

If it doesn't deflect you for too long, a file to ponder and play with might be good.


davidskalinder commented 4 years ago

Okay sounds good, I'll do a dump of what I've got and let you know where to find it...

davidskalinder commented 4 years ago

I've deployed this in testing and am getting some encoding errors, so I'll need to sort those out now (and would have had to eventually anyway!). So that may delay delivery slightly...

davidskalinder commented 4 years ago

Sigh sigh sigh sigh sigh. Fixed the unicode bug in 9ba7a9eef after 2 hours of debugging, only to create another one in the write-to-CSV function. More delays (caused not by this request but, really, I think, by the decision to code MAI in Python 2.7)...

davidskalinder commented 4 years ago

Okay that one was pretty easy. Fixed in 947f610e8a, merged into testing and master, and run in the testing deployment. Next I just need to take it into production and cross my fingers that it 1) doesn't break anything and 2) runs.

davidskalinder commented 4 years ago

All right, that deployment was pretty hitchless.

So @johnklemke, check out gdelt/Skalinder/MAI_exports/by_coder_and_event_by_annotation_2020-05-14_180626.csv (the file is named after its grain).

Like we discussed, this is a file with coders, article metadata, article-level coding, and event-level coding all outer-joined together. I've used the prefix event_ for all the event-level fields and the prefix article_ for all the article-level fields (since technically each level could have a variable with the same name). Also, all variables in the original tables have both a value and a text field, so I've used the _value and _text suffixes to distinguish them. (At the moment, I think the only variables that use both are the text captures, though that might change someday.) Any column that winds up with nothing in it at all is dropped.
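To make the shape of the file concrete, here's a minimal sketch of the join/prefix/drop logic described above. All names (fields, keys) are invented for illustration; the real export code works off the MAI database tables.

```python
def build_export_rows(article_coding, event_coding):
    """Outer-join article- and event-level coding on (coder, article_id),
    prefixing field names so the two levels can't collide, then drop any
    column that is empty in every row."""
    keys = sorted(set(article_coding) | set(event_coding))
    rows = []
    for key in keys:
        row = {"coder": key[0], "article_id": key[1]}
        for name, val in article_coding.get(key, {}).items():
            row["article_" + name] = val
        for name, val in event_coding.get(key, {}).items():
            row["event_" + name] = val
        rows.append(row)
    # Drop columns that wind up with nothing in them at all.
    cols = sorted({c for r in rows for c in r})
    keep = [c for c in cols if any(r.get(c) for r in rows)]
    return [{c: r.get(c, "") for c in keep} for r in rows]
```

Because it's an outer join, a coder-article pair that exists at only one level still gets a row, with the other level's columns blank.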

Also as we discussed, at the moment any variables that appear multiple times at the same grain are split into multiple columns with the nth column getting the suffix _n right after the variable name for all n>1. (So until I build in the concatenation, there'll be a lot of columns.)
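The column-splitting rule is simple enough to sketch (variable name here is just an example):

```python
def widen(name, values):
    # First instance keeps the bare name; instance n > 1 gets suffix _n.
    cols = {}
    for i, val in enumerate(values, start=1):
        cols[name if i == 1 else "{}_{}".format(name, i)] = val
    return cols
```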

I hope the rest should be as simple as possible but no simpler? But of course let me know if you run into any hurdles...

davidskalinder commented 4 years ago

Okay @johnklemke, new version with multiple instances of fields concatenated. The file is in the same place but with a new timestamp.

I used a triple-pipe (|||) as a delimiter, so of course we'll have to hope that nobody ever types that... Also note that because nearly all of the content is actually in paired value and text columns, I've made sure to keep the same number of delimited concatenations in both columns of the pair, even if one of the columns is blank (so there might be some cells that only contain ||| or ||||||); but I think this should be rare or impossible with the fields we're using.
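The pairing rule amounts to padding the shorter list before joining, something like this sketch:

```python
def concat_pair(values, texts, sep="|||"):
    # Pad the shorter list so the value and text columns keep the same
    # number of delimiters, even when one side is blank throughout.
    n = max(len(values), len(texts))
    values = list(values) + [""] * (n - len(values))
    texts = list(texts) + [""] * (n - len(texts))
    return sep.join(values), sep.join(texts)
```

So three values paired with one text yields `1|||2|||3` in one column and `a||||||` in the other, keeping the instances aligned by position.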

Also note that as I mentioned in https://github.com/davidskalinder/mpeds-coder/issues/58#issuecomment-630455901, there is a bug in this process that could cause a concurrency problem if someone creates some data in between the times when any of my four queries hit the database; I think that's unlikely to happen very often, but you might want to bear that in mind when thinking about the test output (which in theory might be affected).
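For what it's worth, the usual fix for that class of problem is to run all the export queries inside a single transaction so they see one consistent snapshot. A rough sketch (placeholder table and column names, not MAI's real schema; on MySQL/InnoDB you'd want REPEATABLE READ isolation for the same effect):

```python
import sqlite3

def export_snapshot(conn):
    # Run every export query inside one transaction so the dump sees a
    # single consistent snapshot even if coders keep saving annotations.
    conn.execute("BEGIN")
    try:
        coders = conn.execute("SELECT id, username FROM coder").fetchall()
        events = conn.execute("SELECT id, article_id FROM event").fetchall()
    finally:
        conn.execute("COMMIT")
    return coders, events
```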

johnklemke commented 4 years ago

I've been able to import into Access, working from a .csv I created out of LibreOffice Calc that does not contain any of the three-number text pointers from the *_value fields. But I do not see a pass 1 coding timestamp, which we expect to be useful.

davidskalinder commented 4 years ago

Hmm, okay, glad you could get it that far. Were those three-number fields making the import fail?

As for the timestamps, I see now that this is my mistake. (I did all the querying and logic for them but forgot to join them in! D'oh!) There are timestamps for every single annotation, so instead of keeping them all my plan was to keep the earliest and latest timestamp for each coder-article pair if that sounds sensible to you? Anyway sorry for the error, I can fix it next week before (I think?) diving into debugging the import/export settings...
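The earliest/latest reduction I have in mind is basically this (sketch with invented tuple shape; timestamps sort correctly as ISO-format strings):

```python
def timestamp_span(annotations):
    # annotations: iterable of (coder, article_id, timestamp) tuples.
    # Keep only the earliest and latest timestamp per coder-article pair.
    spans = {}
    for coder, article_id, ts in annotations:
        key = (coder, article_id)
        if key in spans:
            first, last = spans[key]
            spans[key] = (min(first, ts), max(last, ts))
        else:
            spans[key] = (ts, ts)
    return spans
```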

johnklemke commented 4 years ago

I don't think the three-number things were causing imports to fail, but for the time being at least they're useless in Access. Gives me fewer fields to skip over in the import specification.


johnklemke commented 4 years ago

There are some rows in the .csv that lack an internal article ID and a value in db_id. Does this make sense that there'd be such cases? I think they'd be useless in Pass 2, and I can readily avoid loading them into Access, so no big deal on my end of the process.

davidskalinder commented 4 years ago

There are some rows in the .csv that lack an internal article ID and a value in db_id. Does this make sense that there'd be such cases? I think they'd be useless in Pass 2, and I can readily avoid loading them into Access, so no big deal on my end of the process.

No, that sounds like a bug. I'll check it out when I crack the thing open again. You got a line number handy? No worries if not since it should be easy to find, but if I know what you're looking at then I can be sure to fix that part...
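In the meantime, if it's useful, the offending rows are easy to pull aside before loading. A quick sketch (column names are my assumption from this thread):

```python
def split_orphans(rows):
    # Separate rows that lack an internal article ID or a db_id so they
    # can be inspected (or skipped) before loading into Access.
    keyed, orphans = [], []
    for row in rows:
        if row.get("article_id") and row.get("db_id"):
            keyed.append(row)
        else:
            orphans.append(row)
    return keyed, orphans
```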

davidskalinder commented 4 years ago

As for the timestamps, I see now that this is my mistake. (I did all the querying and logic for them but forgot to join them in! D'oh!) There are timestamps for every single annotation, so instead of keeping them all my plan was to keep the earliest and latest timestamp for each coder-article pair if that sounds sensible to you? Anyway sorry for the error, I can fix it next week before (I think?) diving into debugging the import/export settings...

Moved to #84

davidskalinder commented 4 years ago

There are some rows in the .csv that lack an internal article ID and a value in db_id. Does this make sense that there'd be such cases? I think they'd be useless in Pass 2, and I can readily avoid loading them into Access, so no big deal on my end of the process.

Moved to #86

davidskalinder commented 4 years ago

This issue is splitting into lots of smaller ones, so I'm going to convert it into a (sub!) epic that just contains the rest...

davidskalinder commented 4 years ago

I think everything attached to this epic is closed, so I'm going to close this as well. I'm tempted to recommend that this closure be permanent, since problems with the subissues should be raised there and new fixes should get new issues (since this issue is just to "create" the file!). But of course don't let me tell you what to do. :)