davidskalinder / mpeds-coder

MPEDS Annotation Interface
MIT License
0 stars 0 forks source link

Make sure MAI-to-pass-2 handover file imports to MS Access correctly #87

Closed davidskalinder closed 4 years ago

davidskalinder commented 4 years ago

...cuz at the moment, it duddn't. Will probably need binary-search debugging.

davidskalinder commented 4 years ago

Uh, well, @johnklemke (CC @olderwoman), I guess today is magic-bugfix day. I have done nothing to address this issue and now the import (using the file I made yesterday for #86) works for me. I guess the changes I made in #84 and #86 somehow tricked Access into reading the file properly? I'm not sure I understand it, but I'll take it.

Anyway, setting the Text Qualifier to " and the Code Page to UTF-8 got me most of the way there, but I had a lot of truncation errors since the Short Text type which Access apparently uses for strings based on the first 24 lines is often too short.

The list of fields I had to change to Long Text for now is below, but honestly there will probably be more of these in the future (once coders enter lots of text into more fields) -- the character limit in the MySQL table is 2000, and multiple-entry fields (like all the text capture fields) could well be concatenated into more than that. @johnklemke, I dunno if you have a good trick to conveniently set all those fields to Long Text (fka Memo, apparently)? I'm happy to leave that to you, but let me know if I can help...

Otherwise, as with the other issues, I'll move this to user-testing; let me know if I can close it or if you find some more delicious bugs!

Here are the fields in the current dataset that I had to change to Long Text:

article_article-desc_value
event_actions-text_text
event_actor-text_text
event_actor-text_value
event_basic-info-uncertain_value
event_bystander-text_text
event_desc_value
event_issue-text_text
event_location-text_text
event_named-org-text_text
event_police-actions-text_text
event_size-text_text
event_target-text_text
event_time-text_text
olderwoman commented 4 years ago

bpp_by_coder_and_event_by_annotation_2020-05-27_164412.csv errors of unmatched quotes without strict binding; no errors with strict binding. Still has concatenated dates in the event_enddate_value and event_startdate_value fields. No problem any more with missing article_id

olderwoman commented 4 years ago

bpp_mai_2020-05-27_164412bad_dates.xlsx

olderwoman commented 4 years ago

John reports also getting date conversion errors in his test import, which is almost certainly a different way of noting the same problem.

johnklemke commented 4 years ago

This is looking very promising. No tricks that I've stumbled on, but I'm able to crawl thru the import wizard fairly expeditiously now. Jeez, apparently many versions ago you could edit them. But I think this is gonna work.

On 5/28/2020 5:37 PM, davidskalinder wrote:

Uh, well, @johnklemke https://github.com/johnklemke (CC @olderwoman https://github.com/olderwoman), I guess today is magic-bugfix day. I have done nothing to address this issue and now the import (using the file I made yesterday for #86 https://github.com/davidskalinder/mpeds-coder/issues/86) works for me. I guess the changes I made in #84 https://github.com/davidskalinder/mpeds-coder/issues/84 and #86 https://github.com/davidskalinder/mpeds-coder/issues/86 somehow tricked Access into reading the file properly? I'm not sure I understand it, but I'll take it.

Anyway, setting the Text Qualifier to |"| and the Code Page to UTF-8 got me most of the way there, but I had a lot of truncation errors since the |Short Text| type which Access apparently uses for strings based on the first 24 lines https://stackoverflow.com/questions/39515045/auto-importing-into-access-forcing-field-type is often too short.

The list of fields I had to change to |Long Text| for now is below, but honestly there will probably be more of these in the future (once coders enter lots of text into more fields) -- the character limit in the MySQL table is 2000, and multiple-entry fields (like all the text capture fields) could well be concatenated into more than that. @johnklemke https://github.com/johnklemke, I dunno if you have a good trick to conveniently set all those fields to |Long Text| (fka |Memo|, apparently)? I'm happy to leave that to you, but let me know if I can help...

Otherwise, as with the other issues, I'll move this to user-testing; let me know if I can close it or if you find some more delicious bugs!

Here are the fields in the current dataset that I had to change to |Long Text|:

|article_article-desc_value event_actions-text_text event_actor-text_text event_actor-text_value event_basic-info-uncertain_value event_bystander-text_text event_desc_value event_issue-text_text event_location-text_text event_named-org-text_text event_police-actions-text_text event_size-text_text event_target-text_text event_time-text_text |

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/davidskalinder/mpeds-coder/issues/87#issuecomment-635644903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AORKHEBRI4IVMIHKSEFQCMTRT3RUPANCNFSM4NKYMYHA.

davidskalinder commented 4 years ago

bpp_by_coder_and_event_by_annotation_2020-05-27_164412.csv errors of unmatched quotes without strict binding; no errors with strict binding.

I assume that means the bindqoutes(strict) option for Stata's import delimited command? That does sound to me like what we want (since multiline fields are common in our data). Good to know that it works for Stata since I only checked it in Access.

Still has concatenated dates in the event_enddate_value and event_startdate_value fields.

Yep, that's issue #85, which I haven't gotten to yet. :) It's not really a problem with the import, but rather with the way MAI is occasionally saving things to the database (i.e., twice).

No problem any more with missing article_id

Great! That's issue #86, so you might want to check the details listed there to make sure they're all working as expected.

John reports also getting date conversion errors in his test import, which is almost certainly a different way of noting the same problem.

That would certainly happen when any multiple-entry field is imported as a date type, yes. (I'm pretty sure I imported all the dates as text, just to make sure the file imported at all.) At the moment this seems to only cause a problem because of bug #85, but we should bear in mind that the way multiple-entry dates appear in the import file is probably actually the desired behavior, since we might intentionally create a multi-entry date field (which would be just fine with MAI).

I'm not sure what the best solution for that is, but given that MAI has no notion of which _text or _value fields should contain dates (or any data type other than text, really), it seems like perhaps we shouldn't rely on them being what we want them to be when we import? I suppose my inclination would be to import them leniently (i.e., as text) into a staging table and then clean them up before appending them to the real table? Although that might simply kick the issue down the road... Bottom line is we should have a solution to non-text multiple-entry fields, and I'm not sure I know what it is (although I suspect that's not something that can be solved by the import format).

davidskalinder commented 4 years ago

we might intentionally create a multi-entry date field

Actually, I think I'm overthinking this. We have to make some assumptions about the fields if we're going to use wide-format tables in pass 2 at all, so we might as well make the assumption about the data type; and then if it ever goes wrong we'll pay attention to whatever went wrong.

So maybe the best solution is to import it as text and work around it for now until I have time to fix #85 (probably not until next week).

johnklemke commented 4 years ago

Hmmm. Maybe -- how likely are we to try to do date arithmetic -- maybe not too likely. But is this an obvious manifestation of a problem that might be undetectably messing up other concatenated fields?

On 5/28/2020 7:06 PM, davidskalinder wrote:

we might intentionally create a multi-entry date field

Actually, I think I'm overthinking this. We have to make some assumptions about the fields if we're going to use wide-format tables in pass 2 at all, so we might as well make the assumption about the data type and then if it ever goes wrong we'll pay attention to it.

So maybe the best solution is to import it as text and work around it for now until I have time to fix #85 https://github.com/davidskalinder/mpeds-coder/issues/85 (probably not until next week).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/davidskalinder/mpeds-coder/issues/87#issuecomment-635681068, or unsubscribe https://github.com/notifications/unsubscribe-auth/AORKHEF42R3VSVRY7OCU3RTRT34AFANCNFSM4NKYMYHA.

johnklemke commented 4 years ago

Re importing leniently into a staging table, that's in much the same vein as my thoughts about using a linked .csv -- could apply fixes in the .csv with Excel and possibly do it more transparently. But I do come back to hoping this isn't something, possibly in the MAI data model, that can screw us up more subtly in another, less strictly formatted field. But maybe even if there is a problem it's just a mouse fart in a hurricane...

On 5/28/2020 6:51 PM, davidskalinder wrote:

bpp_by_coder_and_event_by_annotation_2020-05-27_164412.csv errors
of unmatched quotes without strict binding; no errors with strict
binding.

I assume that means the |bindqoutes(strict)| option for Stata's |import delimited| command? That does sound to me like what we want (since multiline fields are common in our data). Good to know that it works for Stata since I only checked it in Access.

Still has concatenated dates in the event_enddate_value and
event_startdate_value fields.

Yep, that's issue #85 https://github.com/davidskalinder/mpeds-coder/issues/85, which I haven't gotten to yet. :) It's not really a problem with the import, but rather with the way MAI is occasionally saving things to the database (i.e., twice).

No problem any more with missing article_id

Great! That's issue #86 https://github.com/davidskalinder/mpeds-coder/issues/86, so you might want to check the details listed there to make sure they're all working as expected.

John reports also getting date conversion errors in his test
import, which is almost certainly a different way of noting the
same problem.

That would certainly happen when any multiple-entry field is imported as a date type, yes. (I'm pretty sure I imported all the dates as text, just to make sure the file imported at all.) At the moment this seems to only cause a problem because of bug #85 https://github.com/davidskalinder/mpeds-coder/issues/85, but we should bear in mind that the way multiple-entry dates appear in the import file is probably actually the desired behavior, since we might intentionally create a multi-entry date field (which would be just fine with MAI).

I'm not sure what the best solution for that is, but given that MAI has no notion of which |_text| or |_value| fields should contain dates (or any data type other than text, really), it seems like perhaps we shouldn't rely on them being what we want them to be when we import? I suppose my inclination would be to import them leniently (i.e., as text) into a staging table and then clean them up before appending them to the real table? Although that might simply kick the issue down the road... Bottom line is we should have a solution to non-text multiple-entry fields, and I'm not sure I know what it is (although suspect that's not something that can be solved by the import format).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/davidskalinder/mpeds-coder/issues/87#issuecomment-635676484, or unsubscribe https://github.com/notifications/unsubscribe-auth/AORKHED6STKEJNWQ2ZUZ653RT32GZANCNFSM4NKYMYHA.

olderwoman commented 4 years ago

So a look at the "long file" coder-table_2020-04-095615.csv makes it look like this is due to multiple records with start-date or end-date. This can happen when a coder changes their mind. So we do seem to have the problem of not being able to distinguish appending multiple responses from replacing responses, so far as I can tell. I think the long file will be instructive in figuring things out. For dates we probably want to overwrite and take the one with last timestamp. But is a record generated when a coder deletes marked text in the text-marking fields?

davidskalinder commented 4 years ago

Hmmm. Maybe -- how likely are we to try to do date arithmetic -- maybe not too likely.

Well, it's pretty likely that we'll want to sort and/or filter by them at least...

But is this an obvious manifestation of a problem that might be undetectably messing up other concatenated fields?

So it could in the sense of #85, if things are getting double-entered for some reason. And it could in the sense that there might be some multiple-entry fields that contain, for example, integers, but which become strings when they're concatenated. Mostly I think the issue is how to handle concatenated fields; but maybe a good-enough answer is to treat them as text and avoid having multiple entries for things we want to treat as non-text...

Re importing leniently into a staging table, that's in much the same vein as my thoughts about using a linked .csv -- could apply fixes in the .csv with Excel and possibly do it more transparently.

My concern with Excel is that it's just too darn flexible -- too easy to make mistakes and forget that we made them. Mind you, running some VBA to do whatever we want to do would mitigate that; but if we're going to do that, why not simply do it in Access and skip the middleman?

But I do come back to hoping this isn't something, possibly in the MAI data model, that can screw us up more subtly in another, less strictly formatted field.

Yeah, again, I think the problem is that MAI's use of long tables means that it doesn't really care what it puts into variables: there could be any number of anything with any name. That make it really easy to add new variables, but hard to predict what they are. I think the only solutions are 1) to build subsequent analyses using a similar structure (which of course would change everything about how we've set up pass 2), or 2) to go ahead and make the assumptions we want about variables and let Access's import errors tell us if they've been violated...

johnklemke commented 4 years ago

I've worked enough with the latest test files to feel pretty confident that any conversion errors we run into will be because of the data itself, not deficiencies in the files I'm importing into Access. I'm on the verge of bringing an alternate version of Pass 2 (for black newspaper data, not newswire data) into testing and do not expect any technical issues with either importing the .csv into Access or executing an append query to bring the pass 1 coder/article/event coder lines into tblBnpCoderLines for use in pass 2 workflows.

davidskalinder commented 4 years ago

Okay, sounds like this one is pretty close to being in hand.

So I've been thinking about #85 (I wonder why?), which I think is still causing most of the errors in the data itself? I think the best way to handle the problem for now is simply to saw off everything but the rightmost value in any concatenated fields that have multiple entries but shouldn't. (I think, off the top of my head, that these are the start-date, end-date, start-date-estimated, and end-date-estimated fields.) The rightmost value should be the latest one (and I can force the sort order to make sure), and as #85 has established, the latest one should be the correct one.

Does that sound feasible to implement without too much trouble? (It sounds like there's an append query that cleans up fields after staging, so it wouldn't be too hard to implement it there?) Something tells me that #85 might take a while, and this would let us work around it pretty reliably (indeed, so reliably that I wonder if we should lower the priority of #85)...

olderwoman commented 4 years ago

I agree that once we know what is happening, this is a relatively easy way to work around the bug. We just wanted to be sure we knew this would capture the correct value.

From: davidskalinder notifications@github.com Sent: Wednesday, June 3, 2020 3:30 PM To: davidskalinder/mpeds-coder mpeds-coder@noreply.github.com Cc: PAMELA E OLIVER pamela.oliver@wisc.edu; Mention mention@noreply.github.com Subject: Re: [davidskalinder/mpeds-coder] Make sure MAI-to-pass-2 handover file imports to MS Access correctly (#87)

Okay, sounds like this one is pretty close to being in hand.

So I've been thinking about #85https://github.com/davidskalinder/mpeds-coder/issues/85 (I wonder why?), which I think is still causing most of the errors in the data itself? I think the best way to handle the problem for now is simply to saw off everything but the rightmost value in any concatenated fields that have multiple entries but shouldn't. (I think, off the top of my head, that these are the start-date, end-date, start-date-estimated, and end-date-estimated fields.) The rightmost value should be the latest one (and I can force the sort order to make sure), and as #85https://github.com/davidskalinder/mpeds-coder/issues/85 has established, the latest one should be the correct one.

Does that sound feasible to implement without too much trouble? (It sounds like there's an append query that cleans up fields after staging, so it wouldn't be too hard to implement it there?) Something tells me that #85https://github.com/davidskalinder/mpeds-coder/issues/85 might take a while, and this would let us work around it pretty reliably (indeed, so reliably that I wonder if we should lower the priority of #85https://github.com/davidskalinder/mpeds-coder/issues/85)...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/davidskalinder/mpeds-coder/issues/87#issuecomment-638444545, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADBJJ5OWVGVTIBVBS6KHQ5LRU2XDZANCNFSM4NKYMYHA.

davidskalinder commented 4 years ago

(and I can force the sort order to make sure)

Commit 94019d68 should do this. Merged into all the main branches and deployments.

I agree that once we know what is happening, this is a relatively easy way to work around the bug. We just wanted to be sure we knew this would capture the correct value.

Yeah, I'm increasingly confident that this will do the trick for now. Granted, we don't quite know for sure why #85 is happening, but I'm now at least pretty sure that the last entry in the DB is the one that coders always see, which means that it's the one we should use for analysis.

johnklemke commented 4 years ago

Yes, agreed -- easy to get just the last instance in the append query, and since we're really working with a small database, I might even just do this for all fields that should be single-valued, like all the yes/no fields.

On 6/3/2020 4:06 PM, davidskalinder wrote:

(and I can force the sort order to make sure)

Commit 94019d6 https://github.com/davidskalinder/mpeds-coder/commit/94019d68e0c2c65b6b5c13b7816fa55cc6e0e9c1 should do this. Merged into all the main branches and deployments.

I agree that once we know what is happening, this is a relatively
easy way to work around the bug. We just wanted to be sure we knew
this would capture the correct value.

Yeah, I'm increasingly confident that this will do the trick for now. Granted, we don't /quite/ know for sure why #85 https://github.com/davidskalinder/mpeds-coder/issues/85 is happening, but I'm now at least pretty sure that the last entry in the DB is the one that coders always see, which means that it's the one we should use for analysis.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/davidskalinder/mpeds-coder/issues/87#issuecomment-638461401, or unsubscribe https://github.com/notifications/unsubscribe-auth/AORKHEHSAXMTWSEZLS5CSUTRU23M5ANCNFSM4NKYMYHA.

davidskalinder commented 4 years ago

Yes, agreed -- easy to get just the last instance in the append query, and since we're really working with a small database, I might even just do this for all fields that should be single-valued, like all the yes/no fields.

Hah, if you say so -- there's still plenty of fields! (Maybe you have a good trick to do them all at once?)

More seriously, it does occur to me that if something goes wrong with the other fields, it might be better to raise an error so that we notice it?

johnklemke commented 4 years ago

Yeah, you're right. Better to know and then work around than simply to hide it.

On 6/3/2020 4:17 PM, davidskalinder wrote:

Yes, agreed -- easy to get just the last instance in the append
query, and since we're really working with a small database, I
might even just do this for all fields that should be
single-valued, like all the yes/no fields.

Hah, if you say so -- there's still plenty of fields! (Maybe you have a good trick to do them all at once?)

More seriously, it does occur to me that if something goes wrong with the other fields, it might be better to raise an error so that we notice it?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/davidskalinder/mpeds-coder/issues/87#issuecomment-638466692, or unsubscribe https://github.com/notifications/unsubscribe-auth/AORKHEDSHVKHXCPKCNY42HDRU24VBANCNFSM4NKYMYHA.

johnklemke commented 4 years ago

Progress note of sorts on this issue. My focus remains on developing the necessary forms and behind-the-scenes logic to provide a Pass 2 workflow for the black newspaper data (mostly this means small modifications to copies of the forms and code I developed for the newswire Pass 2 workflow). But I'll note that the append query I'm currently using to generate tblBnpCoderLines from tblBnpRawImport does not yet handle start dates and end dates that have pipes in them, so 59 conversion errors result in null dates in tblBnpCoderLines. (All of this is in gdelt/Black Newspaper Import to Access/ImportTest.accdb.) I'm going to tolerate those nulls while forging ahead with the workflow development and testing. After that's ready for testing and trial by others, I'll go back to the append query and fix its date handling. I'm confident this can all be addressed in Access without further changes to the work David has done.

davidskalinder commented 4 years ago

I think several iterations of this file have been imported successfully, so I'm going to close this; we can repoen if something does go wrong.