davidskalinder / mpeds-coder

MPEDS Annotation Interface
MIT License

Manually export coder-article-level data #56

Closed davidskalinder closed 4 years ago

davidskalinder commented 4 years ago

During the meeting, ran SELECT "id", "article_id", "variable", "value", "text", "coder_id", "timestamp" UNION ALL SELECT * FROM coder_article_annotation INTO OUTFILE '/var/lib/mysql-files/coder_article_annotation_2020-03-02' CHARACTER SET utf8 FIELDS TERMINATED BY ',' ESCAPED BY '\"' ENCLOSED BY '\"';. The result had some split fields, an unmatched quote before the text field, and Excel generally didn't like it much.

davidskalinder commented 4 years ago

Pulling is now implemented by an SQL file that can be run with sudo cat /var/lib/mysql-files/output_coder-article-annotation.sql | mysql -u mpeds. This file is not under version control since it is deployment-specific; it needs to be moved (by root, and then chowned) somewhere more sensible. The SQL is as follows:

USE xxx;

SET @sql_text = 
  CONCAT(
    "SELECT 'id'
      , 'article_id'
      , 'variable'
      , 'value'
      , 'text'
      , 'coder_id'
      , 'timestamp'
    UNION ALL SELECT coder_article_annotation.id
      , coder_article_annotation.article_id
      , coder_article_annotation.variable
      , coder_article_annotation.value
      , coder_article_annotation.text
      , user.username
      , coder_article_annotation.timestamp
      FROM coder_article_annotation
      INNER JOIN user
      ON coder_article_annotation.coder_id = user.id
    INTO OUTFILE '/var/lib/mysql-files/"
    , DATABASE()
    , "_coder_article_annotation_"
    , CURDATE()
    , "-"
    , TIME_FORMAT(CURTIME(), '%H-%i-%s')
    , ".csv'
      CHARACTER SET utf8
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\\\'
      ENCLOSED BY '\"'"
  );

PREPARE s1 FROM @sql_text;
EXECUTE s1;
DROP PREPARE s1;
davidskalinder commented 4 years ago

So @olderwoman, there are now two versions of this file that you can try to see if they work. The one I'd ask you to try first is at /var/www/bpp/bpp_coder_article_annotation_2020-03-02-18-38-27.csv. That one is comma delimited, with fields enclosed by the " character and all special characters escaped with a \ character (including line breaks within fields, " characters within fields, and NULL values represented as \N, all of which this file contains). LibreOffice Calc opens this file with no problems and I'm hoping your Stata script will as well.

Excel, however, does not, because it expects " within fields to be escaped by another " (but doesn't enjoy having " used to escape other special characters). MySQL seems to only allow one escape character for everything, so I can't find an elegant way to output this so that Excel will like it. For the time being, in case you need the CSV exactly the way Excel wants it, I've manually edited the file to change all the \"s to ""s: that file is at /var/www/bpp/bpp_coder_article_annotation_2020-03-02-18-38-27_alt.csv (and Winstat's Excel opens it with no problems I can see). So it'd be better if you can use the first file, but if not then the second one should do in a pinch.

Once this is automated in the UI, the escaping and formatting and everything can be handled by Python instead of MySQL, so we'll have more control over it. But meantime, let me know if one of the files above can do the trick for now?
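For illustration, the manual edit described above could be sketched in Python along these lines (a simplistic sketch with a made-up helper name: it only rewrites \" as "" and deliberately ignores MySQL's other backslash escapes such as \\, \N, and escaped newlines, which a real converter would also need to handle):

```python
# Simplistic sketch: rewrite MySQL's \" escapes as the doubled ""
# pairs that Excel expects. Other backslash escapes (\\, \N, escaped
# line breaks) are left untouched and would need separate handling.
def to_excel_quotes(text: str) -> str:
    return text.replace('\\"', '""')

print(to_excel_quotes('"declared \\"no-go areas\\" today"'))
# -> "declared ""no-go areas"" today"
```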

olderwoman commented 4 years ago

I’ll test it now and report back.

olderwoman commented 4 years ago

I think we are a “go” with the alt file.

Using the option for strict binding of quotes seemed to work OK on the alt file; all rows look good:

import delimited "V:\gdelt\coder_stats_2020\bpp\bpp_coder_article_annotation_2020-03-02-18-38-27_alt.csv", delimiter(comma) bindquote(strict) varnames(1) stripquote(yes) clear

In the original file, the defaults produced the same failures we saw when opening the CSV in Excel. The best I got was with the options below (strict binding of quotes), but one row managed to break the original file with all possible choices of options on the input:

import delimited "V:\gdelt\coder_stats_2020\bpp\bpp_coder_article_annotation_2020-03-02-18-38-27.csv", delimiter(comma) bindquote(strict) varnames(1) stripquote(yes) clear

These options avoided most problems, but there was still one row that failed to parse correctly:

id article_id variable value text coder_id timestamp v8
1191 2985 article-desc Mandela - and a large contingent of soldiers and police officers accompanied Xolo to a rally at which the president declared he would not tolerate \no-go areas \ regions barred to supporters of rival political groups. Conflict in South Africa. \N nesya 2020-02-28 16:53:49
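As an aside, the alt file parses cleanly because doubled quotes are the standard RFC 4180-style way to embed a quote inside a quoted CSV field, which is exactly what strict quote binding expects. Python's csv module (used here just as a reference parser, with a made-up two-line sample) reads that style natively:

```python
import csv
import io

# A tiny sample in the alt file's style: quotes inside a quoted field
# are doubled rather than backslash-escaped.
sample = '"id","text"\n"1","declared ""no-go areas"" barred"\n'

rows = list(csv.reader(io.StringIO(sample)))
print(rows[1][1])  # -> declared "no-go areas" barred
```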

davidskalinder commented 4 years ago

Sigh, okay, good enough for now. I'll close this and work on the proper version.

davidskalinder commented 4 years ago

Reopening this because #58 is complex enough that we might need to improve this as a backup in the meantime.

@olderwoman mentions the following in an email:

getting the article annotations to download into the same spreadsheet with the documents would be very helpful. Right now, the solr_id and publication didn’t come with them, which is why I needed additional programming to merge the article descriptions, because the regions are defined by publication city, then I had to assign coders who had no-protest articles to regions by what other publications they were coding.

Ah, I see, yes. I forgot that some article annotations wouldn't have coder-table rows associated with them. I can easily include the SOLR IDs in the article export.

Am I right that we're always getting the publication info by chiseling it out of the SOLR ID? If so that's... not great. If we really need publication name, it should be an MAI field.

If the article-level annotations came in the same spreadsheet and included solr_id it would be very easy to process from them.

It turns out this will be more difficult since the MAI tables are long-form, so there's no sensible way I can see to join them directly. We'll have to pivot them to wide-form and join them afterward, which is involved enough that I'll leave it for #58.
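For reference, the reshape itself is conceptually simple; here is a pure-Python sketch of the long-to-wide pivot we'd need before joining (the rows are made up, shaped like coder_article_annotation):

```python
from collections import defaultdict

# Hypothetical long-form rows shaped like the coder_article_annotation
# table: one row per (article, coder, variable) annotation.
rows = [
    {"article_id": 1191, "coder_id": "nesya",
     "variable": "article-desc", "value": "Conflict in South Africa."},
    {"article_id": 1191, "coder_id": "nesya",
     "variable": "protest", "value": "yes"},
]

# Pivot to wide form: one dict per (article_id, coder_id) pair, with
# one key per annotation variable. These wide rows could then be
# joined to the coder-table export on article_id.
wide = defaultdict(dict)
for r in rows:
    wide[(r["article_id"], r["coder_id"])][r["variable"]] = r["value"]

print(wide[(1191, "nesya")]["protest"])  # -> yes
```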

davidskalinder commented 4 years ago

Am I right that we're always getting the publication info by chiseling it out of the SOLR ID? If so that's... not great. If we really need publication name, it should be an MAI field.

Ah, never mind, I see that they're in the coder-table export somehow. I'll track down how this is done, although I might very well simply create MAI fields for these things anyway.

davidskalinder commented 4 years ago

Am I right that we're always getting the publication info by chiseling it out of the SOLR ID? If so that's... not great. If we really need publication name, it should be an MAI field.

Ah, never mind, I see that they're in the coder-table export somehow. I'll track down how this is done, although I might very well simply create MAI fields for these things anyway.

So, yeah, MAI is simply splitting the SOLR ID string. Which isn't great.
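Presumably something along these lines (a sketch only: the exact ID layout assumed here, publication plus date plus sequence joined by underscores, is an illustration, not necessarily what MAI actually stores):

```python
# Hypothetical SOLR ID layout: PUBLICATION_DATE_SEQUENCE, e.g.
# "NEW-YORK-TIMES_2020-02-28_0042". The publication is just the first
# underscore-delimited chunk, which is why this is fragile.
def publication_from_solr_id(solr_id: str) -> str:
    return solr_id.split("_", 1)[0]

print(publication_from_solr_id("NEW-YORK-TIMES_2020-02-28_0042"))
# -> NEW-YORK-TIMES
```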

davidskalinder commented 4 years ago

Notes from meeting: for now, keep this as a two-table ETL (though hopefully run within a few seconds to avoid sync issues), ensuring that SOLR ID is in both exports so that we can get publication info from it. @olderwoman can continue to use Stata scripts to knit these together until we kill SOLR and start getting handier with exports from MAI.

davidskalinder commented 4 years ago

Okay @olderwoman, there should now be a new pair of files in the /var/www/bpp folder, one of which is an old-style coder table and one of which is the _alt version of the article-level data, now with a SOLR ID in it. Can you try these out when you get the chance and make sure you can get what you need from them? If so, then that should give us a good-enough version to use until we can fix things properly...

davidskalinder commented 4 years ago

Changed filenames to look like bpp_coder_article_annotation_2020-03-16_145339.csv.

@olderwoman, this might require an update to your scripts? I change the name manually when I create the _alt files anyway, so if it needs a change it should be easy to tweak that way if necessary...
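For reference, the new pattern is easy to reproduce (a sketch; the helper name is made up):

```python
from datetime import datetime

# Build a timestamped export filename in the new style, e.g.
# bpp_coder_article_annotation_2020-03-16_145339.csv
def export_filename(db: str, table: str, when: datetime) -> str:
    return f"{db}_{table}_{when:%Y-%m-%d_%H%M%S}.csv"

print(export_filename("bpp", "coder_article_annotation",
                      datetime(2020, 3, 16, 14, 53, 39)))
# -> bpp_coder_article_annotation_2020-03-16_145339.csv
```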

olderwoman commented 4 years ago

Um ok? Did not realize you needed my permission to do this.

davidskalinder commented 4 years ago

Um ok? Did not realize you needed my permission to do this.

You're right, not really -- in fact, I already used the new names in this week's exports. But I wanted to make sure the new filename format works for you before I close the issue. So it sounds like I can do that now...

olderwoman commented 4 years ago

It worked fine.

davidskalinder commented 4 years ago

So, part of this issue is the need to be able to identify pairs of files as discussed in https://github.com/davidskalinder/mpeds-coder/issues/62#issuecomment-608954006. So I reopened this since the two issues are linked...

davidskalinder commented 4 years ago

Same as #62 and #66, this part of the process seems stable for now, so I'll close this issue.