Closed: davidskalinder closed this issue 4 years ago
Originally posted by @johnklemke in https://github.com/davidskalinder/mpeds-coder/issues/43#issuecomment-585529781:
We do have the SOLR_ID in the coder table file Pam generated on Monday. There's also some article metadata that will need to go into the articles table but not the coder lines table. Doesn't hurt to have this in coder table files, though -- if we come up with a way that gets us the full article text and the metadata as well for loading into the article table, I'll just ignore it in coder tables.
So... what metadata are we talking about here? There is an article_metadata table in the MAI database, but it's just what I output the other day: MAI ID, SOLR ID, title, and (thanks to #39!) publication date.
@olderwoman, maybe it's worth tracking down the file that was used for this last time so that I know what to replicate?
I think this is a double loop. We will need full text, but this can wait until after we get pass 1 production going on MAI.
So: in order to understand how to do this, I think I'm going to need some more info about where this came from in the past.
@alexhanna, do you remember how/whether you produced this for @olderwoman before? To be clear, I think we're just talking about getting a CSV or some such with SOLR IDs and article text, and maybe some of the other fields that are in SOLR, for all 7Kish articles that we've now got loaded into our production MAI.
To help pin this down: the files with text included were created by Alex on 8/13/2017 and 9/6/2017:

1076.1k  8/13/17 13:58  us-newswires-black-protest_updated-2017-08-13.csv
49.7M    8/13/17 13:59  us-newswires-black-protest_updated-text-2017-08-13.csv
50.9M    9/06/17 13:41  us-newswires-black-protest_updated-text-location-2017-09-06.csv
12.3M    9/06/17 13:41  us-newswires-black-protest_updated-text-us-location-2017-09-06.csv

And the older boycott-only files were created on July 18, 2016 (looks like I edited the csv file more recently):

4884.5k  1/01/19 22:48  us-newswires-black-protest-text_2016-07-18.csv
1671.4k  7/18/16  8:36  us-newswires-black-protest-text_2016-07-18.xlsx
75.1k    7/18/16  8:05  us-newswires-black-protest_2016-06-03.csv
Notes from call with @alexhanna: in the past these were produced as part of an MPEDS run. However, there's code in the MPEDS repo (in the solr.py file I think?) that nicely wraps the SOLR API calls to make it not-too-painful to get stuff like this into a json object, at least, which from there can be put into something like a pandas dataframe and dumped to a CSV. The file select-black-protest-articles.py (emailed) does this as part of the MPEDS run.
It's not clear to me that any of the bits of code mentioned above both use SOLR IDs and dump to a CSV. However, the code in the controller that gets the article text is instructive: it might be easiest to simply roll a new query-submitter that builds queries like this: [domain]/solr/mpeds2/select?q=id:"The-Tennessee-Tribune;-Nashville,-Tenn._2011-03-10_0e4ca1c546c9400e5f005c3e9e69a11e"
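For what it's worth, here's a minimal sketch of what that query-submitter could look like -- not MPEDS's actual solr.py code, and untested against our instance. The mpeds2 core name comes from the query above; the host/port and the ID list are placeholder assumptions:

```python
# Sketch only: pull documents out of SOLR by exact ID and dump to CSV.
# SOLR_URL's host/port and the example ID list are placeholder assumptions.
import requests
import pandas as pd

SOLR_URL = "http://localhost:8983/solr/mpeds2/select"

def fetch_by_id(solr_id):
    """Fetch one document by its exact SOLR ID."""
    params = {
        "q": 'id:"%s"' % solr_id,  # IDs contain punctuation, so quote them
        "wt": "json",
        "rows": 1,
    }
    resp = requests.get(SOLR_URL, params=params)
    resp.raise_for_status()
    docs = resp.json()["response"]["docs"]
    return docs[0] if docs else None

ids = [
    "The-Tennessee-Tribune;-Nashville,-Tenn."
    "_2011-03-10_0e4ca1c546c9400e5f005c3e9e69a11e",
]
rows = [d for d in (fetch_by_id(i) for i in ids) if d is not None]
pd.DataFrame(rows).to_csv("solr_output.csv", index=False)
```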
Part of the trick here is figuring out where to put this functionality. It needs to query the MAI database, so it makes sense to put it in the controller and to import solr.py as a module. So I guess that means it might as well be activated by a button or a link or something in the admin interface...
OTOH, it might be possible to get something directly from the SOLR back-end interface using the /export function, but I haven't figured out how to work this yet...
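For reference, a hedged sketch of the /export route (untested here): per the SOLR docs, /export requires an explicit sort and only returns fields with docValues enabled, which may be why it's awkward for big stored-text fields like TEXT. Host/port are placeholders, and whether DATE and PUBLICATION have docValues in our schema is an assumption:

```python
# Sketch only: SOLR's /export streaming handler. It refuses to run without
# a sort, and every field in fl must have docValues enabled.
import requests

resp = requests.get(
    "http://localhost:8983/solr/mpeds2/export",  # placeholder host/port
    params={
        "q": "*:*",
        "sort": "id asc",             # required by /export
        "fl": "id,DATE,PUBLICATION",  # assumes these fields have docValues
    },
)
resp.raise_for_status()
docs = resp.json()["response"]["docs"]
print(len(docs))
```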
New plan for this one: put MPEDS's solr.py code into MAI (since it builds the query result row-by-row), hook it from a link or some such on the export page (#57), and then, ideally, produce it for download (as the coder-table download used to).
So this now depends on #57.
So as part of #61, I've now produced a CSV file with all the SOLR fields for every article in the production deployment. Some of the formatting is a little scruffy at this point -- notably, many of the fields (the ones that technically allow multiple entries, though none of them currently contain more than one) are enclosed in square brackets. I don't think this is the case for SOLR ID (which is called id) or the article text (which is called TEXT), though; so in theory we could use this export as a pass 2 input.
@olderwoman / @johnklemke, what do you think is best at this stage? Do we want to use this to get pass 2 going for these articles, or shall we simply use this file to decide which SOLR fields to keep (that is, for #59) and then worry about polishing it for pass 2 input later on?
At the moment, the file is at gdelt/Skalinder/SOLR_exports/solr_output_2020-04-07_162312.csv, but of course I can move it somewhere more convenient if we want, just let me know.
We could also/instead do a subteam meeting to discuss? If so then maybe send me an email and we can schedule it there.
As I implied in other comments, I think we are going to need to review these fields one by one, looking at their contents for articles from different sources. You have to keep in mind that we are not just analyzing these in MPEDS, we need to be able to trace back to where the article came from.
Yep, agreed. To be clear though, for this issue I'm only asking about whether this file will work for a pass 2 input. As far as I know from the comments upthread and from looking at the pass 2 DB, the only fields we need for the pass 2 input are SOLR ID, article text, and maybe internal ID?
If that's correct, then I think all I need to get this issue resolved is to know whether those 2(/3) fields in the existing file will work for pass 2 or if I need to change them somehow...
Ah, sorry. For pass 2 we need SOLR ID, article text, and publication name and publication date. If the latter two are embedded in the SOLR ID, they can be generated after import, but they are needed.
I think the DATE and PUBLICATION fields in the file should do the trick? (And they're among the small number of fields guaranteed to be in SOLR!) But maybe @johnklemke should have a crack at importing it and see how it goes (unless he can tell it's no good just by looking)?
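If it helps, trimming the export down to just those pass-2 fields is a quick pandas job; a sketch (the input filename is the one upthread, the output name is made up):

```python
# Sketch: keep only the four pass-2 fields from the full SOLR export.
import pandas as pd

df = pd.read_csv("solr_output_2020-04-07_162312.csv")
df[["id", "TEXT", "DATE", "PUBLICATION"]].to_csv(
    "solr_pass2_input.csv", index=False
)
```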
@johnklemke, after we talked just now about whether there was already a file with the SOLR IDs and article text, I went back and found this issue so wanted to send you a reminder. As mentioned above, the file at that location has all the SOLR fields in it, some of which have been carried over to the MAI DB (whose data should be treated as canonical, I think). So I'd say you should ignore everything in the SOLR output file except for the columns id and TEXT.
I could probably clean this export up a bit if necessary now that a) I'm better at querying SOLR and b) we have more of the data in the MAI DB. But maybe we can get everything we need out of the existing export? If not, just let me know.
I see two issues with the export files I'm working with right now in the hopes of giving Morgan some Pass 2 functionality as she reviews recent Pass 1 coding:

(1) Pass 2 requires a link in the coder lines to the articles, but I don't see anything that lets me do that in the two csv files. by_coder_and_event_by_annotation_2020-05-19_141901.csv contains a numeric article ID, but I don't see any SOLR IDs, and black_newswires_solr_output_2020-04-09_163435.csv contains SOLR IDs but no numeric event IDs. (I'd love to hear I'm missing something that's there!)

(2) I was expecting to see some new-to-me articles in the article text csv, but I don't.
The last files I processed were bpp_coder_article_annotation_<artfile>_alt.csv, where <artfile> was 2020-04-27_095618, and coder-table_<coderfile>.csv, where <coderfile> was 2020-04-27_095615.
I append the article annotation file to the coder-table file. I keep track of the last date I downloaded data and flag lines with a timestamp that have a date greater than or equal to the previous download date. Both of these files contain solr_id as well as the internal article_id.
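(A rough pandas analogue of that append-and-flag step, just to pin down the logic -- the real processing is in Stata, and the "timestamp" column name and the cutoff date here are hypothetical:)

```python
# Sketch only: append the article annotation file to the coder-table file,
# then flag lines timestamped on or after the previous download date.
import pandas as pd

coder = pd.read_csv("coder-table_2020-04-27_095615.csv")
annot = pd.read_csv("bpp_coder_article_annotation_2020-04-27_095618_alt.csv")

combined = pd.concat([coder, annot], ignore_index=True, sort=False)

last_download = pd.Timestamp("2020-04-20")  # hypothetical cutoff
combined["new_since_last"] = (
    pd.to_datetime(combined["timestamp"], errors="coerce") >= last_download
)
```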
Hmm, so first of all, we might need to clarify which files we're all looking at? @olderwoman, I don't think anything in this issue (or in issue #83 which @johnklemke references) uses any of the files produced by the Stata scripts. So I think we can rule out any problems with that?
Next up: @johnklemke, it looks like you might be using the SOLR output file from a different database than the current live one? Like I mentioned upthread, the SOLR output for the articles in the bpp deployment should be in gdelt/Skalinder/SOLR_exports/solr_output_2020-04-07_162312.csv -- I believe that black_newswires... file is from a different database. (I should have put the database name in all the filenames, but I hadn't thought of that when I produced the first file; if I run it again, I'll try to be sure to include that.)
Now to the specific questions:
> (1) Pass 2 requires a link in the coder lines to the articles, but I don't see anything that lets me do that in the two csv files. by_coder_and_event_by_annotation_2020-05-19_141901.csv contains a numeric article ID, but I don't see any SOLR IDs, and black_newswires_solr_output_2020-04-09_163435.csv contains SOLR IDs but no numeric event IDs. (I'd love to hear I'm missing something that's there!)
I assume by a link you just mean (SOLR) IDs in each file to match on? In the SOLR output this is just called id; in the MAI output it should be called db_id. When I can't remember what it's called in each place (a frequent occurrence), I look for the fields whose content is [publication name]_[date]_[hexadecimal hash].
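A sketch of that match in pandas, using the two files named above (pulling only TEXT across is illustrative; adjust the column list as needed):

```python
# Sketch: join the MAI export to the SOLR export on the SOLR ID,
# which is named "db_id" on one side and "id" on the other.
import pandas as pd

solr = pd.read_csv("solr_output_2020-04-07_162312.csv")
mai = pd.read_csv("by_coder_and_event_by_annotation_2020-05-19_141901.csv")

merged = mai.merge(
    solr[["id", "TEXT"]],
    left_on="db_id",
    right_on="id",
    how="left",  # keep every coder line even if the article is missing
)
```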
> (2) I was expecting to see some new-to-me articles in the article text csv, but I don't.
So this might be the result of having the wrong SOLR output file? Although it is also true that the articles in our MAI deployment now are a subset of the articles in the old deployment at the blacknews address, so depending on what you already have in pass 2 it might still be the case that the SOLR output from the current deployment doesn't give you any new information...
Anyway I hope that clarifies things? But of course let me know if not.
All right! I made the mistake of grabbing the SOLR file with the most recent date; going to the ..._162312 file looks more promising.
db_id, huh? Okay -- I was chasing after the much more alluring article_id. I don't know yet if the length of the SOLR IDs is gonna be a problem.
Sorry for the alarm; I think I can get back on track.
> db_id, huh? Okay -- I was chasing after the much more alluring article_id. I don't know yet if the length of the SOLR IDs is gonna be a problem.
Yeah, article_id is MAI's ID (it's id in the article annotation table, and I put the article_ prefix on everything in that table when I join it to the others).
I think you can trick Access into using keys of any length, but if not then you could always assign some new ones during staging? And then keep the SOLR ID in there as a regular old text field or some such? Anyway, it probably depends on the details -- let me know if you want me to have a closer look.
> Sorry for the alarm; I think I can get back on track.
No worries! Feel free to re-ring the alarm if the track leads you astray again, heh.
I'm struggling with loading the articles into Access -- it's seeing far too many columns in the .csv. I've tried tweaking the Text Qualifier, but no joy. Pam says she's run into similar problems with loading into Stata and that you provided "alt files" that did work. Is that something we can try for this need?
> I'm struggling with loading the articles into Access -- it's seeing far too many columns in the .csv. I've tried tweaking the Text Qualifier, but no joy.
Hmm... too many columns for what? :) It looks like there are 37 in the SOLR output (of which we only need 2 now, lol) and 62 in the MAI output (of which we probably need all)...
> Pam says she's run into similar problems with loading into Stata and that you provided "alt files" that did work. Is that something we can try for this need?
The only thing we did with the "alt" versions of those exports was to change the method of escaping quotes. That file was a direct dump from MySQL, whereas the MAI and SOLR exports are written to CSV by pandas, so it's unlikely to be the same problem.
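Concretely (a sketch, with placeholder filenames): a raw MySQL dump backslash-escapes embedded quotes (\"), while pandas' to_csv doubles them ("") per RFC 4180, so each file needs different parser settings:

```python
# Sketch: the two quote-escaping conventions. Filenames are placeholders.
import pandas as pd

# pandas-written export: doubled quotes ("") are the default convention,
# so a plain read works.
mai = pd.read_csv("mai_export.csv")

# MySQL-style dump: embedded quotes are backslash-escaped (\"),
# so tell the parser not to expect doubling.
dump = pd.read_csv(
    "mysql_dump.csv",
    escapechar="\\",
    doublequote=False,
)
```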
I just realized that I could just try importing the files myself? That way I can at least check the file format settings...
John is off picking up your dinner so I’ll say: good idea to test an import yourself and see what happens. It seems likely he is being bitten by the unmatched quotes or stray escape codes problems.
Yep, such issues abound with data passes. I'm going to make a dead copy of the Access DB, import into that, and write down whatever I do that works...
Well, Access's CSV import is worse than I remember!
So there are multiple issues here, including (at least) something strange that Access is doing with quoted fields; but that's not the only problem, because even when it appears to get fields right, Access tells me that "An error occurred trying to save import/export specification" (which I have not requested Access to do) when trying to import the SOLR output file.
So this will require some careful debugging that unfortunately I won't have time to do until next week. In the meantime, Excel seems to have much less trouble opening both files (although it still fails to recognize that they are encoded as UTF-8, which causes some malformed characters in the MAI data and seems to cause a bad line break in the SOLR data). I managed to open the MAI data in Excel, save it as an Excel workbook, and import it to a dummy Access DB, although there were import errors. I'd recommend you try that and work with whatever it gets you for now?
Unfortunately as mentioned above though, this didn't work with the SOLR file even though it seemed to improve the field quoting. It might be worth seeing if you can at least get some of the lines of the file to import so that you can see the structure before I get a chance to debug it? FWIW, LibreOffice Calc opens both files flawlessly, although I can't get it to write to a form that Access likes.
Sorry I can't be more help for now. In general I'd say for now feel free to hack whatever you can get out of these files so you can mock things up and then I'll try to fix it properly next week...
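One possible workaround to try in the meantime, assuming the encoding really is the culprit (a sketch; the output filename is made up): rewrite the CSV with a UTF-8 byte-order mark, which is the hint Excel needs to pick the right code page:

```python
# Sketch: rewrite the export with a BOM ("utf-8-sig") so Excel decodes
# it as UTF-8 instead of the system code page.
import pandas as pd

df = pd.read_csv("solr_output_2020-04-07_162312.csv", encoding="utf-8")
df.to_csv("solr_output_bom.csv", index=False, encoding="utf-8-sig")
```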
Yeah, I'll go after this more -- just hoping there was some secret alt-sauce you'd already cooked up. Access does seem a bit behind Excel in this, doesn't it. In the end, we may well need to go after this together, but I'll own it for the balance of the week at least.
2aa95b3 (and f27e6f76) clean up the list-columns (multiple-entry cells, of which I don't think there are any, are concatenated with |||). Merged into the main branches and tested in all three deployments.
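For the record, the cleanup logic is roughly this (a simplified sketch, not the literal commit): multi-valued SOLR fields come back as Python lists, so join them with ||| instead of letting pandas print them as "[...]"-formatted strings:

```python
# Sketch: join list-valued cells with ||| so the CSV holds plain strings.
import pandas as pd

def flatten_lists(df):
    return df.applymap(
        lambda v: "|||".join(map(str, v)) if isinstance(v, list) else v
    )

# Tiny demonstration with a fake multi-valued Subject cell.
demo = pd.DataFrame({"id": ["abc"], "Subject": [["protest", "boycott"]]})
print(flatten_lists(demo).loc[0, "Subject"])  # protest|||boycott
```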
Okay, @johnklemke (and @olderwoman), for reasons I don't really understand (since the only effect of this should have been to remove [ and ] characters from the output), cleaning up the multiple-entry columns seems to fix import into MS Access. I had almost no problems importing this into a new table after setting the Text Qualifier to " and the Code Page to UTF-8. It looks like it did truncate most of the Links and a few of the Subject entries, but, well, I don't think we really care about those anyway.

So have a crack at the new file (the one that starts with bpp_) in gdelt/Skalinder/SOLR_exports? Meantime I'll move this issue to the user testing column, but if everything works let me know and I can close it.
John will need to do the Access test. I’m just now getting free to look at the .csv file using Stata to do some checks. Will report.
This file imported into Stata with no woes.
I have succeeded, in Access-side testing, at importing the fields Pam designated as needed into a table consisting of just those fields. The fields are named as in the .csv header line, and all are imported as text fields. I've used an Access query to append a subset of this data in this table (just the fields needed for Pass 2, but all rows) to tblArticles. This will be tested as soon as I get Pass 2 under test for the black newspaper data. Probably could be closed now, or held until we can run the whole shebang and bless it all.
Okay, sounds good! I'll close this for now -- if problems arise in the future we can re-open this if they're closely-enough related to make the info in this thread useful, or otherwise we can open a new issue.
This is holding up import to pass 2 atm. @johnklemke has this for the newswire articles, but not for the black newspaper articles. At some point in the past @alexhanna generated this somehow, and obviously there's some good way to fish all the text out of SOLR and into a csv or something, but I don't know what that way is yet.