davidskalinder / mpeds-coder

MPEDS Annotation Interface

Kill SOLR? #59

Closed davidskalinder closed 3 years ago

davidskalinder commented 4 years ago

This would be a big change under the hood of MAI, so I'm mentioning the whole team here since I don't want to do this without consultation: @olderwoman, @matthewsmorgan, @chaeyoonlim, @johnklemke. Also I think deciding this is a fairly high priority, since it will affect how I implement a bunch of other stuff (like pressing issues #58, #56, #45, as well as slightly-less-pressing issues #2 and possibly #53); and I know some people will be out of town the next few Mondays so I thought it'd be a good idea to raise the issue here. So please let me know if you have opinions, or if I can help clarify anything...

I'm thinking of excising SOLR from MAI altogether. Not right away necessarily, but moving in that direction. @alexhanna confirms that SOLR is part of MAI mostly for legacy reasons and because it's useful for MPEDS. Below are the pros and cons I can think of.

Pro:

  1. MAI and subsequent work need a bunch of information that right now is only stored in SOLR. We've recently put publication date into MAI, and we currently have needs for publication name, publication location, and article text; but I expect more will arise.
    1. This info usually needs to be joined in clever ways to info that only MAI stores (e.g., to show pub name for articles in the current project, or that coders have already touched, or that have been assigned, etc. etc.). These joins are much harder when the info is stored in SOLR, whereas most would be fairly trivial if the info were in the MAI DB.
  2. Removing SOLR from MAI would simplify MAI setup considerably.
  3. We could more strictly enforce some constraints on the data (such as ensuring there are no duplicates, ensuring certain fields are always present, etc.).
  4. All information needed for an MAI project would be strictly associated with that project, not shared between many projects that are all using the same SOLR storage.

Con:

  1. We risk having an extra version of the truth. In theory, the text of an article could change in SOLR and not in MAI.
    1. This is a particular risk during the cross-over period when some fields are in MAI and some aren't. (This has already been the case for a long time with article title, and for a short time with publication date.)
    2. However 1: I don't think we know of any reason why these fields should ever change except due to error.
    3. However 2: There is already an extra version of the truth in the pass 2 DB, and even if we eliminate that there'll still be the files downloaded from the news services; so if the line of what counts as the Truth is being drawn arbitrarily anyway, we might as well draw it after MPEDS has finished using SOLR?
  2. Depending how we do it, we might lose some of the data stored in some of SOLR's more obscure fields.
    1. However 1: We're not using most of these fields anyway (do we care about an article's Number_of_pages, ProQuest_document_ID, or Accession_number?). If we do need some, perhaps we could make a list.
    2. However 2: If we need everything, we could in theory load every field into a long-format table in MAI (with columns id, field_name, and field_content or some such); though this would make some joins a little more difficult.
  3. Loading new articles would become more difficult because we'd need all the info for the articles rather than just the SOLR IDs (and optionally, article titles and publication dates) that we need now.
  4. We'd have to worry slightly more about how many articles we store. Tens of thousands is almost certainly fine, millions might be less fine.

Implementation-wise, I think at the moment the only thing that actually uses SOLR at all is the controller's loadSolr function which loads up the article text! If that's right, then there wouldn't be very much to break with this change (in MAI itself anyway). There are also some bits (such as the controller's convertIDToPublication function) that chisel information out of the SOLR IDs and could be implemented better. It's harder to know about pass 2 without knowing exactly what input @olderwoman's Stata scripts expect, although I'm relatively confident that we can reproduce old inputs using information in MAI as well as we would be able to with information in SOLR?
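To make the joins point above concrete, here is the kind of query that becomes straightforward once article metadata lives in the MAI database. This is only a sketch: the table and column names (article_metadata, article_queue, coder_article_annotation) are hypothetical rather than the current MAI schema, and the connection string is a placeholder.

```python
# Sketch only: table and column names are hypothetical, not the current MAI schema.
from sqlalchemy import create_engine, text

engine = create_engine("mysql://user:password@localhost/mai")  # placeholder credentials

# "Show publication name and date for every article in the current project, with a
# count of how many annotations coders have made on it" -- trivial as a join once
# the metadata is in MySQL, awkward while it only lives in SOLR.
query = text("""
    SELECT m.publication, m.pub_date, a.article_id, COUNT(ann.id) AS n_annotations
    FROM article_queue AS a
    JOIN article_metadata AS m ON m.article_id = a.article_id
    LEFT JOIN coder_article_annotation AS ann ON ann.article_id = a.article_id
    WHERE a.project = 'bpp'
    GROUP BY m.publication, m.pub_date, a.article_id
""")

with engine.connect() as conn:
    for row in conn.execute(query):
        print(row)
```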

olderwoman commented 4 years ago

Answers to some of these below

However 2: There is already an extra version of the truth in the pass 2 DB, and even if we eliminate that there'll still be the files downloaded from the news services; so if the line of what counts as the Truth is being drawn arbitrarily anyway, we might as well draw it after MPEDS has finished using SOLR?

PAM: (1) FWIW we do have the original files we got from gigaword. They are annoying I think because I originally purchased the wrong version, so there is markup that Alex had to strip out. But we have it. Or should have it. It came on a hard drive and I’m pretty sure we made a copy of it somewhere, although I’d have to look for it. (2) The thing that isn’t easy for us to replicate is the results of MPEDS.

However 2: If we need everything, we could in theory load every field into a long-format table in MAI (with columns id, field_name, and field_content or some such); though this would make some joins a little more difficult.

PAM: The download information from Proquest probably should be retained. Each source of files will have a different set of fields, so the Gigaword files have different fields from the Proquest files. It would probably make sense to have a landing location for each input of files (Gigaword, Ethnic Newswatch, Proquest searches at this point) that are all files pre-MPEDS, then a storage location for the results of the MPEDS selection process (the subset of articles that go to MAI). We often need to do some rough comparisons of the original bank of stories and the MPEDS results for methodological documentation, and this is one thing it has been hard to generate.

We'd have to worry slightly more about how many articles we store. Tens of thousands is almost certainly fine, millions might be less fine.

PAM: So this is because we’d have multiple copies of the same files or because the storage would be less efficient?

davidskalinder commented 4 years ago
  • However 1: I don't think we know of any reason why these fields should ever change except due to error.
  • However 2: There is already an extra version of the truth in the pass 2 DB, and even if we eliminate that there'll still be the files downloaded from the news services; so if the line of what counts as the Truth is being drawn arbitrarily anyway, we might as well draw it after MPEDS has finished using SOLR?

PAM: (1) FWIW we do have the original files we got from gigaword. They are annoying I think because I originally purchased the wrong version, so there is markup that Alex had to strip out. But we have it. Or should have it. It came on a hard drive and I’m pretty sure we made a copy of it somewhere, although I’d have to look for it. (2) The thing that isn’t easy for us to replicate is the results of MPEDS.

Yeah, that's what I meant by the downloaded files -- I'm pretty sure we still have them for all the article sources. So to clarify, my concern is that if anything downstream of those files somehow gets out of sync with them, then it could be difficult to find and reconcile what's changed (though I think my two howevers above mitigate this worry considerably).

  1. Depending how we do it, we might lose some of the data stored in some of SOLR's more obscure fields.
    • However 1: We're not using most of these fields anyway (do we care about an article's Number_of_pages, ProQuest_document_ID, or Accession_number?). If we do need some, perhaps we could make a list.
    • However 2: If we need everything, we could in theory load every field into a long-format table in MAI (with columns id, field_name, and field_content or some such); though this would make some joins a little more difficult.

PAM: The download information from Proquest probably should be retained. Each source of files will have a different set of fields, so the Gigaword files have different fields from the Proquest files.

Yes. In fact, there are hundreds of different fields currently stored in SOLR: I believe they vary not only from source to source but from article to article. A few, like publication date, title, text, and (I think) publication name are probably present for every article, but a lot of the fields probably only appear in a few articles.

Also, I think any version of what I'm proposing would still retain the SOLR ID in MAI -- we just wouldn't use it very much. But that would still give us a way to find our way back to any information that's in SOLR about any particular article. So I'm struggling to come up with a way that this issue would make it more difficult than it already is to trace an article back to all the other places it exists... I guess if we stop paying attention to SOLR because most of the stuff we need is in MAI and then the SOLR database gets erased? In which case we could still (albeit with a fair amount of work) search the download files for stuff like the article title and date and so on...

I guess on the other hand, if I don't move a bunch of info to MAI's database, then I'll have to build some more helper tools to query SOLR more efficiently. So there could be an opportunity cost of moving information to MAI and increasingly ignoring SOLR.

It would probably make sense to have a landing location for each input of files (Gigaword, Ethnic Newswatch, Proquest searches at this point) that are all files pre-MPEDS, then a storage location for the results of the MPEDS selection process (the subset of articles that go to MAI).

As it stands, the article information that we got from the news sources lives in: the downloaded files from the sources, the SOLR database that MPEDS uses, and the pass 2 database, all connected with extract-transform-loads. At the moment, MAI stores hardly anything about the articles -- the interface just calls up SOLR and asks for the article info that corresponds to a certain ID.

I'm suggesting that we change that, treat SOLR as a data provider (as we would treat Proquest or the GSS), and store what we need in MAI itself, treating that as our canonical data. So the new setup would be that info we get from the news sources lives in downloaded files from the sources, the SOLR database that MPEDS uses, the MAI database, and the pass 2 DB.

We often need to do some rough comparisons of the original bank of stories and the MPEDS results for methodological documentation, and this is one thing it has been hard to generate.

I think this analysis must happen upstream of MAI anyway? So any changes to MAI or subsequent processes wouldn't change anything, unless I'm missing something?

  1. We'd have to worry slightly more about how many articles we store. Tens of thousands is almost certainly fine, millions might be less fine.

PAM: So this is because we’d have multiple copies of the same files or because the storage would be less efficient?

More like the second one. I suppose we would be using more disk space technically, but I don't think that's much of a worry. It's more that I don't know how fast MySQL could do things like sort millions of 5,000-word articles, whereas SOLR has lots of tricks to do stuff like that quickly.

It's harder to know about pass 2 without knowing exactly what input @olderwoman's Stata scripts expect, although I'm relatively confident that we can reproduce old inputs using information in MAI as well as we would be able to with information in SOLR?

PAM: My Stata scripts are accessible on gdelt and can be readily parsed for inputs expected.

True! Although occasionally the specifics of the fields they expect are a little hard to locate (for me, anyway!). But as I say, bringing some SOLR fields into MAI should make it easier to access more information (indeed this is the biggest motivation for it), so I'm pretty confident that we can reproduce the current inputs one way or another if we need. It's possible that the changes could result in annoyances like changed column names, for example, but we should be able to track these down or work around them.

I recognize there are a lot of complexities with this issue; maybe let me know if you want to set up a call (perhaps with an appropriate subteam) to discuss synchronously?

davidskalinder commented 4 years ago

Outcome from team meeting: kill, baby, kill. We should use a long table, both to get as much as possible from SOLR and to stay flexible for the possibility of different fields in future article batches.
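For concreteness, a minimal sketch of what the long-table idea could look like, using SQLAlchemy with illustrative names (this is not a final schema):

```python
# Illustrative sketch of the long-format idea: one row per (article, SOLR field),
# so any field from any source fits without schema changes. Names are made up.
from sqlalchemy import Column, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ArticleField(Base):
    __tablename__ = "article_field"

    id = Column(Integer, primary_key=True)
    solr_id = Column(String(255), index=True)     # keeps the way back to SOLR
    field_name = Column(String(255), index=True)  # e.g. "PUBLICATION", "DATELINE"
    field_content = Column(Text)

if __name__ == "__main__":
    # Stand-in engine just to show the table definition is well-formed.
    Base.metadata.create_all(create_engine("sqlite://"))
```

The flexibility comes at the cost raised in the next comment: pulling several fields for one article means one lookup (or self-join) per field instead of a single wide row.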

I'm leaving this issue open for now until details get moved into the right implementation threads.

alexhanna commented 4 years ago

I am just getting to reading this right now.

I think storing as a long table loses a lot of the join power you get from moving Solr to MySQL. Even if there are a number of different columns which appear for different publications in Solr, it makes sense to more firmly structure the MySQL schema for articles.

I do think the biggest problem with this will be size and memory. Again, I don't know how MySQL performs with big text blobs memory-wise. But given that we will be using a unique index to store Solr IDs, memory shouldn't (?) be an issue because we're not churning through big unstructured text blobs to find particular strings.

THAT SAID: @olderwoman, if you did want to do full-text searches like you have in the past to identify articles that mentioned particular words, Solr will be much better for that. MySQL is not built for that and it will likely slow the server down a lot.
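For reference, this is the sort of full-text query being described, expressed with pysolr; the URL, core name, and field names here are placeholders:

```python
# Hedged sketch: assumes the articles sit in a SOLR core at this URL and that
# the article body is stored in a TEXT field. All names are placeholders.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/articles", timeout=30)

# Find articles mentioning "protest" or "riot" anywhere in the body text.
results = solr.search("TEXT:(protest OR riot)", rows=50, fl="id,TITLE,PUBLICATION,DATE")
for doc in results:
    print(doc.get("DATE"), doc.get("PUBLICATION"), doc.get("TITLE"))
```

Running the same sort of query in MySQL would mean scanning or full-text indexing millions of long documents, which is the slowdown being warned about here.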

davidskalinder commented 4 years ago

I think storing as a long table loses a lot of the join power you get from moving Solr to MySQL. Even if there are a number of different columns which appear for different publications in Solr, it makes sense to more firmly structure the MySQL schema for articles.

So @olderwoman, having talked to @alexhanna, I'm coming around to her point of view on this. I think the bottom line is that there's only about a half dozen fields that MAI would likely ever need to know about an article, and there's a heckuva lot of junk fields in SOLR now that we'll probably never need (and can go back to the sources for if we do).

So the benefit of using a long table that can store everything is probably small; the cost is that querying a long table will be harder to code and less efficient (and indexing it to speed up queries would probably be prohibitively storage-intensive); this might, for instance, cause loading lags for coders when loading new articles, especially for databases with lots of articles.

I think I'm a little less worried about the downside than @alexhanna is, but I do see her point. I think a sensible way to proceed will be to get the (long) full list of fields in SOLR -- ideally with some counts of how many articles have them (though alas this makes the task much trickier) -- and to assess which ones we actually care about? I think that will give us a better picture of what we'd lose with a wide-table format that only keeps those important fields... So unless I hear otherwise I'll work on getting that list together so we can have a look at it.

THAT SAID: @olderwoman, if you did want to do full-text searches like you have in the past to identify articles that mentioned particular words, Solr will be much better for that. MySQL is not built for that and it will likely slow the server down a lot.

This is true, and I hadn't thought of it before. @olderwoman, I'm guessing that in the past you've used this kind of searching mainly to audit the haystack task? If so then I think making MAI independent of SOLR won't make any difference; but maybe I'm overlooking some value of being able to search article text?

olderwoman commented 4 years ago

There are a variety of reasons to search article text for non-MPEDS articles. I’ve used it for (a) checking the screening of articles for US locations, in which I checked a few hundred and found a low % of errors, all of them involving mentions of other countries that confused the sorter. (b) trying to figure out why the Cincinnati riot articles selected by MPEDS didn’t appear until April 12, verifying that the newswires DID cover the events earlier but did not use the words protest or riot, which is probably why they didn’t get selected. (c) checking to see whether events referred to retrospectively in our data really were not originally covered in the newswires. (Mostly they were not, sometimes they actually were and MPEDS missed them for one reason or another.)

I also occasionally search the newswires for other reasons, e.g. to check whether a Black protest event mentioned in a book was or was not covered by them.

I would expect to be doing similar searches in the Black newspapers as the project moves forward.


davidskalinder commented 4 years ago

There are a variety of reasons to search article text for non-MPEDS articles. I’ve used it for (a) checking the screening of articles for US locations, in which I checked a few hundred and found a low % of errors, all of them involving mentions of other countries that confused the sorter. (b) trying to figure out why the Cincinnati riot articles selected by MPEDS didn’t appear until April 12, verifying that the newswires DID cover the events earlier but did not use the words protest or riot, which is probably why they didn’t get selected. (c) checking to see whether events referred to retrospectively in our data really were not originally covered in the newswires. (Mostly they were not, sometimes they actually were and MPEDS missed them for one reason or another.) I also occasionally search the newswires for other reasons, e.g. to check whether a Black protest event mentioned in a book was or was not covered by them. I would expect to be doing similar searches in the Black newspapers as the project moves forward.

Hmm, so it sounds like b) and c) are instances of checking MPEDS's haystack performance? And so it makes sense for them to use MPEDS's data source (SOLR) to do them? But it sounds like a) might be to audit something that happened in MAI; and it's hard to tell for other searches.

In the past, have you done this kind of searching using the SOLR admin interface? If text search really is a high priority for the coding work that we do with MAI (as opposed to the haystack work that we do with MPEDS), then maybe moving MAI away from SOLR isn't a good idea; although it sounds to me like most of these text-searching tasks are somewhat peripheral to MAI?

olderwoman commented 4 years ago

Let’s see, yes b & c check the haystack. a) checked the geolocation routine, which was something Alex did between MPEDS and putting things into MAI, I believe. She first sent me a spreadsheet of ALL MPEDS-selected articles then, a couple days later, a subset that were geolocated to the US, to reduce the large # of international-only stories to wade through. I looked up a sample of articles that were in the original spreadsheet but not the second spreadsheet. Yes most of my other text searching is peripheral to MAI and more about taking advantage of the fact that I have a large cache of news sources to check facts in. This last is an intermittent task, so having the searches be relatively slow or inefficient would not be a big problem I don’t think.


davidskalinder commented 4 years ago

Let’s see, yes b & c check the haystack. a) checked the geolocation routine, which was something Alex did between MPEDS and putting things into MAI, I believe. She first sent me a spreadsheet of ALL MPEDS-selected articles then, a couple days later, a subset that were geolocated to the US, to reduce the large # of international-only stories to wade through. I looked up a sample of articles that were in the original spreadsheet but not the second spreadsheet. Yes most of my other text searching is peripheral to MAI and more about taking advantage of the fact that I have a large cache of news sources to check facts in. This last is an intermittent task, so having the searches be relatively slow or inefficient would not be a big problem I don’t think.

Okay, so it sounds like most of the important uses of text search are MPEDS-adjacent and so can happily live in whatever universe MPEDS lives in without staying coupled to the MAI universe, huh...

Just one point to clarify: my (and, I think, @alexhanna's) worry about slow/inefficient querying isn't so much for the text search functionality (which would be "slowed down" by requiring the user to visit the SOLR interface, or the source files); it's that putting whole articles and/or lots of junk article fields into a (long-format, especially) MAI table might slow down everything MAI does with those articles (such as display them to coders). Which would obviously be bad.

So it sounds like we should continue with our plan for, ahem, SOLRicide, but review the fields in SOLR to see if we can avoid a long-format table by surviving without most of them...

olderwoman commented 4 years ago

One possibility re multiple fields is to list the TYPES of fields we need and map source fields into our fields as part of the inputting process


davidskalinder commented 4 years ago

One possibility re multiple fields is to list the TYPES of fields we need and map source fields into our fields as part of the inputting process

Yes, I think this is more or less how it'll be (and indeed I suspect @alexhanna's scripts are doing a bit of this already to load the source files into SOLR). Hopefully the field list will clarify.
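A rough sketch of what that mapping could look like at load time, with made-up field names and sources (the real mapping would come out of the field review):

```python
# Illustrative sketch of mapping source-specific fields onto a fixed set of
# MAI fields at load time. The mappings and field names are hypothetical.
FIELD_MAPS = {
    "proquest": {
        "DATE": "pub_date",
        "PUBLICATION": "publication",
        "Place_of_publication": "pub_location",
        "TEXT": "text",
        "TITLE": "title",
        "id": "solr_id",
    },
    "lexisnexis": {
        "DATE": "pub_date",
        "PUBLICATION": "publication",
        "DATELINE": "pub_location",
        "TEXT": "text",
        "TITLE": "title",
        "id": "solr_id",
    },
}

def map_article(raw_doc, source):
    """Return only the MAI-relevant fields, renamed consistently across sources."""
    mapping = FIELD_MAPS[source]
    return {mai_name: raw_doc.get(src_name) for src_name, mai_name in mapping.items()}

# Example:
# map_article({"DATE": "1994-05-01", "PUBLICATION": "Chicago Citizen; Chicago, Ill.",
#              "TEXT": "...", "TITLE": "...", "id": "abc123"}, "proquest")
```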

alexhanna commented 4 years ago

That list is somewhat short. Only the metadata which is guaranteed to be in Solr, which is listed here.

davidskalinder commented 4 years ago

That list is somewhat short. Only the metadata which is guaranteed to be in Solr, which is listed here.

Hah, yep, that list is powerful short. I think the issue is whether there's other stuff that we can or should be using (which we don't know about until we review the (long) longlist of fields)? It would be just peachy if we really only need those six (plus text) though...

davidskalinder commented 4 years ago

That list is somewhat short. Only the metadata which is guaranteed to be in Solr, which is listed here.

Hmm, so @alexhanna, I'm finally to the point where I can actually see some data for this, and I notice that several of those fields frequently don't appear -- looking just at my little test dataset of 19 articles about Wisconsin, none of them contain 'SECTION', 'BYLINE', or 'DATELINE'; and now that I think about it, several fields that every article must have (such as 'TITLE' and 'TEXT') aren't in the linked list. Might that list have changed, or maybe there's another bit of code somewhere that guarantees that certain fields show up in SOLR?

davidskalinder commented 4 years ago

All right, @olderwoman, @alexhanna, @matthewsmorgan, @limchaeyoon, @johnklemke: I've now got a bunch of other issues to the point where they can provide some data for this question. Below is a list of the fields in use for the 6453 articles currently in our production deployment. (Note that, as for the articles mentioned above, none of these contain DATELINE, BYLINE, or SECTION, although some articles do have a field called Section.)

Also, I dumped everything SOLR had stored for these articles to a CSV file, which is at gdelt/Skalinder/SOLR_exports/solr_output_2020-04-07_162312.csv.

So I think the object of the game here is to try to ensure that we won't need more than a small number of these fields for MAI or subsequent work (although of course everything will still be stored somewhere to audit MPEDS's article selection). I haven't really looked at the substance of any of this yet, but I'll try to do that soon and make a recommendation. In the meantime, recommendations from y'all are welcome!
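For reference, a rough sketch of how a dump and per-field count like the ones here could be produced; pysolr, the query, and the output path are illustrative, not necessarily the script actually used:

```python
# Illustrative only: dump every stored field for the deployment's articles to CSV
# and count how many articles have each field non-empty.
import csv
from collections import Counter

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/articles", timeout=60)  # placeholder URL

docs = list(solr.search("*:*", rows=10000))  # few enough articles here to fetch in one go
fields = sorted({key for doc in docs for key in doc})

with open("solr_output.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fields)
    writer.writeheader()
    for doc in docs:
        writer.writerow({field: doc.get(field, "") for field in fields})

counts = Counter(key for doc in docs for key, value in doc.items()
                 if value not in (None, "", []))
for field in fields:
    print(f"{field:<25} {counts.get(field, 0):>6}")
```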

Here's the current field usage for our 6453 articles in production:

Nonmissing entries in each column:
DATE                       6453
INTERNAL_ID                6453
PUBLICATION                6453
TEXT                       6453
TITLE                      6453
Abstract                   6453
Accession_number           2485
Author                     5501
Copyright                  6453
Country_of_publication     6453
DOCSOURCE                  6453
Database                   6453
Document_URL               6453
Document_feature           2483
Document_type              6453
Ethnicity                  6107
ISSN                       4363
Issue                      6017
Language_of_publication    6453
Last_updated               6453
Links                      6317
Location                   4773
Number_of_pages            6037
People                     3199
Place_of_publication       6453
ProQuest_document_ID       6453
Publication_date           6453
Publication_subject        6453
Publication_year           6453
Publisher                  6453
Section                    1373
Source_type                6453
Subject                    5998
Volume                     6015
Year                       6453
_version_                  6453
id                         6453

Of course let me know if anything needs clarification...

olderwoman commented 4 years ago

I’ve got to do some other stuff today and then have several meetings tomorrow so may not make a lot of progress on this until Thursday. OK to complain if I have not responded by Thursday April 9.

davidskalinder commented 4 years ago

I'm not entirely sure what's the best way to compile opinions on this, so I'll post this here for now but anyone can feel free to suggest a better location. Having looked at the fields, I think the following almost certainly should be kept through post-MPEDS steps:

DATE
Document_URL
PUBLICATION
TEXT
TITLE
id

I think the following can almost certainly be abandoned:

Abstract
Accession_number
Author
Country_of_publication
Document_feature
Ethnicity
ISSN
Issue
Language_of_publication
Number_of_pages
Publication_date
Publication_subject
Publication_year
Section
Source_type
Subject
Volume
Year

And I'm not sure about the following ones -- either I need to check more carefully to see whether they are duplicated elsewhere or I can't tell whether they add any value to post-MPEDS processes:

Publisher
Copyright
DOCSOURCE
Database
Document_type
INTERNAL_ID
Last_updated
Links
Location
People
Place_of_publication
ProQuest_document_ID
_version_

Incidentally, it seems that Publisher sometimes disagrees with what's before the semicolon in PUBLICATION, which seems odd; I'll try to find the code (probably here? https://github.com/alexhanna/news-parsers/blob/master/proquest.py) that builds the PUBLICATION field later to make sure it did it correctly (unless @alexhanna can straighten me out?)...

olderwoman commented 4 years ago

See comments below. In short, we need more carefully to review exactly what is in these fields from different sources.

I think the following almost certainly should be kept through post-MPEDS steps:

DATE // is this publication date?
Document_URL
PUBLICATION
TEXT
TITLE
id

I think the following can almost certainly be abandoned:

Abstract
Accession_number // This would be the pointer back to the data source, possibly
Author
Country_of_publication
Document_feature
Ethnicity
ISSN // check what this is, may be relevant
Issue // part of citation for some sources as in volume:issue
Language_of_publication
Number_of_pages
Publication_date // this is needed
Publication_subject
Publication_year
Section // this is part of a newspaper citation
Source_type // need to see what these are
Subject
Volume // part of a citation
Year

And I'm not sure about the following ones:

Publisher
Copyright
DOCSOURCE // what are the values, may be needed
Database // we need this
Document_type
INTERNAL_ID // this is probably needed
Last_updated // possibly part of the citation
Links
Location // in some cases may be the dateline, may be needed
People
Place_of_publication // needed if not part of the citation
ProQuest_document_ID // this is definitely needed to document the file
_version_

davidskalinder commented 4 years ago

Tl;dr: Based on feedback from @alexhanna and @olderwoman, I'm starting to think that for now at least, we shouldn't worry much about any SOLR fields that MAI doesn't use already.

Non-tl;dr version:

Talking to @alexhanna this morning, as usual, helped clarify this. Basically, there are very few fields that are guaranteed to be in SOLR. (It's not clear exactly which ones, but it's close to all the items in all-caps in the lists above.) Any other fields might vary by data source. Fundamentally, I think that means that MAI and any subsequent processes can't rely on anything except those fields being present (unless we do a great deal of cleaning of SOLR's several million records).

Also, reading through @olderwoman's responses, I think I'm getting a clearer idea of things we'd like to do with whatever data we carry beyond SOLR. Here's the list as I see it:

  1. Ensure that MAI and subsequent passes have the data necessary to code and analyze
  2. Be able to find any article in SOLR
  3. Be able to find any article within the files that were loaded into SOLR
  4. Be able to request any article from an external database provider (ProQuest/LexisNexis/etc.)
  5. Be able to request any article from the original publication
  6. Be able to cite any article within an academic bibliography

Goal 1 is actually relatively easy, since at the moment MAI only uses a small number of SOLR's fields, and there aren't many more that we want to start using. Goal 2 is satisfied by SOLR's id field (which subsequent processes call SOLR_ID). I think goal 3 is satisfied, as well as it can be, by the DOCSOURCE and INTERNAL_ID fields (and in cases where these are ambiguous, searches of the TITLE or even TEXT contents would probably be sufficient, and probably as good as we can do).

Goals 4, 5, and 6 seem trickier because they all require information that isn't stored in SOLR consistently. (A notable example is the EDITION field, which seems to distinguish different versions of articles in the New York Times but doesn't exist for any other publication; also, of course, every data source will likely have its own, and possibly several, ways to identify any particular article.) Having said that, we should recall that at the moment, MAI and (I think?) subsequent processes can't accomplish goals 3, 4, 5, or 6 either: the only way to get information like this currently is to go back to upstream sources.

I think this means that for this issue (which is, or at least should be, to replicate current functionality without SOLR), we should focus on goals 1, 2, and maybe 3 to be on the safe side; goals 4, 5, and 6 seem like new feature requests and therefore should be treated as separate issues?

I also have line-by-line comments on @olderwoman's responses above, which I'll post in a separate comment.

davidskalinder commented 4 years ago

Specific replies to the points raised by @olderwoman:

See comments below. In short, we need more carefully to review exactly what is in these fields from different sources.

Just to be clear, my recommendations here were based on looking at the articles currently in our production version (which look like they came from Proquest?). But as noted above, most of these fields might be completely different in articles from different sources.

DATE // is this publication date?

It seems to be a cleaned-up version of publication date, yes.

Accession_number // This would be the pointer back to the data source, possibly

Only about a third of the articles have this field, so I doubt it; and also, other fields (such as Document_URL and ProQuest_document_ID) seem to be unique identifiers for the articles. Probably a good idea to find out what "accession number" means in order to be sure though.

ISSN // check what this is, may be relevant

This is missing for about two-thirds of the articles, and for the rest it looks like each publication has its own ISSN, so hundreds of articles all have the same number; so I assume it's not doing us much good in the database, but of course others might know better...

Issue // part of citation for some sources as in volume:issue

Ah, yes, if we want to keep all the citation information then perhaps we should keep this. (EDIT: but see https://github.com/davidskalinder/mpeds-coder/issues/59#issuecomment-611150228)

Publication_date // this is needed

Is there a reason why we would need both this and the more-consistently-formatted DATE field with the same information?

Section // this is part of a newspaper citations

Is this for ASA style, or some other style? I can't find anything about sections in my ASA 4th ed., and I don't recall needing these for other styles, but you certainly know better than I. These are frequently missing and often slightly silly, FWIW (e.g., "BACK TO SCHOOL", "BILL'S BUSINESS", or often simply "NEWS").

Source_type // need to see what these are

For this particular dataset, it's the word "Newspapers" for all 6,453 entries.

Volume // part of a citation

My reading of ASA, CMOS, and APA styles is that volume is only used for journals? But again, happy to defer to you on this stuff.

DOCSOURCE // what are the values, may be needed

Check the CSV for the full list, but the values are things like these:

Proquest (ethnic_newswatch_publication_files/Ethnic Newswatch Publication Files/Ethnic News Hyde Park 1991-2015 1-1000.txt)
Proquest (ethnic_newswatch_publication_files/Ethnic Newswatch Publication Files/EthnicNews Chicago Citizen 1991-2015 2001-2700.txt)

These seem to be the names of the files that were loaded into SOLR. There are between like 10 and 120ish articles from each of these source files.

Database // we need this

What made me hesitate is that this is simply "Ethnic NewsWatch" for all 6,453 articles. But if we're not storing this info elsewhere then yeah maybe we need to keep this field.

INTERNAL_ID // this is probably needed

INTERNAL_ID is almost the same as id but with an integer on the end instead of a hash. (id is what is elsewhere called SOLR_ID.) @alexhanna reports that the integer is a not-very-consistent identifier within the source file, usually something like the 'x' in the source files' "Article x of 500" lines.

Last_updated // possibly part of the citation

I'm guessing this indicates when it was updated in some database or other, since 1,905 articles all have the same value. However some of the dates are in 2010, so I doubt it's for any of our systems. Maybe this is when it was updated in ProQuest?

Location // in some cases may be the dateline, may be needed
People

Yeah I wasn't sure whether either or both of these would come in handy. But FWIW, I don't think either of these have ever been accessible except via a custom query using SOLR's web interface. Also @alexhanna indicates that fields like these are often inconsistent and might be totally different or missing in different data sources.

Place_of_publication // needed if not part of the citation

Not sure what you mean by citation, but for these articles at least, Place_of_publication is always identical to what's after the semicolon in PUBLICATION (though I don't know whether this will be true for other data sources). Again, I don't think this field has been in MAI or in pass 2 up to now.

ProQuest_document_ID // this is definitely needed to document the file

In these articles, this ID is always part of the Document_URL field; since that field also includes other information, I recommended we keep that instead. However other data sources probably include neither a ProQuest_document_ID nor a Document_URL field.

olderwoman commented 4 years ago

Basically what we need is a datafile that has all fields minus the full text so we can figure out which sources use which fields and what is in them. There is no way to figure this out just by looking at frequencies.

davidskalinder commented 4 years ago

Basically what we need is a datafile that has all fields minus the full text so we can figure out which sources use which fields and what is in them. There is no way to figure this out just by looking at frequencies.

Uh, yes, sorry -- from my comment yesterday:

Also, I dumped everything SOLR had stored for these articles to a CSV file, which is at gdelt/Skalinder/SOLR_exports/solr_output_2020-04-07_162312.csv.

I guess let me know if you're having trouble getting to that file and I can put a copy someplace more convenient? (It's about 38MB, so unfortunately email probably won't work.)

olderwoman commented 4 years ago

Ah, good. I’ll review that tomorrow or Friday. Bug me if I have not done it by midday Friday.


davidskalinder commented 4 years ago

I just realized it would be easy and valuable to also produce a CSV and field counts for @alexhanna's campus protest articles to get a sense of how different SOLR's contents are for different data sources (spoiler alert: very).

So there's now a new file at gdelt/Skalinder/SOLR_exports/campus_protest_solr_output_2020-04-09_105647.csv with all the SOLR fields for the campus protest articles.

The count of non-empty fields in both databases is as follows:

field                   | bpp  | campus_protest
DATE                    | 6453 | 9216
INTERNAL_ID             | 6453 | 6667
PUBLICATION             | 6453 | 9216
TEXT                    | 6453 | 9216
TITLE                   | 6453 | 9216
Abstract                | 6453 |
Accession_number        | 2485 |
Author                  | 5501 |
Copyright               | 6453 |
Country_of_publication  | 6453 |
DOCSOURCE               | 6453 | 9216
Database                | 6453 | 9166
Document_URL            | 6453 |
Document_feature        | 2483 |
Document_type           | 6453 |
Ethnicity               | 6107 |
ISSN                    | 4363 |
Issue                   | 6017 |
Language_of_publication | 6453 |
Last_updated            | 6453 |
Links                   | 6317 |
Location                | 4773 |
Number_of_pages         | 6037 |
People                  | 3199 |
Place_of_publication    | 6453 |
ProQuest_document_ID    | 6453 |
Publication_date        | 6453 |
Publication_subject     | 6453 |
Publication_year        | 6453 |
Publisher               | 6453 |
Section                 | 1373 |
Source_type             | 6453 |
Subject                 | 5998 |
Volume                  | 6015 |
Year                    | 6453 |
_version_               | 6453 | 9216
id                      | 6453 | 9216
BYLINE                  |      | 6107
DATELINE                |      | 40
SECTION                 |      | 6667
A                       |      | 1
AM                      |      | 1
DOCID                   |      | 50
LANGUAGE                |      | 6667
LENGTH                  |      | 6667
LOAD-DATE               |      | 6667
PUBLICATION-TYPE        |      | 6667
Q                       |      | 1
URL                     |      | 2499

So as you can see, not a huge amount of overlap...

olderwoman commented 4 years ago

Yes, this is helpful. Thanks.

alexhanna commented 4 years ago

Just for background -- 6,667 of the campus protest articles are from Lexis-Nexis, and 2,549 are scraped from the web.

davidskalinder commented 4 years ago

Another new database's worth: this time, from the black-newswires deployment:

field                   | bpp  | campus_protest | black-newswires
DATE                    | 6453 | 9216           | 2471
INTERNAL_ID             | 6453 | 6667           |
PUBLICATION             | 6453 | 9216           | 2471
TEXT                    | 6453 | 9216           | 2471
TITLE                   | 6453 | 9216           | 2471
Abstract                | 6453 |                |
Accession_number        | 2485 |                |
Author                  | 5501 |                |
Copyright               | 6453 |                |
Country_of_publication  | 6453 |                |
DOCSOURCE               | 6453 | 9216           | 2471
Database                | 6453 | 9166           |
Document_URL            | 6453 |                |
Document_feature        | 2483 |                |
Document_type           | 6453 |                |
Ethnicity               | 6107 |                |
ISSN                    | 4363 |                |
Issue                   | 6017 |                |
Language_of_publication | 6453 |                |
Last_updated            | 6453 |                |
Links                   | 6317 |                |
Location                | 4773 |                |
Number_of_pages         | 6037 |                |
People                  | 3199 |                |
Place_of_publication    | 6453 |                |
ProQuest_document_ID    | 6453 |                |
Publication_date        | 6453 |                |
Publication_subject     | 6453 |                |
Publication_year        | 6453 |                |
Publisher               | 6453 |                |
Section                 | 1373 |                |
Source_type             | 6453 |                |
Subject                 | 5998 |                |
Volume                  | 6015 |                |
Year                    | 6453 |                |
_version_               | 6453 | 9216           | 2471
id                      | 6453 | 9216           | 2471
BYLINE                  |      | 6107           |
DATELINE                |      | 40             | 2088
SECTION                 |      | 6667           |
A                       |      | 1              |
AM                      |      | 1              |
DOCID                   |      | 50             | 2471
LANGUAGE                |      | 6667           |
LENGTH                  |      | 6667           |
LOAD-DATE               |      | 6667           |
PUBLICATION-TYPE        |      | 6667           |
Q                       |      | 1              |
URL                     |      | 2499           |

davidskalinder commented 4 years ago

And another update, now with data from the bp deployment (which might have had data from the 'boycott error' era). Note that I've sorted the fields alphabetically.

As the sage @alexhanna hath foretold, the only fields common to all of these sources are _version_, DATE, DOCSOURCE, id, PUBLICATION, TEXT, and TITLE. (In fact, every article has all of these except for a few that are missing TITLE for some reason.)

field                   | bpp  | campus_protest | black-newswires | bp
_version_               | 6453 | 9216           | 2471            | 34141
A                       |      |                |                 | 44
A                       |      | 1              |                 |
AA                      |      |                |                 | 1
Abstract                | 6453 |                |                 |
Accession_number        | 2485 |                |                 |
AGE                     |      |                |                 | 2
AM                      |      | 1              |                 |
ATWATER                 |      |                |                 | 1
AU                      |      |                |                 | 1
Author                  | 5501 |                |                 |
BOESAK                  |      |                |                 | 1
BROWN                   |      |                |                 | 1
BUSH                    |      |                |                 | 5
BYLINE                  |      |                |                 | 13289
BYLINE                  |      | 6107           |                 |
CLINTON                 |      |                |                 | 4
Copyright               | 6453 |                |                 |
CORNYN                  |      |                |                 | 1
CORRECTION              |      |                |                 | 6
Country_of_publication  | 6453 |                |                 |
Database                | 6453 | 9166           |                 |
DATE                    | 6453 | 9216           | 2471            | 34141
DATELINE                |      | 40             | 2088            | 20000
DEMOCRATS               |      |                |                 | 1
DISTRIBUTION            |      |                |                 | 112
DOCID                   |      | 50             | 2471            | 14710
DOCSOURCE               | 6453 | 9216           | 2471            | 34141
Document_feature        | 2483 |                |                 |
Document_type           | 6453 |                |                 |
Document_URL            | 6453 |                |                 |
DOCUMENT-TYPE           |      |                |                 | 55
DURBIN                  |      |                |                 | 1
EDITION                 |      |                |                 | 9881
Ethnicity               | 6107 |                |                 |
FANG                    |      |                |                 | 1
GONZALES                |      |                |                 | 1
GRAHAM                  |      |                |                 | 1
GRAPHIC                 |      |                |                 | 2303
HIGHLIGHT               |      |                |                 | 15
id                      | 6453 | 9216           | 2471            | 34141
INTERNAL_ID             | 6453 | 6667           |                 | 19431
IS_DOCA                 |      |                |                 | 6880
ISSN                    | 4363 |                |                 |
Issue                   | 6017 |                |                 |
LANGUAGE                |      | 6667           |                 | 11328
Language_of_publication | 6453 |                |                 |
Last_updated            | 6453 |                |                 |
LEAD                    |      |                |                 | 1824
LENGTH                  |      | 6667           |                 | 18278
Links                   | 6317 |                |                 |
LOAD-DATE               |      | 6667           |                 | 5714
Location                | 4773 |                |                 |
MARYLAND                |      |                |                 | 1
Number_of_pages         | 6037 |                |                 |
People                  | 3199 |                |                 |
PEROT                   |      |                |                 | 4
PHOTO                   |      |                |                 | 1
Place_of_publication    | 6453 |                |                 |
PRINT_COLUMN            |      |                |                 | 6874
PRINT_PAGE_NUMBER       |      |                |                 | 6880
PRINT_SECTION           |      |                |                 | 6874
ProQuest_document_ID    | 6453 |                |                 |
PUBLICATION             | 6453 | 9216           | 2471            | 34141
Publication_date        | 6453 |                |                 |
Publication_subject     | 6453 |                |                 |
Publication_year        | 6453 |                |                 |
PUBLICATION-TYPE        |      | 6667           |                 | 2191
Publisher               | 6453 |                |                 |
Q                       |      | 1              |                 | 48
QUESTION                |      |                |                 | 1
RE                      |      |                |                 | 9
Section                 | 1373 |                |                 |
SECTION                 |      | 6667           |                 | 18141
SERIES                  |      |                |                 | 229
SOURCE                  |      |                |                 | 846
Source_type             | 6453 |                |                 |
SPECTER                 |      |                |                 | 1
Subject                 | 5998 |                |                 |
TEXT                    | 6453 | 9216           | 2471            | 34140
TITLE                   | 6453 | 9216           | 2471            | 33925
TV                      |      |                |                 | 1
TYPE                    |      |                |                 | 2963
URL                     |      | 2499           |                 | 6880
VIRGINIA                |      |                |                 | 1
Volume                  | 6015 |                |                 |
Year                    | 6453 |                |                 |
ZOGBY                   |      |                |                 | 1

olderwoman commented 4 years ago

This is a little sketchy without the tabs; can you put a tab delimited version into a spreadsheet of some sort and put it either on gdelt or in the Google drive (I’ve created a folder inside technical documentation)? I’m afraid it won’t parse right from this email.

davidskalinder commented 4 years ago

> This is a little sketchy without the tabs; can you put a tab-delimited version into a spreadsheet of some sort and put it either on gdelt or in the Google Drive (I've created a folder inside technical documentation)? I'm afraid it won't parse right from this email.

Yeah, it'll format correctly if you view the thread in a browser; however you're right that even there it's getting a bit ungainly. So I've put it into a spreadsheet in the Google Drive in Technical documentation > MAI > SOLR_field_usage.

olderwoman commented 4 years ago

thanks

davidskalinder commented 4 years ago

Thanks @olderwoman! I've copied your response to the issue thread so I can find it later:

> I have reviewed all the SOLR fields. David copied all the fields & their counts into a spreadsheet and I have annotated the spreadsheet with what the fields are and whether they are needed. Note that some fields have comparable information for different sources, e.g. Location has the same info as DATELINE. I did NOT review the campus protest info; it is not in the bp data set. The bp dataset looks like what was being used for the core MPEDS analysis: it includes LDC (the NYT database), articles selected from various news sources from LexisNexis to test MPEDS, and the full Gigaword dataset. Ethnic NewsWatch articles are in bpp. The black-newswires set is, I'm guessing, a subset of bp; perhaps we can test this theory. But I think black-newswires is what was sent to the black-newswires MAI instance.

(The spreadsheet is the one at Technical documentation > MAI > SOLR_field_usage.)

So: thanks @olderwoman, super helpful! I think we should probably discuss further on Monday? I'd say my comment upthread at https://github.com/davidskalinder/mpeds-coder/issues/59#issuecomment-611150228 (NB you might have to view the thread in a web browser to see the link) still represents the crux of my thinking on this, so it'd be good to discuss that plan and how the existing fields fit into it...

davidskalinder commented 4 years ago

A summary of the discussion from this morning's meeting:

The goals I came up with in my comment above still seem like a good list, with the added consideration that items 4-6 should also be achievable by non-BPP recipients of the dataset:

1. Ensure that MAI and subsequent passes have the data necessary to code and analyze
2. Be able to find any article in SOLR
3. Be able to find any article within the files that were loaded into SOLR
4. Be able to request any article from an external database provider (ProQuest/LexisNexis/etc.)
5. Be able to request any article from the original publication
6. Be able to cite any article within an academic bibliography

Based on my and @olderwoman's review of the existing SOLR fields, it's clear that about a half-dozen fields were already cleaned by @alexhanna's scripts when they were loaded into SOLR, while most of the other SOLR fields are of varying value and are stored inconsistently. One piece of inconsistently stored information stands out as especially valuable for coding: article location (which is sometimes stored in Location and sometimes in BYLINE, at least).

This suggests three major tasks (and some subtasks) with a clear priority order:

  1. Get the already-cleaned SOLR fields into the MAI database
    1. Replace MAI's SOLR calls with queries of this info from its own database
  2. Do any necessary cleanup of article location/byline information and get that into the MAI database
  3. Consider/do cleaning and import of any other fields that would be valuable

Getting the first of these major tasks done should mostly cover goals 1-3 in the list of goals above. The second major task will provide a high-value boost to goal 1 (by adding some important functionality to the coding interface). The last major task will mostly focus on goals 4-6 above, which will be important down the road but so far have not been doable in any systems downstream from SOLR (and so will stay that way for the time being).
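
To make the first major task a bit more concrete, here's a rough sketch of what the ETL step might look like. This is not the actual `scripts/etl_clean_fields.py`; the `article_metadata` table, its columns, and the connection details are all hypothetical placeholders (the real schema is whatever lands in #113), but it shows the general shape: pull only the already-cleaned fields out of Solr in batches and write them into MAI's own database, after which the joins described upthread become ordinary SQL.

```python
import pymysql
import requests

SOLR_URL = 'http://localhost:8983/solr/mycore'  # placeholder core URL
BATCH = 500

# Hypothetical MAI database connection and target table.
db = pymysql.connect(host='localhost', user='mai', password='CHANGEME',
                     database='mai', charset='utf8mb4')

INSERT_SQL = """INSERT INTO article_metadata
                (solr_id, docsource, pub_date, publication, title, text)
                VALUES (%s, %s, %s, %s, %s, %s)"""

start = 0
while True:
    # Fetch only the already-cleaned fields, one batch at a time.
    resp = requests.get(SOLR_URL + '/select',
                        params={'q': '*:*',
                                'fl': 'id,DOCSOURCE,DATE,PUBLICATION,TITLE,TEXT',
                                'sort': 'id asc',
                                'start': start, 'rows': BATCH,
                                'wt': 'json'}).json()
    docs = resp['response']['docs']
    if not docs:
        break
    rows = [(d['id'], d.get('DOCSOURCE'), d.get('DATE'),
             d.get('PUBLICATION'), d.get('TITLE'), d.get('TEXT'))
            for d in docs]
    with db.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
    db.commit()
    start += BATCH
```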

So unless new information emerges, I'll get these tasks into separate issues and start in on the first one...

olderwoman commented 4 years ago

Correction: location = DATELINE, not BYLINE.

davidskalinder commented 4 years ago

> Correction: location = DATELINE, not BYLINE.

Yes, quite right -- thanks!
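
(For the location-cleanup task, the fallback could end up as simple as something like the hypothetical helper below. The field names come from the tables upthread; nothing like this exists in MAI yet, so treat it as a sketch only.)

```python
def article_location(doc):
    """Best-effort article location for one record: prefer the ProQuest-style
    'Location' field, fall back to the LexisNexis-style 'DATELINE'."""
    return doc.get('Location') or doc.get('DATELINE')
```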

davidskalinder commented 3 years ago

Reviving this epic by adding a few issues to it that should have been there in the first place. For people who can't see which issues are in a ZenHub epic, and to add some comments on priority, here's the status atm:

Open, high-priority, in order of necessary implementation I think:

- #71 (done in `enable_internal_article_storage`, waiting to PR)
- #113 (done in `add_clean_solr_fields`, waiting to PR)
- #111 (done in `etl_clean_fields`, waiting to PR)

Open, lower-priority, in descending priority order:

- #110
- #103
- #72
- #74

Closed:

- #61
- #67
- #70 (itself an epic containing several closed issues)
- #78 (done, now part of #113)
- #79 (done, now part of #111)

EDIT: Added #111.
EDIT: Moved #111 from top to bottom of the high-priority section.

davidskalinder commented 3 years ago

Now that #111 is to the point where I know the `etl_clean_fields` branch can be PR'd cleanly, I think I should put off the actual PR, since I'll likely build the full-text-pulling functionality on top of that branch. So I'll edit the priority list above to move #111 from the top to the bottom of the high-priority section.

davidskalinder commented 3 years ago

I just remembered something the BPP team discussed Monday that's worth noting here, both for @alexhanna's benefit and to remind me: we realized that since we'll probably want to add some more articles to our deployment in early January (or ASAP!), and given that we/you have recently practiced that procedure with the current setup, it's probably actually best not to deploy all the changes in this issue until after that batch of articles gets added.

It still might be worth accepting PRs into test deployments of course, but our thinking was that it's probably best not to muck with what's working until after it's no longer imminently critical. So when everything is done I'll deploy it to our testing instance and recommend that @alexhanna do something similar on her end, but I shouldn't pull either my feature branches or the accepted changes from upstream into our production deployment until after the new articles are in.

davidskalinder commented 3 years ago

All righty, the essential stuff for this should now all be built; next it should be deployed to a non-dev testing deployment and then PR'd (but not deployed to production until after the next article load is done).

davidskalinder commented 3 years ago

EDIT: Before doing any merges, add

    ## whether or not to store articles internally
    STORE_ARTICLES_INTERNALLY = True

... to config.py. (It's fine to set the value to False, but MAI chokes if the setting isn't there at all. Which is... fine? I guess?)
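
(If that hard failure ever becomes annoying, one sketch of a more forgiving read -- not what MAI currently does -- would be to fall back to the old Solr-backed behaviour when the setting is absent:)

```python
import config

# A missing setting falls back to False (i.e. keep using Solr) instead of
# raising an AttributeError at startup.
STORE_ARTICLES_INTERNALLY = getattr(config, 'STORE_ARTICLES_INTERNALLY', False)
```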

Then, branches that need merging to get this into the testing/live deployment:

- `add_clean_solr_fields`
- `etl_clean_fields`
- `enable_internal_article_usage`

EDIT: Also, `python scripts/etl_clean_fields.py` needs to be run (from the mpeds environment) after `etl_clean_fields` is merged.

davidskalinder commented 3 years ago

This all now looks good in the testing deployment.

So, after we load the next new batch of articles in early Jan, I'll PR these. I should include notes as follows:

- `add_clean_solr_fields`: Needs the ALTER TABLE from #113 (with any already-added fields removed).
- `etl_clean_fields`: Any time after accepting this, run `scripts/etl_clean_fields.py` from the mpeds environment to get all the stuff from Solr into MAI's DB (where, until the next step, it won't do anything).
- `enable_internal_article_usage`: Before making this live, add `STORE_ARTICLES_INTERNALLY = True` to config.py (or, to keep using Solr, use `STORE_ARTICLES_INTERNALLY = False`).

davidskalinder commented 3 years ago

And now I also need to PR `load_article_data_from_csv`, which includes `add_clean_solr_fields`. `load_article_data_from_csv` shouldn't do anything until setup.py is run, at which point config.py should include DOC_ROOT, DOC_FILE, and DOC_DBNAME so that it points at the correct import file.
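
(Hypothetical example values only -- the comments below are my reading of what each setting is for, and the real paths and names depend entirely on the deployment:)

```python
# config.py additions for the CSV article loader (hypothetical values)
DOC_ROOT = '/var/www/mai/data'   # directory holding the import file
DOC_FILE = 'new_articles.csv'    # the CSV to load
DOC_DBNAME = 'articles'          # label for the article database/source the loader writes to
```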

davidskalinder commented 3 years ago

@olderwoman, @matthewsmorgan, and @chaeyoonlim, I thought I had better call your attention to this one:

@alexhanna and I agreed that given Solr's most recent flakiness, the best thing to do in order to get new articles loaded will be simply to activate all the Solr-killing changes on the production deployment. I think that will be fine, but I don't think anybody except me has test-driven any of the new changes yet, so there is a slight risk of something ending up not to the team's liking.

However, all the changes should now be live in the testing deployment if you want to give them a look. I believe the only visible change is to the "Solr ID" that displays above the article -- in the new version, the entire ID is shown as it is in the database, whereas in the old version it's formatted differently and is missing the hash at the end. But it might not be a bad idea to have a poke at the testing deployment and make sure you don't spot anything wrong.

I'll probably be rolling everything out to production in the next few hours, so please give a shout if you see anything amiss (and TBH even if we miss something it might not be a disaster since we've only got two coders at this stage)...

olderwoman commented 3 years ago

I can take a quick look at this -- is it the same URL as the previous testing site?

davidskalinder commented 3 years ago

> I can take a quick look at this -- is it the same URL as the previous testing site?

Yep

olderwoman commented 3 years ago

I looked at this; it seems OK. Is there any specific functionality I should test?


davidskalinder commented 3 years ago

> I looked at this; it seems OK.

Okay cool.

> Is there any specific functionality I should test?

Not really, I don't think, since almost all the changes are under the hood. The usual checks -- making sure all the expected form controls are there and that they save information across page loads as expected -- should be all we need, just to make sure the changes haven't unexpectedly broken anything essential.

olderwoman commented 3 years ago

I didn't try to break it, but I did code two articles -- one with no events and another with one event -- and it seemed to work.


davidskalinder commented 3 years ago

> I didn't try to break it, but I did code two articles -- one with no events and another with one event -- and it seemed to work.

Okay, sounds promising. I think things are pretty much squared away for all this to happen, but I'm going to hold off on hitting the buttons for now: it'll be better to do it while I can monitor things for a few hours, I need to touch up a mostly-done ASA submission tomorrow, and the coders should still have at least several days of articles to go anyway. But hopefully this will be an easy roll-out, probably on Thursday.

davidskalinder commented 3 years ago

All right, here goes. My plan today is to:

In the at-this-point-pretty-likely event that nothing goes wrong, nobody will notice anything except a new hash string in the article headers. But I'll be on call today through the afternoon just in case.