Closed: davidskalinder closed this issue 3 years ago
Answers to some of these below
This would be a big change under the hood of MAI, so I'm mentioning the whole team here since I don't want to do this without consultation: @olderwoman, @matthewsmorgan, @chaeyoonlim, @johnklemke. Also I think deciding this is a fairly high priority, since it will affect how I implement a bunch of other stuff (like pressing issues #58, #56, #45, as well as slightly-less-pressing issues #2 and possibly #53); and I know some people will be out of town the next few Mondays so I thought it'd be a good idea to raise the issue here. So please let me know if you have opinions, or if I can help clarify anything...
I'm thinking of excising SOLR from MAI altogether. Not right away necessarily, but moving in that direction. @alexhanna confirms that SOLR is part of MAI mostly for legacy reasons and because it's useful for MPEDS. Below are the pros and cons I can think of.
Pro:
Con:
We risk having an extra version of the truth. In theory, the text of an article could change in SOLR and not in MAI.
Depending how we do it, we might lose some of the data stored in some of SOLR's more obscure fields.
Loading new articles would become more difficult because we'd need all the info for the articles rather than just the SOLR IDs (and optionally, article titles and publication dates) that we need now.
We'd have to worry slightly more about how many articles we store. Tens of thousands is almost certainly fine, millions might be less fine.

PAM: So this is because we’d have multiple copies of the same files or because the storage would be less efficient?
Implementation-wise, I think at the moment the only thing that actually uses SOLR at all is the controller's `loadSolr` function which loads up the article text! If that's right, then there wouldn't be very much to break with this change (in MAI itself anyway). There are also some bits (such as the controller's `convertIDToPublication` function) that chisel information out of the SOLR IDs and could be implemented better. It's harder to know about pass 2 without knowing exactly what input @olderwoman's Stata scripts expect, although I'm relatively confident that we can reproduce old inputs using information in MAI as well as we would be able to with information in SOLR?
PAM: My Stata scripts are accessible on gdelt and can be readily parsed for inputs expected.
- However 1: I don't think we know of any reason why these fields should ever change except due to error.
- However 2: There is already an extra version of the truth in the pass 2 DB, and even if we eliminate that there'll still be the files downloaded from the news services; so if the line of what counts as the Truth is being drawn arbitrarily anyway, we might as well draw it after MPEDS has finished using SOLR?
PAM: (1) FWIW we do have the original files we got from gigaword. They are annoying I think because I originally purchased the wrong version, so there is markup that Alex had to strip out. But we have it. Or should have it. It came on a hard drive and I’m pretty sure we made a copy of it somewhere, although I’d have to look for it. (2) The thing that isn’t easy for us to replicate is the results of MPEDS.
Yeah, that's what I meant by the downloaded files -- I'm pretty sure we still have them for all the article sources. So to clarify, my concern is that if anything downstream of those files somehow gets out of sync with them, then it could be difficult to find and reconcile what's changed (though I think my two howevers above mitigate this worry considerably).
- Depending how we do it, we might lose some of the data stored in some of SOLR's more obscure fields.
- However 1: We're not using most of these fields anyway (do we care about an article's `Number_of_pages`, `ProQuest_document_ID`, or `Accession_number`?). If we do need some, perhaps we could make a list?
- However 2: If we need everything, we could in theory load every field into a long-format table in MAI (with columns `id`, `field_name`, and `field_content` or some such); though this would make some joins a little more difficult (see the sketch just below).
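(To make the long-format idea concrete, here's a minimal runnable sketch of that table shape and of why the joins get awkward; the table name, columns, and sample rows are all just illustrative, and it uses sqlite3 only so the example runs anywhere rather than because that's what MAI would use:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MAI's MySQL database
conn.execute("""
    CREATE TABLE article_field (      -- one row per article per SOLR field
        id            TEXT,           -- SOLR ID of the article
        field_name    TEXT,           -- e.g. 'TITLE', 'DATE', 'Number_of_pages'
        field_content TEXT
    )
""")
conn.executemany(
    "INSERT INTO article_field VALUES (?, ?, ?)",
    [("solr-001", "TITLE", "Protest downtown"),
     ("solr-001", "PUBLICATION", "Chicago Citizen"),
     ("solr-001", "DATE", "1995-06-01")],
)

# The awkward bit: every field you want back as a column needs its own
# self-join (or a pivot) instead of a plain SELECT title, publication, ...
rows = conn.execute("""
    SELECT t.id, t.field_content AS title, p.field_content AS publication
    FROM article_field AS t
    JOIN article_field AS p ON p.id = t.id AND p.field_name = 'PUBLICATION'
    WHERE t.field_name = 'TITLE'
""").fetchall()
print(rows)   # [('solr-001', 'Protest downtown', 'Chicago Citizen')]
```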
PAM: The download information from Proquest probably should be retained. Each source of files will have a different set of fields, so the Gigaword files have different fields from the Proquest files.
Yes. In fact, there are hundreds of different fields currently stored in SOLR: I believe they vary not only from source to source but from article to article. A few, like publication date, title, text, and (I think) publication name are probably present for every article, but a lot of the fields probably only appear in a few articles.
Also, I think any version of what I'm proposing would still retain the SOLR ID in MAI -- we just wouldn't use it very much. But that would still give us a way to find our way back to any information that's in SOLR about any particular article. So I'm struggling to come up with a way that this issue would make it more difficult than it already is to trace an article back to all the other places it exists... I guess if we stop paying attention to SOLR because most of the stuff we need is in MAI and then the SOLR database gets erased? In which case we could still (albeit with a fair amount of work) search the download files for stuff like the article title and date and so on...
I guess on the other hand, if I don't move a bunch of info to MAI's database, then I'll have to build some more helper tools to query SOLR more efficiently. So there could be an opportunity cost of moving information to MAI and increasingly ignoring SOLR.
It would probably make sense to have a landing location for each input of files (Gigaword, Ethnic Newswatch, Proquest searches at this point) that are all files pre-MPEDS, then a storage location for the results of the MPEDS selection process (the subset of articles that go to MAI).
As it stands, the article information that we got from the news sources lives in: the downloaded files from the sources, the SOLR database that MPEDS uses, and the pass 2 database, all connected with extract-transform-loads. At the moment, MAI stores hardly anything about the articles -- the interface just calls up SOLR and asks for the article info that corresponds to a certain ID.
I'm suggesting that we change that, treat SOLR as a data provider (as we would treat Proquest or the GSS), and store what we need in MAI itself, treating that as our canonical data. So the new setup would be that info we get from the news sources lives in downloaded files from the sources, the SOLR database that MPEDS uses, the MAI database, and the pass 2 DB.
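(To make the shape of that change concrete: right now the controller asks SOLR for the article by ID; under the new setup it would read the same handful of fields from MAI's own database. A very rough sketch, with illustrative table and column names rather than MAI's actual code:)

```python
def load_article(db_conn, solr_id):
    """Look up the fields MAI actually displays from MAI's own article table
    instead of asking Solr. `db_conn` is any DB-API connection to MAI's
    database; the table and column names here are illustrative only."""
    cur = db_conn.cursor()
    cur.execute(
        "SELECT title, publication, pub_date, text "
        "FROM article_metadata WHERE solr_id = %s",
        (solr_id,),
    )
    row = cur.fetchone()
    if row is None:
        return None  # article was never loaded into MAI's DB
    title, publication, pub_date, text = row
    return {"title": title, "publication": publication,
            "date": pub_date, "text": text}
```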
We often need to do some rough comparisons of the original bank of stories and the MPEDS results for methodological documentation, and this is one thing it has been hard to generate.
I think this analysis must happen upstream of MAI anyway? So any changes to MAI or subsequent processes wouldn't change anything, unless I'm missing something?
- We'd have to worry slightly more about how many articles we store. Tens of thousands is almost certainly fine, millions might be less fine.
PAM: So this is because we’d have multiple copies of the same files or because the storage would be less efficient?
More like the second one. I suppose we would be using more disk space technically, but I don't think that's much of a worry. It's more that I don't know how fast MySQL could do things like sort millions of 5,000-word articles, whereas SOLR has lots of tricks to do stuff like that quickly.
It's harder to know about pass 2 without knowing exactly what input @olderwoman's Stata scripts expect, although I'm relatively confident that we can reproduce old inputs using information in MAI as well as we would be able to with information in SOLR?
PAM: My Stata scripts are accessible on gdelt and can be readily parsed for inputs expected.
True! Although occasionally the specifics of the fields they expect are a little hard to locate (for me, anyway!). But as I say, bringing some SOLR fields into MAI should make it easier to access more information (indeed this is the biggest motivation for it), so I'm pretty confident that we can reproduce the current inputs one way or another if we need. It's possible that the changes could result in annoyances like changed column names, for example, but we should be able to track these down or work around them.
I recognize there's a lot of complexities with this issue; maybe let me know if you want to set up a call (perhaps with an appropriate subteam) to discuss synchronously?
Outcome from team meeting: kill, baby, kill. We should use a long table, both to get as much as possible from SOLR and to stay flexible for the possibility of different fields in future article batches.
I'm leaving this issue open for now until details get moved into the right implementation threads.
I am just getting to reading this right now.
I think storing as a long table loses a lot of the join power you get from moving Solr to MySQL. Even if there are a number of different columns which appear for different publications in Solr, it makes sense to more firmly structure the MySQL schema for articles.
I do think the biggest problem with this will be size and memory. Again, I don't know how MySQL performs with big text blobs memory-wise. But given that we will be using a unique index to store Solr IDs, memory shouldn't (?) be an issue because we're not churning through big unstructured text blobs to find particular strings.
THAT SAID: @olderwoman, if you did want to do full-text searches like you have in the past to identify articles that mentioned particular words, Solr will be much better for that. MySQL is not built for that and it will likely slow the server down a lot.
I think storing as a long table loses a lot of the join power you get from moving Solr to MySQL. Even if there are a number of different columns which appear for different publications in Solr, it makes sense to more firmly structure the MySQL schema for articles.
So @olderwoman, having talked to @alexhanna, I'm coming around to her point of view on this. I think the bottom line is that there's only about a half dozen fields that MAI would likely ever need to know about an article, and there's a heckuva lot of junk fields in SOLR now that we'll probably never need (and can go back to the sources for if we do).
So the benefit of using a long table that can store everything is probably small; the cost is that querying a long table will be harder to code and less efficient (and indexing it to speed up queries would probably be prohibitively storage-intensive); this might, for instance, cause lags for coders when loading new articles, especially for databases with lots of articles.
I think I'm a little less worried about the downside than @alexhanna is, but I do see her point. I think a sensible way to proceed will be to get the (long) full list of fields in SOLR -- ideally with some counts of how many articles have them (though alas this makes the task much trickier) -- and to assess which ones we actually care about? I think that will give us a better picture of what we'd lose with a wide-table format that only keeps those important fields... So unless I hear otherwise I'll work on getting that list together so we can have a look at it.
THAT SAID: @olderwoman, if you did want to do full-text searches like you have in the past to identify articles that mentioned particular words, Solr will be much better for that. MySQL is not built for that and it will likely slow the server down a lot.
This is true, and I hadn't thought of it before. @olderwoman, I'm guessing that in the past you've used this kind of searching mainly to audit the haystack task? If so then I think making MAI independent of SOLR won't make any difference; but maybe I'm overlooking some value of being able to search article text?
There are a variety of reasons to search article text for non-MPEDS articles. I’ve used it for (a) checking the screening of articles for US locations, in which I checked a few hundred and found a low % of errors, all of them involving mentions of other countries that confused the sorter. (b) trying to figure out why the Cincinnati riot articles selected by MPEDS didn’t appear until April 12, verifying that the newswires DID cover the events earlier but did not use the words protest or riot, which is probably why they didn’t get selected. (c) checking to see whether events referred to retrospectively in our data really were not originally covered in the newswires. (Mostly they were not, sometimes they actually were and MPEDS missed them for one reason or another.)
I also occasionally search the newswires for other reasons, e.g. to check whether a Black protest event mentioned in a book was or was not covered by them.
I would expect to be doing similar searches in the Black newspapers as the project moves forward.
There are a variety of reasons to search article text for non-MPEDS articles. I’ve used it for (a) checking the screening of articles for US locations, in which I checked a few hundred and found a low % of errors, all of them involving mentions of other countries that confused the sorter. (b) trying to figure out why the Cincinnati riot articles selected by MPEDS didn’t appear until April 12, verifying that the newswires DID cover the events earlier but did not use the words protest or riot, which is probably why they didn’t get selected. (c) checking to see whether events referred to retrospectively in our data really were not originally covered in the newswires. (Mostly they were not, sometimes they actually were and MPEDS missed them for one reason or another.) I also occasionally search the newswires for other reasons, e.g. to check whether a Black protest event mentioned in a book was or was not covered by them. I would expect to be doing similar searches in the Black newspapers as the project moves forward.
Hmm, so it sounds like b) and c) are instances of checking MPEDS's haystack performance? And so it makes sense for them to use MPEDS's data source (SOLR) to do them? But it sounds like a) might be to audit something that happened in MAI; and it's hard to tell for other searches.
In the past, have you done this kind of searching using the SOLR admin interface? If text search really is a high priority for the coding work that we do with MAI (as opposed to the haystack work that we do with MPEDS), then maybe moving MAI away from SOLR isn't a good idea; although it sounds to me like most of these text-searching tasks are somewhat peripheral to MAI?
Let’s see, yes b & c check the haystack. a) checked the geolocation routine, which was something Alex did between MPEDS and putting things into MAI, I believe. She first sent me a spreadsheet of ALL MPEDS-selected articles then, a couple days later, a subset that were geolocated to the US, to reduce the large # of international-only stories to wade through. I looked up a sample of articles that were in the original spreadsheet but not the second spreadsheet. Yes most of my other text searching is peripheral to MAI and more about taking advantage of the fact that I have a large cache of news sources to check facts in. This last is an intermittent task, so having the searches be relatively slow or inefficient would not be a big problem I don’t think.
Let’s see, yes b & c check the haystack. a) checked the geolocation routine, which was something Alex did between MPEDS and putting things into MAI, I believe. She first sent me a spreadsheet of ALL MPEDS-selected articles then, a couple days later, a subset that were geolocated to the US, to reduce the large # of international-only stories to wade through. I looked up a sample of articles that were in the original spreadsheet but not the second spreadsheet. Yes most of my other text searching is peripheral to MAI and more about taking advantage of the fact that I have a large cache of news sources to check facts in. This last is an intermittent task, so having the searches be relatively slow or inefficient would not be a big problem I don’t think.
Okay, so it sounds like most of the important uses of text search are MPEDS-adjacent and so can happily live in whatever universe MPEDS lives in without staying coupled to the MAI universe, huh...
Just one point to clarify: my (and, I think, @alexhanna's) worry about slow/inefficient querying isn't so much for the text search functionality (which would be "slowed down" by requiring the user to visit the SOLR interface, or the source files); it's that putting whole articles and/or lots of junk article fields into a (long-format, especially) MAI table might slow down everything MAI does with those articles (such as display them to coders). Which would obviously be bad.
So it sounds like we should continue with our plan for, ahem, SOLRicide, but review the fields in SOLR to see if we can avoid a long-format table by surviving without most of them...
One possibility re multiple fields is to list the TYPES of fields we need and map source fields into our fields as part of the inputting process
Let’s see, yes b & c check the haystack. a) checked the geolocation routine, which was something Alex did between MPEDS and putting things into MAI, I believe. She first sent me a spreadsheet of ALL MPEDS-selected articles then, a couple days later, a subset that were geolocated to the US, to reduce the large # of international-only stories to wade through. I looked up a sample of articles that were in the original spreadsheet but not the second spreadsheet. Yes most of my other text searching is peripheral to MAI and more about taking advantage of the fact that I have a large cache of news sources to check facts in. This last is an intermittent task, so having the searches be relatively slow or inefficient would not be a big problem I don’t think.
Okay, so it sounds like most of the important uses of text search are MPEDS-adjacent and so can happily live in whatever universe MPEDS lives in without staying coupled to the MAI universe, huh...
Just one point to clarify: my (and, I think, @alexhannahttps://github.com/alexhanna's) worry about slow/inefficient querying isn't so much for the text search functionality (which would be "slowed down" by requiring the user to visit the SOLR interface, or the source files); it's that putting whole articles and/or lots of junk article fields into a (long-format, especially) MAI table might slow down everything MAI does with those articles (such as display them to coders). Which would obviously be bad.
So it sounds like we should continue with our plan for, ahem, SOLRicide, but review the fields in SOLR to see if we can avoid a long-format table by surviving without most of them...
One possibility re multiple fields is to list the TYPES of fields we need and map source fields into our fields as part of the inputting process
Yes, I think this is more or less how it'll be (and indeed I suspect @alexhanna's scripts are doing a bit of this already to load the source files into SOLR). Hopefully the field list will clarify.
That list is somewhat short. Only the metadata which is guaranteed to be in Solr, which is listed here.
That list is somewhat short. Only the metadata which is guaranteed to be in Solr, which is listed here.
Hah, yep, that list is powerful short. I think the issue is whether there's other stuff that we can or should be using (which we don't know about until we review the (long) longlist of fields)? It would be just peachy if we really only need those six (plus text) though...
That list is somewhat short. Only the metadata which is guaranteed to be in Solr, which is listed here.
Hmm, so @alexhanna, I'm finally to the point where I can actually see some data for this, and I notice that several of those fields frequently don't appear -- looking just at my little test dataset of 19 articles about Wisconsin, none of them contain `SECTION`, `BYLINE`, or `DATELINE`; and now that I think about it, several fields that every article must have (such as `TITLE` and `TEXT`) aren't in the linked list. Might that list have changed, or maybe there's another bit of code somewhere that guarantees that certain fields show up in SOLR?
All right, @olderwoman, @alexhanna, @matthewsmorgan, @limchaeyoon, @johnklemke: I've now got a bunch of other issues to the point where they can provide some data for this question. Below is a list of the fields in use for the 6453 articles currently in our production deployment. (Note that as for the articles mentioned above, none of these contain `DATELINE`, `BYLINE`, or `SECTION`, although some articles do have a field called `Section`.)
Also, I dumped everything SOLR had stored for these articles to a CSV file, which is at `gdelt/Skalinder/SOLR_exports/solr_output_2020-04-07_162312.csv`.
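(For anyone who wants to reproduce a dump like this, something along these lines should work against SOLR's standard select handler; the URL, core name, and output filename below are placeholders rather than our real ones:)

```python
import csv
import requests

# Placeholder Solr URL and core name -- not the real ones.
SOLR_SELECT = "http://localhost:8983/solr/articles/select"

# Ask Solr for every stored field of every document; each doc comes back as a
# dict containing only the fields it actually has (multi-valued fields come
# back as lists, which is fine for eyeballing).
resp = requests.get(SOLR_SELECT, params={"q": "*:*", "rows": 10000, "wt": "json"})
docs = resp.json()["response"]["docs"]

# The union of all field names across documents becomes the CSV header.
fields = sorted({name for doc in docs for name in doc})

with open("solr_output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(docs)
```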
So I think the object of the game here is to try to ensure that we won't need more than a small number of these fields for MAI or subsequent work (although of course everything will still be stored somewhere to audit MPEDS's article selection). I haven't really looked at the substance of any of this yet, but I'll try to do that soon and make a recommendation. In the meantime, recommendations from y'all are welcome!
Here's the current field usage for our 6453 articles in production:
Nonmissing entries in each column:
DATE 6453
INTERNAL_ID 6453
PUBLICATION 6453
TEXT 6453
TITLE 6453
Abstract 6453
Accession_number 2485
Author 5501
Copyright 6453
Country_of_publication 6453
DOCSOURCE 6453
Database 6453
Document_URL 6453
Document_feature 2483
Document_type 6453
Ethnicity 6107
ISSN 4363
Issue 6017
Language_of_publication 6453
Last_updated 6453
Links 6317
Location 4773
Number_of_pages 6037
People 3199
Place_of_publication 6453
ProQuest_document_ID 6453
Publication_date 6453
Publication_subject 6453
Publication_year 6453
Publisher 6453
Section 1373
Source_type 6453
Subject 5998
Volume 6015
Year 6453
_version_ 6453
id 6453
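(Counts like these are easy to re-derive from the CSV dump with a couple of lines of pandas, assuming the CSV's columns are the SOLR field names:)

```python
import pandas as pd

# Count non-empty entries per SOLR field in the dump.
df = pd.read_csv("solr_output_2020-04-07_162312.csv", dtype=str)
print(df.notna().sum().sort_values(ascending=False).to_string())
```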
Of course let me know if anything needs clarification...
I’ve got to do some other stuff today and then have several meetings tomorrow so may not make a lot of progress on this until Thursday. OK to complain if I have not responded by Thursday April 9.
I'm not entirely sure what's the best way to compile opinions on this, so I'll post this here for now but anyone can feel free to suggest a better location. Having looked at the fields, I think the following almost certainly should be kept through post-MPEDS steps:
DATE
Document_URL
PUBLICATION
TEXT
TITLE
id
I think the following can almost certainly be abandoned:
Abstract
Accession_number
Author
Country_of_publication
Document_feature
Ethnicity
ISSN
Issue
Language_of_publication
Number_of_pages
Publication_date
Publication_subject
Publication_year
Section
Source_type
Subject
Volume
Year
And I'm not sure about the following ones -- either I need to check more carefully to see whether they are duplicated elsewhere or I can't tell whether they add any value to post-MPEDS processes:
Publisher
Copyright
DOCSOURCE
Database
Document_type
INTERNAL_ID
Last_updated
Links
Location
People
Place_of_publication
ProQuest_document_ID
_version_
Incidentally, it seems that `Publisher` sometimes disagrees with what's before the semicolon in `PUBLICATION`, which seems odd; I'll try to find the code (probably here? https://github.com/alexhanna/news-parsers/blob/master/proquest.py) that builds the `PUBLICATION` field later to make sure it did it correctly (unless @alexhanna can straighten me out?)...
See comments below. In short, we need more carefully to review exactly what is in these fields from different sources.
I'm not entirely sure what's the best way to compile opinions on this, so I'll post this here for now but anyone can feel free to suggest a better location. Having looked at the fields, I think the following almost certainly should be kept through post-MPEDS steps:
DATE // is this publication date?
Document_URL
PUBLICATION
TEXT
TITLE
id
I think the following can almost certainly be abandoned:
Abstract
Accession_number // This would be the pointer back to the data source, possibly
Author
Country_of_publication
Document_feature
Ethnicity
ISSN // check what this is, may be relevant
Issue // part of citation for some sources as in volume:issue
Language_of_publication
Number_of_pages
Publication_date // this is needed
Publication_subject
Publication_year
Section // this is part of a newspaper citations
Source_type // need to see what these are
Subject
Volume // part of a citation
Year
And I'm not sure about the following ones -- either I need to check more carefully to see whether they are duplicated elsewhere or I can't tell whether they add any value to post-MPEDS processes:
Publisher
Copyright
DOCSOURCE // what are the values, may be needed
Database // we need this
Document_type
INTERNAL_ID // this is probably needed
Last_updated // possibly part of the citation
Links
Location // in some cases may be the dateline, may be needed
People
Place_of_publication // needed if not part of the citation
ProQuest_document_ID // this is definitely needed to document the file
_version_
Tl;dr: Based on feedback from @alexhanna and @olderwoman, I'm starting to think that for now at least, we shouldn't worry much about any SOLR fields that MAI doesn't use already.
Non-tl;dr version:
Talking to @alexhanna this morning, as usual, helped clarify this. Basically, there are very few fields that are guaranteed to be in SOLR. (It's not clear exactly which ones, but it's close to all the items in all-caps in the lists above.) Any other fields might vary by data source. Fundamentally, I think that means that MAI and any subsequent processes can't rely on anything except those fields being present (unless we do a great deal of cleaning of SOLR's several million records).
Also, reading through @olderwoman's responses, I think I'm getting a clearer idea of things we'd like to do with whatever data we carry beyond SOLR. Here's the list as I see it:

1. Ensure that MAI and subsequent passes have the data necessary to code and analyze
2. Be able to find any article in SOLR
3. Be able to find any article within the files that were loaded into SOLR
4. Be able to request any article from an external database provider (ProQuest/LexisNexis/etc.)
5. Be able to request any article from the original publication
6. Be able to cite any article within an academic bibliography
Goal 1 is actually relatively easy, since at the moment MAI only uses a small number of SOLR's fields, and there aren't many more that we want to start using. Goal 2 is satisfied by SOLR's `id` field (which subsequent processes call `SOLR_ID`). I think goal 3 is satisfied, as well as it can be, by the `DOCSOURCE` and `INTERNAL_ID` fields (and in cases where these are ambiguous, searches of the `TITLE` or even `TEXT` content would probably be sufficient, and probably as good as we can do).
Goals 4, 5, and 6 seem trickier because they all require information that isn't stored in SOLR consistently. (A notable example is the `EDITION` field, which seems to distinguish different versions of articles in the New York Times but doesn't exist for any other publication; also, of course, every data source will likely have unique, and possibly unique numbers of, ways to identify any particular article.) Having said that, we should recall that at the moment, MAI and (I think?) subsequent processes can't accomplish goals 3, 4, 5, or 6 either: the only way to get information like this currently is to go back to upstream sources.
I think this means that for this issue (which is, or at least should be, to replicate current functionality without SOLR), we should focus on goals 1, 2, and maybe 3 to be on the safe side; goals 4, 5, and 6 seem like new feature requests and therefore should be treated as separate issues?
I also have line-by-line comments on @olderwoman's responses above, which I'll post in a separate comment.
Specific replies to the points raised by @olderwoman:
See comments below. In short, we need more carefully to review exactly what is in these fields from different sources.
Just to be clear, my recommendations here were based on looking at the articles currently in our production version (which look like they came from Proquest?). But as noted above, most of these fields might be completely different in articles from different sources.
DATE // is this publication date?
It seems to be a cleaned-up version of publication date, yes.
Accession_number // This would be the pointer back to the data source, possibly
Only about a third of the articles have this field, so I doubt it; and also, other fields (such as `Document_URL` and `ProQuest_document_ID`) seem to be unique identifiers for the articles. Probably a good idea to find out what "accession number" means in order to be sure though.
ISSN // check what this is, may be relevant
This is missing for about two-thirds of the articles, and for the rest it looks like each publication has its own ISSN, so hundreds of articles all have the same number; so I assume it's not doing us much good in the database, but of course others might know better...
Issue // part of citation for some sources as in volume:issue
Ah, yes, if we want to keep all the citation information then perhaps we should keep this. (EDIT: but see https://github.com/davidskalinder/mpeds-coder/issues/59#issuecomment-611150228)
Publication_date // this is needed
Is there a reason why we would need both this and the more-consistently-formatted `DATE` field with the same information?
Section // this is part of a newspaper citations
Is this for ASA style, or some other style? I can't find anything about sections in my ASA 4th ed., and I don't recall needing these for other styles, but you certainly know better than I. These are frequently missing and often slightly silly, FWIW (e.g., "BACK TO SCHOOL", "BILL'S BUSINESS", or often simply "NEWS").
Source_type // need to see what these are
For this particular dataset, it's the word "Newspapers" for all 6,453 entries.
Volume // part of a citation
My reading of ASA, CMOS, and APA styles is that volume is only used for journals? But again, happy to defer to you on this stuff.
DOCSOURCE // what are the values, may be needed
Check the CSV for the full list, but the values are things like these:
Proquest (ethnic_newswatch_publication_files/Ethnic Newswatch Publication Files/Ethnic News Hyde Park 1991-2015 1-1000.txt)
Proquest (ethnic_newswatch_publication_files/Ethnic Newswatch Publication Files/EthnicNews Chicago Citizen 1991-2015 2001-2700.txt)
These seem to be the names of the files that were loaded into SOLR. There are between like 10 and 120ish articles from each of these source files.
Database // we need this
What made me hesitate is that this is simply "Ethnic NewsWatch" for all 6,453 articles. But if we're not storing this info elsewhere then yeah maybe we need to keep this field.
INTERNAL_ID // this is probably needed
`INTERNAL_ID` is almost the same as `id` but with an integer on the end instead of a hash. (`id` is what is elsewhere called `SOLR_ID`.) @alexhanna reports that the integer is a not-very-consistent identifier within the source file, usually something like the 'x' in the source files' "Article x of 500" lines.
Last_updated // possibly part of the citation
I'm guessing this indicates when it was updated in some database or other, since 1,905 articles all have the same value. However some of the dates are in 2010, so I doubt it's for any of our systems. Maybe this is when it was updated in ProQuest?
Location // in some cases may be the dateline, may be needed

People
Yeah I wasn't sure whether either or both of these would come in handy. But FWIW, I don't think either of these have ever been accessible except via a custom query using SOLR's web interface. Also @alexhanna indicates that fields like these are often inconsistent and might be totally different or missing in different data sources.
Place_of_publication // needed if not part of the citation
Not sure what you mean by citation, but for these articles at least, `Place_of_publication` is always identical to what's after the semicolon in `PUBLICATION` (though I don't know whether this will be true for other data sources). Again, I don't think this field has been in MAI or in pass 2 up to now.
ProQuest_document_ID // this is definitely needed to document the file
In these articles, this ID is always part of the `Document_URL` field; since that field also includes other information, I recommended we keep that instead. However other data sources probably include neither a `ProQuest_document_ID` nor a `Document_URL` field.
Basically what we need is a datafile that has all fields minus the full text so we can figure out which sources use which fields and what is in them. There is no way to figure this out just by looking at frequencies.
Basically what we need is a datafile that has all fields minus the full text so we can figure out which sources use which fields and what is in them. There is no way to figure this out just by looking at frequencies.
Uh, yes, sorry -- from my comment yesterday:
Also, I dumped everything SOLR had stored for these articles to a CSV file, which is at `gdelt/Skalinder/SOLR_exports/solr_output_2020-04-07_162312.csv`.
I guess let me know if you're having trouble getting to that file and I can put a copy someplace more convenient? (It's about 38MB, so unfortunately email probably won't work.)
Ah, good. I’ll review that tomorrow or Friday. Bug me if I have not done it by midday Friday.
I just realized it would be easy and valuable to also produce a CSV and field counts for @alexhanna's campus protest articles to get a sense of how different SOLR's contents are for different data sources (spoiler alert: very).
So there's now a new file at `gdelt/Skalinder/SOLR_exports/campus_protest_solr_output_2020-04-09_105647.csv` with all the SOLR fields for the campus protest articles.
The count of non-empty fields in both databases is as follows:
field | bpp | campus_protest |
---|---|---|
DATE | 6453 | 9216 |
INTERNAL_ID | 6453 | 6667 |
PUBLICATION | 6453 | 9216 |
TEXT | 6453 | 9216 |
TITLE | 6453 | 9216 |
Abstract | 6453 | |
Accession_number | 2485 | |
Author | 5501 | |
Copyright | 6453 | |
Country_of_publication | 6453 | |
DOCSOURCE | 6453 | 9216 |
Database | 6453 | 9166 |
Document_URL | 6453 | |
Document_feature | 2483 | |
Document_type | 6453 | |
Ethnicity | 6107 | |
ISSN | 4363 | |
Issue | 6017 | |
Language_of_publication | 6453 | |
Last_updated | 6453 | |
Links | 6317 | |
Location | 4773 | |
Number_of_pages | 6037 | |
People | 3199 | |
Place_of_publication | 6453 | |
ProQuest_document_ID | 6453 | |
Publication_date | 6453 | |
Publication_subject | 6453 | |
Publication_year | 6453 | |
Publisher | 6453 | |
Section | 1373 | |
Source_type | 6453 | |
Subject | 5998 | |
Volume | 6015 | |
Year | 6453 | |
_version_ | 6453 | 9216 |
id | 6453 | 9216 |
BYLINE | | 6107 |
DATELINE | | 40 |
SECTION | | 6667 |
A | | 1 |
AM | | 1 |
DOCID | | 50 |
LANGUAGE | | 6667 |
LENGTH | | 6667 |
LOAD-DATE | | 6667 |
PUBLICATION-TYPE | | 6667 |
Q | | 1 |
URL | | 2499 |
So as you can see, not a huge amount of overlap...
Yes, this is helpful. Thanks.
Just for background -- 6,667 of the campus protest articles are from Lexis-Nexis, and 2,549 are scraped from the web.
Another new database's worth: this time, from the black-newswires deployment:
field | bpp | campus_protest | black-newswires |
---|---|---|---|
DATE | 6453 | 9216 | 2471 |
INTERNAL_ID | 6453 | 6667 | |
PUBLICATION | 6453 | 9216 | 2471 |
TEXT | 6453 | 9216 | 2471 |
TITLE | 6453 | 9216 | 2471 |
Abstract | 6453 | ||
Accession_number | 2485 | ||
Author | 5501 | ||
Copyright | 6453 | ||
Country_of_publication | 6453 | ||
DOCSOURCE | 6453 | 9216 | 2471 |
Database | 6453 | 9166 | |
Document_URL | 6453 | ||
Document_feature | 2483 | ||
Document_type | 6453 | ||
Ethnicity | 6107 | ||
ISSN | 4363 | ||
Issue | 6017 | ||
Language_of_publication | 6453 | ||
Last_updated | 6453 | ||
Links | 6317 | ||
Location | 4773 | ||
Number_of_pages | 6037 | ||
People | 3199 | ||
Place_of_publication | 6453 | ||
ProQuest_document_ID | 6453 | ||
Publication_date | 6453 | ||
Publication_subject | 6453 | ||
Publication_year | 6453 | ||
Publisher | 6453 | ||
Section | 1373 | ||
Source_type | 6453 | ||
Subject | 5998 | ||
Volume | 6015 | ||
Year | 6453 | ||
_version_ | 6453 | 9216 | 2471 |
id | 6453 | 9216 | 2471 |
BYLINE | | 6107 | |
DATELINE | | 40 | 2088 |
SECTION | | 6667 | |
A | | 1 | |
AM | | 1 | |
DOCID | | 50 | 2471 |
LANGUAGE | | 6667 | |
LENGTH | | 6667 | |
LOAD-DATE | | 6667 | |
PUBLICATION-TYPE | | 6667 | |
Q | | 1 | |
URL | | 2499 | |
And another update, now with data from the `bp` deployment (which might have had data from the 'boycott error' era). Note that I've sorted the fields alphabetically.
As the sage @alexhanna hath foretold, the only fields common to all of these sources are `_version_`, `DATE`, `DOCSOURCE`, `id`, `PUBLICATION`, `TEXT`, and `TITLE`. (In fact, every article has all of these except for a few that are missing `TITLE` for some reason.)
field | bpp | campus_protest | black-newswires | bp |
---|---|---|---|---|
_version_ | 6453 | 9216 | 2471 | 34141 |
A | | | | 44 |
A | | 1 | | |
AA | | | | 1 |
Abstract | 6453 | | | |
Accession_number | 2485 | | | |
AGE | | | | 2 |
AM | | 1 | | |
ATWATER | | | | 1 |
AU | | | | 1 |
Author | 5501 | | | |
BOESAK | | | | 1 |
BROWN | | | | 1 |
BUSH | | | | 5 |
BYLINE | | | | 13289 |
BYLINE | | 6107 | | |
CLINTON | | | | 4 |
Copyright | 6453 | | | |
CORNYN | | | | 1 |
CORRECTION | | | | 6 |
Country_of_publication | 6453 | | | |
Database | 6453 | 9166 | | |
DATE | 6453 | 9216 | 2471 | 34141 |
DATELINE | | 40 | 2088 | 20000 |
DEMOCRATS | | | | 1 |
DISTRIBUTION | | | | 112 |
DOCID | | 50 | 2471 | 14710 |
DOCSOURCE | 6453 | 9216 | 2471 | 34141 |
Document_feature | 2483 | | | |
Document_type | 6453 | | | |
Document_URL | 6453 | | | |
DOCUMENT-TYPE | | | | 55 |
DURBIN | | | | 1 |
EDITION | | | | 9881 |
Ethnicity | 6107 | | | |
FANG | | | | 1 |
GONZALES | | | | 1 |
GRAHAM | | | | 1 |
GRAPHIC | | | | 2303 |
HIGHLIGHT | | | | 15 |
id | 6453 | 9216 | 2471 | 34141 |
INTERNAL_ID | 6453 | 6667 | | 19431 |
IS_DOCA | | | | 6880 |
ISSN | 4363 | | | |
Issue | 6017 | | | |
LANGUAGE | | 6667 | | 11328 |
Language_of_publication | 6453 | | | |
Last_updated | 6453 | | | |
LEAD | | | | 1824 |
LENGTH | | 6667 | | 18278 |
Links | 6317 | | | |
LOAD-DATE | | 6667 | | 5714 |
Location | 4773 | | | |
MARYLAND | | | | 1 |
Number_of_pages | 6037 | | | |
People | 3199 | | | |
PEROT | | | | 4 |
PHOTO | | | | 1 |
Place_of_publication | 6453 | | | |
PRINT_COLUMN | | | | 6874 |
PRINT_PAGE_NUMBER | | | | 6880 |
PRINT_SECTION | | | | 6874 |
ProQuest_document_ID | 6453 | | | |
PUBLICATION | 6453 | 9216 | 2471 | 34141 |
Publication_date | 6453 | | | |
Publication_subject | 6453 | | | |
Publication_year | 6453 | | | |
PUBLICATION-TYPE | | 6667 | | 2191 |
Publisher | 6453 | | | |
Q | | 1 | | 48 |
QUESTION | | | | 1 |
RE | | | | 9 |
Section | 1373 | | | |
SECTION | | 6667 | | 18141 |
SERIES | | | | 229 |
SOURCE | | | | 846 |
Source_type | 6453 | | | |
SPECTER | | | | 1 |
Subject | 5998 | | | |
TEXT | 6453 | 9216 | 2471 | 34140 |
TITLE | 6453 | 9216 | 2471 | 33925 |
TV | | | | 1 |
TYPE | | | | 2963 |
URL | | 2499 | | 6880 |
VIRGINIA | | | | 1 |
Volume | 6015 | | | |
Year | 6453 | | | |
ZOGBY | | | | 1 |
This is a little sketchy without the tabs; can you put a tab delimited version into a spreadsheet of some sort and put it either on gdelt or in the Google drive (I’ve created a folder inside technical documentation)? I’m afraid it won’t parse right from this email.
This is a little sketchy without the tabs; can you put a tab delimited version into a spreadsheet of some sort and put it either on gdelt or in the Google drive (I’ve created a folder inside technical documentation)? I’m afraid it won’t parse right from this email.
Yeah, it'll format correctly if you view the thread in a browser; however you're right that even there it's getting a bit ungainly. So I've put it into a spreadsheet in the Google Drive in Technical documentation > MAI > `SOLR_field_usage`.
thanks
Thanks @olderwoman! I've copied your response to the issue thread so I can find it later:
I have reviewed all the SOLR fields. David copied all the fields & their counts into a spreadsheet and I have annotated the spreadsheet with what the fields are and whether they are needed. Note that some fields have comparable information for different sources, e.g. location has the same info as dateline. I did NOT review the campus protest info; it is not in the bp data set. The bp dataset looks like what was being used for the core MPEDS analysis, includes LDC (the NYT database), articles selected from various news sources from Lexis-Nexis to test MPEDS, and the full Gigaword dataset. Ethnic newswatch articles are in bpp. The black-newswires is a subset I’m guessing of bp; perhaps we can test this theory. But I think black-newswires is what was sent to the black-newswires MAI instance.
(The spreadsheet is the one at Technical documentation > MAI > `SOLR_field_usage`.)
So: thanks @olderwoman, super helpful! I think we should probably discuss further on Monday? I'd say my comment upthread at https://github.com/davidskalinder/mpeds-coder/issues/59#issuecomment-611150228 (NB you might have to view the thread in a web browser to see the link) still represents the crux of my thinking on this, so it'd be good to discuss that plan and how the existing fields fit into it...
A summary of the discussion from this morning's meeting:
The goals I came up with in my comment above seem to be a good list, with the added consideration that items 4-6 should be able to be performed by non-BPP recipients of the dataset:
1. Ensure that MAI and subsequent passes have the data necessary to code and analyze
2. Be able to find any article in SOLR
3. Be able to find any article within the files that were loaded into SOLR
4. Be able to request any article from an external database provider (ProQuest/LexisNexis/etc.)
5. Be able to request any article from the original publication
6. Be able to cite any article within an academic bibliography
Based on my and @olderwoman's review of the existing SOLR fields, it's clear that there are about a half-dozen fields that have already been cleaned by @alexhanna's scripts when they were loaded into SOLR, and most other SOLR fields are of varying value and are stored inconsistently. One particular piece of inconsistently-stored information stands out as being especially valuable for coding: article location (which is sometimes stored in `Location` and sometimes in `BYLINE`, at least).
This suggests three major tasks (and some subtasks) with a clear priority order:
Getting the first of these major tasks done should mostly cover goals 1-3 in the list of goals above. The second major task will provide a high-value boost to goal 1 (by adding some important functionality to the coding interface). The last major task will mostly focus on goals 4-6 above, which will be important down the road but so far have not been doable in any systems downstream from SOLR (and so will stay that way for the time being).
So unless new information emerges, I'll get these tasks into separate issues and start in on the first one...
Correction: location = DATELINE not Byline.
Correction: location = DATELINE not Byline.
Yes, quite right -- thanks!
Reviving this epic by adding a few issues to it that should have been there in the first place. For people who can't see which issues are in a ZenHub epic, and to add some comments on priority, here's the status atm:
Open, high-priority, in order of necessary implementation I think:
- `enable_internal_article_storage` (waiting to PR)
- `add_clean_solr_fields` (waiting to PR)
- `etl_clean_fields` (waiting to PR)

Open, lower-priority, in descending priority order:
Closed:
EDIT: Added #111. EDIT: Moved #111 from top to bottom of high-priority section.
Now that #111 is to the point where I know that the `etl_clean_fields` branch can be PR'd cleanly, I think I should put off the actual PR, since I'll likely build the full-text-pulling functionality on top of that branch. So I'll edit the priority list above to move #111 from the top to the bottom of the high-priority section.
I just remembered something the BPP team discussed Monday that's worth noting here, both for @alexhanna's benefit and to remind me: we realized that since we'll probably want to add some more articles to our deployment in early January (or ASAP!), and given that we/you have recently practiced that procedure with the current setup, it's probably actually best not to deploy all the changes in this issue until after that batch of articles gets added.
It still might be worth accepting PRs into test deployments of course, but our thinking was that it's probably best not to muck with what's working until after it's no longer imminently critical. So when everything is done I'll deploy it to our testing instance and recommend that @alexhanna do something similar on her end, but I shouldn't pull either my feature branches or the accepted changes from upstream into our production deployment until after the new articles are in.
All righty, the essential stuff for this should now all be built and should be deployed to a non-dev testing deployment and then PR'd (but not deployed to production until after the next article load is done).
EDIT: Before doing any merges, add
## whether or not to store articles internally
STORE_ARTICLES_INTERNALLY = True
... to `config.py`. (It's fine to set the value to `False`, but it chokes if the thing isn't there at all. Which is... fine? I guess?)
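(If we wanted to make the missing-setting case less brittle, one option would be to read the flag with a default instead of assuming it exists; this is just a sketch of the idea, not what the code currently does, and it assumes the setting is read off an imported `config` module:)

```python
import config

# Hypothetical guard: fall back to the old Solr-backed behaviour when the
# setting is absent, so an older config.py doesn't crash the app.
STORE_ARTICLES_INTERNALLY = getattr(config, "STORE_ARTICLES_INTERNALLY", False)
```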
Then, branches that need merging to get this into the testing/live deployment:

- `add_clean_solr_fields`
- `etl_clean_fields`
- `enable_internal_article_usage`

EDIT: Also, `python scripts/etl_clean_fields.py` needs to be run (from the mpeds environment) after `etl_clean_fields` is merged.
This all now looks good in the testing deployment.
So, after we load the next new batch of articles in early Jan, I'll PR these. I should include notes as follows:
- `add_clean_solr_fields`: Needs the `ALTER TABLE` from #113 (with any already-added fields removed)
- `etl_clean_fields`: Any time after accepting this, run `scripts/etl_clean_fields.py` from the mpeds environment to get all the stuff from Solr into MAI's DB (where, until the next step, it won't do anything).
- `enable_internal_article_usage`: Before making this live, add `STORE_ARTICLES_INTERNALLY = True` to `config.py` (or, to keep using Solr, use `STORE_ARTICLES_INTERNALLY = False`)
And now also need to PR `load_article_data_from_csv`, which includes `add_clean_solr_fields`. `load_article_data_from_csv` shouldn't do anything until `setup.py` is run, at which point `config.py` should include `DOC_ROOT`, `DOC_FILE`, and `DOC_DBNAME` in order to point at the correct import file.
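(For reference, those three settings would look something like the following in `config.py`; the values, and my glosses on what each one points to, are placeholders and guesses rather than the real paths:)

```python
## settings read when setup.py runs load_article_data_from_csv -- the values
## (and the glosses) below are placeholders, not the real paths
DOC_ROOT = "/path/to/article_exports/"   # directory holding the import file
DOC_FILE = "new_articles.csv"            # the CSV to load
DOC_DBNAME = "mai_production"            # which deployment's DB it feeds
```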
@olderwoman, @matthewsmorgan, and @limchaeyoon, I thought I had better call your attention to this one:
@alexhanna and I agreed that given Solr's most recent flakiness, the best thing to do in order to get new articles loaded will be simply to activate all the Solr-killing changes on the production deployment. I think that will be fine, but I don't think anybody except me has test-driven any of the new changes yet, so there is a slight risk of something ending up not to the team's liking.
However all the changes should be live now in the testing deployment if you want to give them a look. I believe that actually the only visible change is to the "Solr ID" that displays above the article -- in the new version, the entire ID is shown as it is in the database, whereas in the old version it's formatted differently and is missing the hash at the end. But it might not be a bad idea to have a poke at the testing deployment and make sure you don't spot anything wrong.
I'll probably be rolling everything out to production in the next few hours, so please give a shout if you see anything amiss (and TBH even if we miss something it might not be a disaster since we've only got two coders at this stage)...
I can take a quick look at this, is it the same url as the previous testing site?
I can take a quick look at this, is it the same url as the previous testing site?
Yep
I looked at this, it seems ok. Is there any functionality specifically I should test?
I looked at this, it seems ok.
Okay cool.
Is there any functionality specifically I should test?
Not really I don't think, since almost all the changes are under the hood. I'd say the usual checks of making sure all the expected form controls are there and that they save information across page loads as expected and such are all we need, just to make sure that the changes haven't broken anything unexpected and essential.
I didn’t try to break it, but I did code 2 articles, one with no events and another with one event and it seemed to work.
I didn’t try to break it, but I did code 2 articles, one with no events and another with one event and it seemed to work.
Okay, sounds promising. I think things are pretty much squared away for all this to happen, but I'm going to hold on hitting the buttons for now since it'll be better to do it while I can monitor things for a few hours, and I need to touch up a mostly-done ASA submission tomorrow and the coders should still have at least several days of articles to go anyway. But hopefully this will be an easy roll-out, probably on Thurs.
All right, here goes. My plan today is to:
- run `setup.py` on the production DB to load the 2Kish new articles

In the at-this-point-pretty-likely event that nothing goes wrong, nobody will notice anything except a new hash string in the article headers. But I'll be on call today through the afternoon just in case.