Did the counts in pandas, which I think is worth it for things like counting "nulls". So analyze_solr_articles.py finds columns that contain multi-valued cells, counts nonmissing values in each column, and outputs the whole shebang into a CSV (at the moment, hard-coded into my home directory).
Still need to reorder columns so that high-value columns come first.
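For reference, the core of the summary step looks roughly like this (a minimal sketch, not the actual script: the DataFrame `df`, the delimiter assumed for multi-valued cells, and the output path are all placeholders), including the high-value-first reordering I still need to add:

```python
import pandas as pd

def summarize_columns(df, multi_delim=";", out_path="solr_field_counts.csv"):
    """Count nonmissing values per column and flag columns with multi-valued cells."""
    summary = pd.DataFrame({
        "nonmissing": df.notna().sum(),
        "multi_valued": df.apply(
            lambda col: col.dropna().astype(str).str.contains(multi_delim).any()
        ),
    })
    # Reorder so the most-populated (high-value) fields come first
    summary = summary.sort_values("nonmissing", ascending=False)
    summary.to_csv(out_path)
    return summary
```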
Should be all set now, in branch analyze_solr_articles. Once it's merged into other deployments, running the script should produce the analyses for the corresponding DB.
Merged analyze_solr_articles directly into production (to avoid empty #57). Also merged analyze_solr_articles into testing.
Live in testing deployment, which doesn't appear to break anything immediately (breakage would be strange anyway, since the whole branch is separate from almost all the rest of MAI). So I'm finna make it live in production also.
Script runs fine in testing deployment but not in production. Luckily it doesn't seem to break anything else...
Problem is almost certainly because of query length -- 20 IDs is fine, 7k, maybe not s'much. Will debug further in development deployment but maybe hack it to pull from the production DB...
SOLR's API has (by default) a limit of 1024 boolean clauses per query. This could be changed (in solrconfig.xml), but instead I just changed the script to call the API in chunks. NB that holding the list of 7k articles in memory a few times was enough to use up the memory.
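The chunking approach is roughly the following (a sketch only; the SOLR URL, core name, and chunk size here are placeholders, not what the deployment config actually uses):

```python
import requests

SOLR_URL = "http://localhost:8983/solr/mpeds2/select"  # placeholder; real URL comes from config
CHUNK_SIZE = 500  # comfortably under the default 1024 boolean-clause limit

def fetch_in_chunks(article_ids, chunk_size=CHUNK_SIZE):
    """Query SOLR for documents by ID in batches to stay under the clause limit."""
    docs = []
    for start in range(0, len(article_ids), chunk_size):
        chunk = article_ids[start:start + chunk_size]
        # Quote each ID in case it contains characters SOLR would parse specially
        query = "id:(" + " OR ".join('"%s"' % i for i in chunk) + ")"
        resp = requests.get(SOLR_URL, params={"q": query, "rows": len(chunk), "wt": "json"})
        resp.raise_for_status()
        docs.extend(resp.json()["response"]["docs"])
    return docs
```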
Anyway, everything seems to be working, but need to check why 6453 article IDs have produced 7168 SOLR articles...
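The kind of sanity check I have in mind is something like this (again just a sketch; `ids_requested` and `docs` stand in for whatever the script actually has in hand):

```python
from collections import Counter

def check_coverage(ids_requested, docs):
    """Compare the requested article IDs against the IDs SOLR actually returned."""
    returned = Counter(doc["id"] for doc in docs)
    duplicates = {i: n for i, n in returned.items() if n > 1}
    missing = set(ids_requested) - set(returned)
    extra = set(returned) - set(ids_requested)
    print(f"requested: {len(set(ids_requested))}, returned: {sum(returned.values())}")
    print(f"duplicates: {len(duplicates)}, missing: {len(missing)}, extra: {len(extra)}")
    return duplicates, missing, extra
```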
Fixed a few indexing errors and all looks good now, running in development deployment but on production data. So all set in analyze_solr_articles and merged directly into testing and master.
Now live and tested in testing and production.
Below are the counts by field in the current production deployment. Note that although many of the fields were multi-entry, none (for these articles, anyway) contained more than one entry.
Nonmissing entries in each column:
DATE 6453
INTERNAL_ID 6453
PUBLICATION 6453
TEXT 6453
TITLE 6453
Abstract 6453
Accession_number 2485
Author 5501
Copyright 6453
Country_of_publication 6453
DOCSOURCE 6453
Database 6453
Document_URL 6453
Document_feature 2483
Document_type 6453
Ethnicity 6107
ISSN 4363
Issue 6017
Language_of_publication 6453
Last_updated 6453
Links 6317
Location 4773
Number_of_pages 6037
People 3199
Place_of_publication 6453
ProQuest_document_ID 6453
Publication_date 6453
Publication_subject 6453
Publication_year 6453
Publisher 6453
Section 1373
Source_type 6453
Subject 5998
Volume 6015
Year 6453
_version_ 6453
id 6453
Note to myself from conversation with AH 2020-04-08:
Some of the fields in SOLR aren't consistent -- AH's scripts tried to cobble them together. Files are in gdelt ethnic_newswatch_something.
Also for reference, the list of every field in the SOLR instance (with field attributes like multiValued, type, required, etc., but not with any usage stats) is at server/solr/mpeds2/conf/managed-schema from the SOLR instance root.
Originally posted by @davidskalinder in https://github.com/davidskalinder/mpeds-coder/issues/59#issuecomment-597771130:
This is almost done now that #61 is in the bag. I think the big remaining question is the universe of articles to consider. There are several million in SOLR now, which is presumably too many to bother with. I could just get the info for the 7k-ish articles in BPP production, but of course that leaves out fields that might be useful in some other deployment... BPP production might be the best place to start though.
I could expand my skillz by figuring out how to do a bunch of nifty counts in pandas, but given that we'll need the underlying data in a convenient place anyway, I think an Excel workbook with pivot tables is probably a better option to keep everything in one place.