davidskalinder / mpeds-coder

MPEDS Annotation Interface
MIT License

Get list of SOLR fields in use #67

Closed davidskalinder closed 4 years ago

davidskalinder commented 4 years ago

Originally posted by @davidskalinder in https://github.com/davidskalinder/mpeds-coder/issues/59#issuecomment-597771130:

I think a sensible way to proceed will be to get the (long) full list of fields in SOLR -- ideally with some counts of how many articles have them (though alas this makes the task much trickier) -- and to assess which ones we actually care about? I think that will give us a better picture of what we'd lose with a wide-table format that only keeps those important fields... So unless I hear otherwise I'll work on getting that list together so we can have a look at it.

This is almost done now that #61 is in the bag. I think the big remaining question is the universe of articles to consider. There are several million in SOLR now, which is presumably too many to bother with. I could just get the info for the 7k-ish articles in BPP production, but of course that leaves out fields that might be useful in some other deployment... BPP production might be the best place to start though.

I could expand my skillz by figuring out how to do a bunch of nifty counts in pandas, but given that we'll need the underlying data in a convenient place anyway, I think an Excel workbook with PivotTables is probably the better option, since it keeps everything in one place.

davidskalinder commented 4 years ago

Did the counts in pandas, which I think is worth it for things like counting "nulls". So analyze_solr_articles.py finds columns that contain multi-valued cells, counts nonmissing values in each column, and outputs the whole shebang into a CSV (at the moment, hard-coded into my home directory).
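For reference, the kind of counting described above can be sketched in a few lines of pandas. This is only an illustration, not the contents of analyze_solr_articles.py: the sample documents and the `is_multi` helper are made up, and it assumes multi-valued SOLR fields arrive as Python lists.

```python
import pandas as pd

# Hypothetical sample of SOLR documents; real field names and values will differ.
docs = [
    {"id": "a1", "TITLE": "First", "Author": ["Smith"], "Section": None},
    {"id": "a2", "TITLE": "Second", "Author": ["Lee", "Khan"]},
    {"id": "a3", "TITLE": "Third", "Section": "News"},
]

# Keys missing from a document become NaN in the resulting frame.
df = pd.DataFrame(docs)


def is_multi(cell):
    """Return True if a cell holds a list or tuple with more than one entry."""
    return isinstance(cell, (list, tuple)) and len(cell) > 1


# Columns in which at least one cell holds more than one value.
multi_valued = [col for col in df.columns if df[col].map(is_multi).any()]

# Number of nonmissing (non-null) values in each column.
nonmissing = df.count()

print("Columns with multi-valued cells:", multi_valued)
print("Nonmissing entries in each column:")
print(nonmissing)
```

`DataFrame.count()` treats both `None` and NaN as missing, which is what makes the "nulls" easy to count here. Writing the result out would then be a matter of calling `nonmissing.to_csv(...)` with whatever path is appropriate.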

Still need to reorder columns so that high-value columns come first.

davidskalinder commented 4 years ago

Should be all set now, in branch analyze_solr_articles. Once it's merged into other deployments, running the script should produce the analyses for the corresponding DB.

davidskalinder commented 4 years ago

Merged analyze_solr_articles directly into production (to avoid empty #57). Also merged analyze_solr_articles into testing.

davidskalinder commented 4 years ago

Live in testing deployment, and it doesn't appear to break anything immediately (breakage would be strange, since the whole branch is separate from almost all the rest of MAI). So I'm finna make it live in production also.

davidskalinder commented 4 years ago

Script runs fine in testing deployment but not in production. Luckily it doesn't seem to break anything else...

davidskalinder commented 4 years ago

Problem is almost certainly because of query length -- 20 IDs is fine, 7k, maybe not s'much. Will debug further in development deployment but maybe hack it to pull from the production DB...

davidskalinder commented 4 years ago

SOLR has (by default) a limit of 1024 boolean clauses per query (the maxBooleanClauses setting). This could be changed (in solrconfig.xml), but I just changed the script to call the API in chunks instead. NB that having the list of 7k articles in memory a few times was enough to use it up.
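A rough sketch of the chunking idea, for anyone hitting the same limit. This is not the code in the script: the `chunked` and `build_id_query` helpers, the chunk size of 500, and the choice of `id` as the query field are all assumptions for illustration, and the step that actually sends each query to SOLR is left out.

```python
def chunked(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_id_query(ids, field="id"):
    """Build a SOLR query string that ORs together one batch of IDs.

    The field name is an assumption; use whichever field holds the IDs.
    """
    quoted = " OR ".join('"{}"'.format(i) for i in ids)
    return "{}:({})".format(field, quoted)


# Hypothetical article IDs standing in for the real list.
article_ids = ["doc{}".format(n) for n in range(6453)]

# Assumed batch size, chosen to stay well below the default limit of 1024.
CHUNK_SIZE = 500

# One query string per batch; each would be sent to SOLR separately
# and the returned documents concatenated afterwards.
queries = [build_id_query(batch) for batch in chunked(article_ids, CHUNK_SIZE)]
print(len(queries), "queries to send")
```

Because the slices do not overlap and together cover the whole list, each ID ends up in exactly one query, which is worth checking if the number of returned documents doesn't match the number of IDs.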

Anyway, everything seems to be working, but need to check why 6453 article IDs have produced 7168 SOLR articles...

davidskalinder commented 4 years ago

Fixed a few indexing errors and all looks good now, running in development deployment but on production data. So all set in analyze_solr_articles and merged directly into testing and master.

davidskalinder commented 4 years ago

Now live and tested in testing and production.

Below are the counts by field in the current production deployment. Note that although many of the fields were multi-entry, none (for these articles, anyway) contained more than one entry.

Nonmissing entries in each column:
DATE                       6453
INTERNAL_ID                6453
PUBLICATION                6453
TEXT                       6453
TITLE                      6453
Abstract                   6453
Accession_number           2485
Author                     5501
Copyright                  6453
Country_of_publication     6453
DOCSOURCE                  6453
Database                   6453
Document_URL               6453
Document_feature           2483
Document_type              6453
Ethnicity                  6107
ISSN                       4363
Issue                      6017
Language_of_publication    6453
Last_updated               6453
Links                      6317
Location                   4773
Number_of_pages            6037
People                     3199
Place_of_publication       6453
ProQuest_document_ID       6453
Publication_date           6453
Publication_subject        6453
Publication_year           6453
Publisher                  6453
Section                    1373
Source_type                6453
Subject                    5998
Volume                     6015
Year                       6453
_version_                  6453
id                         6453

davidskalinder commented 4 years ago

Note to myself from conversation with AH 2020-04-08:

Some of the fields in SOLR aren't consistent -- AH's scripts tried to cobble them together. Files are in gdelt ethnic_newswatch_something

davidskalinder commented 4 years ago

Also for reference the list of every field in the SOLR instance (with field attributes like multiValued, type, required etc., but not with any usage stats) is at server/solr/mpeds2/conf/managed-schema from the SOLR instance root.
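If it's ever useful to get that list programmatically, the schema file is XML with one `<field>` element per field, so something like the sketch below should work. The excerpt is made up to show the shape of the file (it is not copied from the mpeds2 schema), and `list_fields` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the shape of a SOLR managed-schema file.
SCHEMA_XML = """<schema name="example" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="Author" type="text_general" multiValued="true"/>
  <field name="_version_" type="plong" indexed="false" stored="false"/>
</schema>
"""


def list_fields(xml_text):
    """Return one dict of attributes (name, type, multiValued, ...) per <field>."""
    root = ET.fromstring(xml_text)
    return [dict(field.attrib) for field in root.iter("field")]


for field in list_fields(SCHEMA_XML):
    # Attributes that are not set on a field are simply absent from its dict.
    print(field["name"], field.get("type"), field.get("multiValued"))
```

For the real file, reading its contents and passing the text to `list_fields` (or using `ET.parse(path).getroot()`) would replace the inline string.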