troubleshoot/optimize proper noun processing and display in Distant Reader

nkmeyers commented 4 years ago

The output of proper noun processing (NNP/NNPS) needs troubleshooting for the Distant Reader CORD19 project. Proper nouns, when processed and displayed, should retain their capitalization, and multi-word proper nouns should be chunked back together for redisplay if possible. Why? Because a proper noun is the name of a particular person, place, or thing and is spelled with an initial capital letter: "San Francisco" and "White House" are proper nouns. But in the prototype CORD19 Interventions Carrel, displays of proper nouns are forced to lower case and show single NNP tokens instead of multi-word proper nouns, like this:

and this

This ticket is related to code in bin/txt2pos.py and bin/about.pl, and to many other places in Distant Reader that act on redisplay of NNP tokens.

An improved display outcome for the DR CORD19 project might rely on chunking before re-processing or re-display of tokenized NNP/NNPS. See one solution that re-pairs adjacent NNP tokens here: https://stackoverflow.com/questions/49715600/nltk-joining-proper-nouns-after-tagging
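A minimal sketch of that chunking idea, assuming the input is a list of (token, tag) pairs like the ones a POS tagger produces. The function name `join_proper_nouns` and the sample sentence are hypothetical and are not taken from the Distant Reader code or from the linked answer.

```python
from itertools import groupby

PROPER_TAGS = {"NNP", "NNPS"}


def join_proper_nouns(tagged):
    """Merge runs of adjacent NNP/NNPS tokens into phrases, keeping their case."""
    phrases = []
    # groupby splits the sequence into consecutive runs of proper / non-proper tokens
    for is_proper, run in groupby(tagged, key=lambda pair: pair[1] in PROPER_TAGS):
        if is_proper:
            phrases.append(" ".join(token for token, _ in run))
    return phrases


# Hand-tagged sample sentence (illustrative only)
tagged = [
    ("The", "DT"), ("White", "NNP"), ("House", "NNP"), ("is", "VBZ"),
    ("far", "RB"), ("from", "IN"), ("San", "NNP"), ("Francisco", "NNP"), (".", "."),
]
print(join_proper_nouns(tagged))  # ['White House', 'San Francisco']
```

This only rejoins tokens that sit directly next to each other, so a name with a non-NNP word in the middle would still come out in pieces.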

Feels like we should maybe fix it up or comment it out? We could instead implement more specific NER options for the CORD19 project and concurrently comment out the Distant Reader Classic "proper" options from the menus/displays. But the NNP/NNPS are already tokenized "for free" and are just waiting for a little attention to be re-displayed a bit more meaningfully, and the UI is already there for showing them, so it feels like it makes more sense to pretty it up?

archaeocharlie commented 4 years ago

@nkmeyers It's very difficult to rebuild multi-word NNPs after they've been tokenized and tagged. Swapping in a SQL statement here that uses the named entities is likely easiest. Which entity types might be best?

nkmeyers commented 4 years ago

@archaeocharlie, yes! That'd be fine, as the display of proper nouns is a bit misleading the way it's presented now. For example, here's the stuff that shows up as proper nouns in one of my carrel reports:

Swapping in a SQL statement to pull in named entities instead would improve on it for the user. Maybe we could just substitute an NER section into the NNP place in the default carrel homepage/report?

Single words in a word cloud are probably not the best way to visualize proper nouns, which can be multi-word.

Can we do a booleanish query against token label = (NNP or NNPS) AND (NER TYPE = PERSON OR NORP OR FAC OR ORG OR GPE OR LOC OR PRODUCT OR EVENT )?

If we keep NNP in the default report: it looks like we're forcing the PROPN -> NNP token content to lower case before it's written to the TSV, etc. Maybe we should not do that lowering for NNP and NNPS?

If we keep the NNP word cloud and section as-is in the default report, I'd like to exclude single-character NNPs, unless there is a point in including them in the default report?

Right now it looks like it is built on singular NNP only, with no NNPS?

There's a bunch here to untangle. Let's find the simplest, most effective ways to substitute for or improve on what we've got, given what readers of COVID19 lit are likely looking for and what's likely to be most useful to them.

archaeocharlie commented 4 years ago

@nkmeyers the boolean (token label = (NNP or NNPS) AND (NER TYPE = PERSON OR NORP OR FAC OR ORG OR GPE OR LOC OR PRODUCT OR EVENT)) would be very difficult because of different units of analysis. The POS is looking at a token and the NER is looking at a span, which can be multiple tokens (e.g., "United States of America" would have different POS tags, not all NNP, but a single NER tag). It looks to me like the lowercase issue is coming from the SQL statement and not the data source.
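To make the unit-of-analysis mismatch concrete, here is a small sketch with hand-written annotations. The tags, the span offsets, and the helper names are assumptions for illustration and are not output from the Distant Reader pipeline.

```python
PROPER_TAGS = {"NNP", "NNPS"}

# Token-level POS annotations (hand-written for illustration)
tokens = [("United", "NNP"), ("States", "NNP"), ("of", "IN"), ("America", "NNP")]

# Span-level NER annotations: (start index, end index exclusive, label)
entities = [(0, 4, "GPE")]


def proper_tokens(tokens):
    """Return the individual tokens tagged as proper nouns."""
    return [token for token, tag in tokens if tag in PROPER_TAGS]


def entity_texts(tokens, entities):
    """Return the full text of each entity span together with its label."""
    return [
        (" ".join(token for token, _ in tokens[start:end]), label)
        for start, end, label in entities
    ]


print(proper_tokens(tokens))           # ['United', 'States', 'America']
print(entity_texts(tokens, entities))  # [('United States of America', 'GPE')]
```

Filtering by tag drops "of", so the token view can never reproduce the full name, while the span view carries one label for the whole phrase.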

I can swap in a different query to start that pulls some NER that should be proper nouns. I'm not sure how to test it locally, though, since this isn't a standalone script! Has containerization happened?
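A rough sketch of what such a query could look like, run against an in-memory SQLite database so it can be tried without the rest of the system. The table name `ent` and its `entity` and `type` columns are assumptions made for this example, not the confirmed Distant Reader schema, and `top_entities` is a hypothetical helper.

```python
import sqlite3

# Entity types suggested earlier in this thread
ENTITY_TYPES = ("PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT")


def top_entities(connection, limit=10):
    """Count named entities of the chosen types without changing their case."""
    placeholders = ", ".join("?" for _ in ENTITY_TYPES)
    sql = (
        "SELECT entity, COUNT(*) AS frequency FROM ent "
        f"WHERE type IN ({placeholders}) "
        "GROUP BY entity ORDER BY frequency DESC, entity ASC LIMIT ?"
    )
    return connection.execute(sql, (*ENTITY_TYPES, limit)).fetchall()


# Hypothetical table layout and rows, for illustration only
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE ent (entity TEXT, type TEXT)")
connection.executemany(
    "INSERT INTO ent VALUES (?, ?)",
    [
        ("United States of America", "GPE"),
        ("United States of America", "GPE"),
        ("World Health Organization", "ORG"),
        ("2020", "DATE"),
    ],
)
print(top_entities(connection))
# [('United States of America', 2), ('World Health Organization', 1)]
```

Because the query never applies LOWER(), multi-word names come back with their original capitalization, and types outside the list (such as DATE) are left out.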

ericleasemorgan commented 4 years ago

Charlie, you are becoming well-acquainted with the code. I send you a taco. --ELM

ericleasemorgan commented 4 years ago

Correct, the lower-case "issue" is coming from the SQL and not the underlying data. Things were lower-cased in order to provide a bit of normalization.

Containerization? No, that has not happened, yet.

nkmeyers commented 4 years ago

@ericleasemorgan - naive question, but are you to-lowering before you run the proper noun process that identifies NNP NNPS tokens or running that process after you've to-lowered the text into a bag of words?

ericleasemorgan commented 4 years ago

"Are you to-lowering before you run the proper noun process that identifies NNP NNPS tokens?"

No.

"Or running that process after you've to-lowered the text into a bag of words?"

Yes.

Proper nouns are extracted from plain text files, and the result is saved to the underlying database. Their case is not altered in this process.

The frequencies of proper nouns are calculated in carrel2about.py --> https://bit.ly/2B2ll8A, and it is there that they are normalized to lower case. I did this for verbs and nouns in order to reduce the number of duplicates. Maybe such normalizing is not necessary for proper nouns.
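The trade-off can be sketched with a plain frequency count. This is not the code in carrel2about.py; the `frequencies` helper and the sample words are hypothetical.

```python
from collections import Counter


def frequencies(words, lowercase=False):
    """Count words, optionally normalizing to lower case first."""
    if lowercase:
        words = [word.lower() for word in words]
    return Counter(words)


# Illustrative sample, including one lower-cased variant of "China"
proper_nouns = ["China", "SARS", "China", "RNA", "china"]

print(frequencies(proper_nouns, lowercase=True))
print(frequencies(proper_nouns, lowercase=False))
```

Lower-casing merges "China" and "china" into one entry with a count of 3, which reduces duplicates but turns "SARS" into "sars". Leaving case alone keeps the display form at the cost of separate entries for case variants.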

-- Eric

ericleasemorgan commented 4 years ago

Removed lower-casing in carrel2about.py, and thus the proper nouns retain their original case. Future carrels or carrels which are rebuilt completely from scratch ought to demonstrate the new behavior.

nkmeyers commented 4 years ago

@ericleasemorgan Can you or someone else take a look at why proper nouns are not stopworded? (Fig and Figure are still being interpreted as proper nouns even though they are in the stopwords list.) Can someone also take a look at why single characters are being interpreted as proper nouns? I think they are authors' first initials. This method of proper noun interpretation and reporting is still not very useful. From https://cord.distantreader.org/carrels/test-bit/index.htm:

Proper Nouns: An extraction of proper nouns helps you determine the names of people and places in your study carrel.

SARS, RNA, Fig, C, PCR, T, mg, coronavirus, B, China, A, HIV, II, Health, CT, IFN, S, Figure, ICU, M, N, CoV, ml, S., medRxiv, University, RT, C., vivo, United, E., L, USA, kg, pneumonia, IgG, M., CF, influenza, antigen, mRNA, ±, CD4, G, States, von, HCV, MS, doc_id, F
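One possible clean-up pass over a list like the one above, as a sketch only. It assumes the stopword list is lower case and is compared case-insensitively; if the real comparison is case-sensitive, "Fig" would slip past a stopword "fig", though that is a guess and has not been checked against the code. The function name, the `min_length` parameter, and the two-word stopword list are hypothetical.

```python
def filter_proper_nouns(candidates, stopwords, min_length=2):
    """Drop stopworded and too-short candidates, ignoring case and trailing periods."""
    stop = {word.lower() for word in stopwords}
    kept = []
    for candidate in candidates:
        core = candidate.rstrip(".")  # treat "S." the same as "S"
        if len(core) < min_length:
            continue  # single characters, likely initials or stray symbols
        if core.lower() in stop:
            continue  # stopword match regardless of capitalization
        kept.append(candidate)
    return kept


# A subset of the list reported above, with a hypothetical lower-case stopword list
candidates = ["SARS", "RNA", "Fig", "C", "PCR", "T", "China", "S.", "Figure", "±"]
stopwords = ["fig", "figure"]
print(filter_proper_nouns(candidates, stopwords))  # ['SARS', 'RNA', 'PCR', 'China']
```

The kept items retain their original case, so this kind of filter can sit alongside the change that removed lower-casing.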

ericleasemorgan / reader

troubleshoot/optimize proper noun processing and display in Distant Reader #71