TheDataRideAlongs / ProjectDomino

Scaling COVID public behavior change and anti-misinformation
Apache License 2.0

Neo4j topic field for BERT scores #72

Open lmeyerov opened 4 years ago

lmeyerov commented 4 years ago

We want to record BERT model scores as properties on all our main entity types: URL, Tweet, and Account. (Any others w.r.t. Twitter?) Initially this is just one general topic classifier; later we may have multiple models (and so multiple scores) per entity for other aspects. It is currently unclear when the classifiers will be run or re-run.

For V1, the thinking is:

-- A manual job trains the model and saves it to disk (or uses a pretrained one)
-- A manual out-of-band job scores all Tweets, but not accounts/URLs
-- ... It writes the scores back into neo4j. It does not update record_modified_at, because the base data is not modified?

For this ticket:

-- We need field(s) in the Tweet, Account, and URL entities for the topic vector
-- The vector is 1024 floats
-- We'd benefit from an example of writing that datatype into neo4j and reading it back
-- Is this 1024 different fields, one binary buffer, ...?
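As a starting point for the write/read example asked for above, here is a minimal sketch. Neo4j can store a homogeneous list of floats as a single property, so one option is a single list-valued field rather than 1024 separate fields. The label `Tweet`, the key `id`, the property name `topic_vector`, and the helper names are all hypothetical; `session` is assumed to be a session from the official Python driver (e.g. `GraphDatabase.driver(uri, auth=(user, password)).session()`). The `pack_vector`/`unpack_vector` helpers sketch the "one binary buffer" alternative as float32, which halves the size but loses precision relative to Neo4j's 64-bit floats.

```python
import struct
from typing import Any, List, Optional, Sequence

VECTOR_DIM = 1024  # from the ticket: the topic vector is 1024 floats

# Hypothetical label / key / property names; adjust to the real schema.
# Only topic_vector is set, so record_modified_at is left untouched.
WRITE_QUERY = "MATCH (t:Tweet {id: $id}) SET t.topic_vector = $vector"
READ_QUERY = "MATCH (t:Tweet {id: $id}) RETURN t.topic_vector AS vector"


def write_topic_vector(session: Any, tweet_id: str, vector: Sequence[float]) -> None:
    """Store the vector as one list-of-floats property on the Tweet node."""
    if len(vector) != VECTOR_DIM:
        raise ValueError(f"expected {VECTOR_DIM} floats, got {len(vector)}")
    params = {"id": tweet_id, "vector": [float(x) for x in vector]}
    session.run(WRITE_QUERY, params).consume()


def read_topic_vector(session: Any, tweet_id: str) -> Optional[List[float]]:
    """Read the vector back; None if the node or the property is missing."""
    record = session.run(READ_QUERY, {"id": tweet_id}).single()
    if record is None or record["vector"] is None:
        return None
    return list(record["vector"])


def pack_vector(vector: Sequence[float]) -> bytes:
    """Alternative representation: little-endian float32 buffer (4 bytes each)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_vector(buf: bytes) -> List[float]:
    """Inverse of pack_vector."""
    return list(struct.unpack(f"<{len(buf) // 4}f", buf))
```

The list form keeps the values readable from Cypher; the packed form would be opaque to Cypher and only useful if all reads go through client code.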

What is unclear for this ticket is the data representation and indexing. Writes are generally keyed by entity id, which is fine. Ideally we can do similarity-based search, like "show all Tweets whose vector is within 2 units of this search vector". If neo4j can do this natively, great; otherwise maybe we ultimately use FAISS, as a service or out of band, to generate similarity results and feed them back as relations. For now, we can probably get away with an on-the-fly full DB scan where we page through all (tweet_id, topic vector) pairs.
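The brute-force fallback could look roughly like the sketch below. It assumes "within 2 units" means Euclidean distance (the ticket does not say which metric), and that something else yields pages of (tweet_id, vector) rows, for example a paged Cypher query. The function name `scan_within` and the demo data are made up for illustration.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Row = Tuple[str, Sequence[float]]  # (tweet_id, topic vector)


def scan_within(
    pages: Iterable[Iterable[Row]],
    query: Sequence[float],
    radius: float,
) -> List[Tuple[str, float]]:
    """Full scan: keep every tweet whose vector is within `radius` of `query`.

    Assumes Euclidean distance. Pages are consumed one at a time, so only
    the hits (not the whole table) are held in memory.
    """
    hits: List[Tuple[str, float]] = []
    for page in pages:
        for tweet_id, vector in page:
            distance = math.dist(query, vector)  # Python 3.8+
            if distance <= radius:
                hits.append((tweet_id, distance))
    hits.sort(key=lambda hit: hit[1])  # closest first
    return hits


# Tiny 2-d demo instead of real 1024-d vectors.
demo_pages = [
    [("t1", [0.0, 0.0]), ("t2", [3.0, 4.0])],
    [("t3", [1.0, 1.0])],
]
demo_hits = scan_within(demo_pages, query=[0.0, 0.0], radius=2.0)
print([tweet_id for tweet_id, _ in demo_hits])  # ['t1', 't3']
```

This is linear in the number of tweets per query, so it is only a stopgap until a native index or something like FAISS is in place.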

cc @bechbd