allofphysicsgraph / latex-in-arxiv

extract math latex from content in arxiv

identify potential schemas for a property graph for Latex content #23

Open bhpayne opened 7 months ago

bhpayne commented 7 months ago

What nodes and edges would be useful for a graph of LaTeX content? What properties would the nodes and edges have?

msgoff commented 7 months ago

Gephi uses Source, Target, Type, Id, Label, Weight

Source -> FileID

Target -> TokenID

Type -> (Latex Environments, Verbs, Math mode)

Label -> Token:String

Weight -> TF-IDF

Edges might consist of transformations applied to a node.
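A minimal sketch of how the mapping above could be turned into a Gephi-style edge table. Everything named here is an assumption for illustration: the function names `gephi_edge_rows` and `edges_to_csv`, the use of the raw token string as the TokenID, the `token_types` lookup, and the particular TF-IDF formula (term frequency times `log(N / df)`). As far as I know, Gephi itself reads the Type column as Directed/Undirected on import, so the token category may need its own column in practice.

```python
import csv
import io
import math
from collections import Counter

COLUMNS = ["Source", "Target", "Type", "Id", "Label", "Weight"]


def gephi_edge_rows(docs, token_types=None):
    """Build one edge row per (file, token) pair.

    docs: {file_id: [token, ...]} (hypothetical input shape)
    token_types: optional {token: category}, e.g. "math mode"
    """
    token_types = token_types or {}
    n_docs = len(docs)
    # Document frequency: number of files each token appears in.
    df = Counter(tok for tokens in docs.values() for tok in set(tokens))
    rows = []
    for file_id, tokens in docs.items():
        for tok, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / df[tok])
            rows.append({
                "Source": file_id,
                "Target": tok,  # assumption: token string doubles as TokenID
                "Type": token_types.get(tok, "unknown"),
                "Id": len(rows),
                "Label": tok,
                "Weight": round(tf * idf, 6),
            })
    return rows


def edges_to_csv(rows):
    """Serialize the rows as CSV text with the column names listed above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With this weighting, a token that occurs in every file gets weight 0, which is the usual TF-IDF behaviour for uninformative tokens.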

msgoff commented 6 months ago

I think the structure implemented in the existing code base handles the 3 variant cases listed in the Wikipedia link.

Inverted index

Impact-ordered postings, with tf-idf scoring, which can easily be replaced by similar scores such as BM25

Positional postings list

https://en.wikipedia.org/wiki/Postings_list
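To make the impact-ordered variant concrete, here is a small sketch that is not taken from the code base. It assumes the scores (tf-idf, BM25, or anything similar) have already been computed and are stored as `{token_id: {file_id: score}}`; the names `impact_ordered` and `top_k` are hypothetical.

```python
def impact_ordered(postings):
    """Sort each token's postings by descending score.

    postings: {token_id: {file_id: score}} (assumed input shape)
    Ties are broken by file_id so the order is deterministic.
    """
    return {
        token_id: sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        for token_id, scores in postings.items()
    }


def top_k(ordered, token_id, k):
    """Return the k highest-scoring (file_id, score) pairs for a token."""
    return ordered.get(token_id, [])[:k]
```

Because the list is already sorted by score, a top-k lookup only reads the head of the list, and swapping the scoring function does not change the structure.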

Unique FileID:hash -> TokenIDS:list

Unique TokenID:hash -> FileIDS:list

unique primary key, not null (FileID,TokenID)

(FileID,TokenID) -> offsets:list

sample queries

(*,*) -> returns list of lists of offsets for all fileids and all tokenids

(FileID,*) -> returns list of lists for all tokenids:offset pairs

(*,TokenID) -> a list of all fileIDs that match token

or any combination of subsets of fileIDs or TokenIDs

([fileid1,fileid2,...],[tokenid1,tokenid5,...]) -> list of lists of offsets
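The layout and queries above could be sketched roughly as follows. This is an illustration, not the existing code base: the class name `PositionalIndex`, its method names, the use of plain strings instead of hashes for IDs, and `None` standing in for `*` are all assumptions.

```python
from collections import defaultdict


class PositionalIndex:
    """In-memory sketch of the (FileID, TokenID) -> offsets structure."""

    def __init__(self):
        self.offsets = {}                        # (file_id, token_id) -> [offset, ...]
        self.tokens_by_file = defaultdict(set)   # file_id -> {token_id, ...}
        self.files_by_token = defaultdict(set)   # token_id -> {file_id, ...}

    def add(self, file_id, token_id, offset):
        """Record one occurrence of a token at a given offset in a file."""
        self.offsets.setdefault((file_id, token_id), []).append(offset)
        self.tokens_by_file[file_id].add(token_id)
        self.files_by_token[token_id].add(file_id)

    def files_for(self, token_id):
        """(*, TokenID): sorted list of files containing the token."""
        return sorted(self.files_by_token.get(token_id, set()))

    def query(self, file_ids=None, token_ids=None):
        """Return {(file_id, token_id): offsets}; None acts as the wildcard."""
        files = list(self.tokens_by_file) if file_ids is None else file_ids
        wanted = None if token_ids is None else set(token_ids)
        result = {}
        for file_id in files:
            tokens = self.tokens_by_file.get(file_id, set())
            if wanted is not None:
                tokens = tokens & wanted
            for token_id in sorted(tokens):
                result[(file_id, token_id)] = self.offsets[(file_id, token_id)]
        return result
```

For example, `index.query()` corresponds to `(*,*)`, `index.query(["fileid1"])` to `(FileID,*)`, and `index.query(["fileid1", "fileid2"], ["tokenid1"])` to the subset form. Returning a dict keyed by the pair, rather than a bare list of lists, is a design choice made here so each offset list stays labelled.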