Open bhpayne opened 7 months ago
Gephi uses Source,Target,Type,Id,Label,Weight
Source -> FileID
Target -> TokenID
Type -> (LaTeX Environments, Verbs, Math mode)
Label -> Token:String
Weight -> TF-IDF
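A rough sketch of how that mapping could be written out as a Gephi edge list, in case it helps the discussion. The function names (`tf_idf_edges`, `edges_to_csv`), the `t0`, `t1`, ... token IDs, and the sample tokens are made up for illustration and are not taken from the existing code base; the weight is assumed to be plain `tf * log(N / df)`. One caveat: as far as I know, Gephi reads an edge column named Type as Directed/Undirected, so the token category may need to go under a different column name.

```python
import csv
import io
import math
from collections import Counter

COLUMNS = ["Source", "Target", "Type", "Id", "Label", "Weight"]

def tf_idf_edges(docs):
    """docs maps FileID -> list of (token, category) pairs; returns edge rows."""
    n_docs = len(docs)
    # Document frequency: in how many files does each token occur?
    df = Counter()
    for pairs in docs.values():
        df.update({tok for tok, _ in pairs})

    token_ids = {}  # token string -> hypothetical TokenID ("t0", "t1", ...)
    rows = []
    for file_id, pairs in docs.items():
        tf = Counter(tok for tok, _ in pairs)
        category = dict(pairs)  # token -> category (last one seen wins)
        for tok, count in tf.items():
            token_id = token_ids.setdefault(tok, f"t{len(token_ids)}")
            weight = (count / len(pairs)) * math.log(n_docs / df[tok])
            rows.append({
                "Source": file_id,
                "Target": token_id,
                "Type": category[tok],
                "Id": len(rows),
                "Label": tok,
                "Weight": round(weight, 6),
            })
    return rows

def edges_to_csv(rows):
    """Serialize edge rows as CSV text using the column order listed above."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample_docs = {
    "f1": [("\\frac", "Math mode"), ("\\frac", "Math mode"),
           ("\\begin", "LaTeX Environments")],
    "f2": [("\\begin", "LaTeX Environments")],
}
print(edges_to_csv(tf_idf_edges(sample_docs)))
```

A token that appears in every file gets weight 0 under this scoring, so those edges could be dropped before import if they clutter the graph.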
Edges might consist of transformations applied to a node.
I think the structure implemented in the existing code base handles the 3 variant cases listed in the Wikipedia link.
Inverted index
Impact-ordered postings, with tf-idf scoring, which can easily be replaced by similar scores such as BM25
Positional postings list
https://en.wikipedia.org/wiki/Postings_list
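To illustrate the point about swapping tf-idf for BM25 in the impact-ordered variant, here is a minimal sketch where the scoring function is a parameter. Everything here is an assumption for illustration: the function names, the `{file_id: {token: count}}` input shape, and the BM25 defaults `k1=1.2`, `b=0.75` are common textbook choices, not what the existing code base does.

```python
import math

def tf_idf(tf, df, n_docs, doc_len, avg_len):
    # Length-normalized term frequency times log inverse document frequency.
    return (tf / doc_len) * math.log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    # Okapi BM25 with the "+1" idf variant, which keeps idf non-negative.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def impact_ordered_postings(term_counts, score=tf_idf):
    """term_counts: {file_id: {token: count}}.

    Returns {token: [(file_id, score), ...]} with each postings list
    sorted by descending score (highest impact first).
    """
    n_docs = len(term_counts)
    doc_len = {f: sum(c.values()) for f, c in term_counts.items()}
    avg_len = sum(doc_len.values()) / n_docs

    df = {}
    for counts in term_counts.values():
        for tok in counts:
            df[tok] = df.get(tok, 0) + 1

    postings = {}
    for f, counts in term_counts.items():
        for tok, tf in counts.items():
            s = score(tf, df[tok], n_docs, doc_len[f], avg_len)
            postings.setdefault(tok, []).append((f, s))

    for plist in postings.values():
        plist.sort(key=lambda p: p[1], reverse=True)
    return postings

counts = {"a": {"x": 3, "y": 1}, "b": {"x": 1, "y": 3}, "c": {"z": 2}}
print(impact_ordered_postings(counts))              # tf-idf ordering
print(impact_ordered_postings(counts, score=bm25))  # same index, BM25 scores
```

Because both scorers take the same arguments, switching between them only changes the `score=` argument; the index-building code stays the same.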
Unique FileID:hash -> TokenIDS:list
Unique TokenID:hash -> FileIDS:list
unique primary key, not null (FileID,TokenID)
(FileID,TokenID) -> offsets:list
sample queries
(*,*) -> returns list of lists of offsets for all FileIDs and all TokenIDs
(FileID,*) -> returns list of lists for all TokenID:offset pairs
(*,TokenID) -> a list of all FileIDs that match token
or any combination of subsets of FileIDs or TokenIDs
([fileid1,fileid2,...],[tokenid1,tokenid5,...]) -> list of lists of offsets
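A minimal in-memory sketch of the three mappings and the wildcard queries above, to make the proposal concrete. The class and method names (`PositionalIndex`, `add`, `query`, `files_for_token`) are hypothetical, plain strings stand in for the hashes, and `None` plays the role of `*`.

```python
from collections import defaultdict

class PositionalIndex:
    def __init__(self):
        self.tokens_by_file = defaultdict(list)  # FileID -> TokenIDs
        self.files_by_token = defaultdict(list)  # TokenID -> FileIDs
        self.offsets = {}                        # (FileID, TokenID) -> offsets

    def add(self, file_id, token_id, offset):
        """Record one occurrence of token_id in file_id at the given offset."""
        key = (file_id, token_id)
        if key not in self.offsets:
            # First time this pair is seen: register it in both directions.
            self.offsets[key] = []
            self.tokens_by_file[file_id].append(token_id)
            self.files_by_token[token_id].append(file_id)
        self.offsets[key].append(offset)

    def query(self, file_ids=None, token_ids=None):
        """Return a list of offset lists; None means '*' for that position."""
        files = list(self.tokens_by_file) if file_ids is None else file_ids
        result = []
        for f in files:
            for t in self.tokens_by_file.get(f, []):
                if token_ids is None or t in token_ids:
                    result.append(self.offsets[(f, t)])
        return result

    def files_for_token(self, token_id):
        """(*, TokenID): all FileIDs that contain the token."""
        return list(self.files_by_token.get(token_id, []))

idx = PositionalIndex()
idx.add("f1", "t1", 0)
idx.add("f1", "t1", 7)
idx.add("f1", "t2", 3)
idx.add("f2", "t1", 5)

print(idx.query())                      # (*,*)
print(idx.query(["f1"]))                # (FileID,*)
print(idx.files_for_token("t1"))        # (*,TokenID)
print(idx.query(["f1", "f2"], ["t1"]))  # subsets of both
```

The `(FileID, TokenID)` dictionary key enforces the unique, non-null primary key; a database-backed version would presumably use a composite key for the same purpose.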
What nodes and edges would be useful for a graph of LaTeX content? What properties would the nodes and edges have?