Open l-monninger opened 1 year ago
It took too much time to find intersections (2 hours were not enough), so I may need another way to do the graph work linking papers. I don't know much about knowledge graphs, but instead of linking by sentence, which is what is done now, could we draw a knowledge graph from title tokens or abstract tokens? That could be an alternate way of linking papers. If the graph looks clear, we can check adjacent papers and compare their categories. Please share some ideas on this!
@ahreumcho I believe I understand where you are stuck. You can take a cartesian product (a cross join) of your subset of the papers, intersect the two token arrays, and drop rows where the intersection is empty. In Pandas, this should look something like...
# compute the cross product
product = papers.merge(papers, how="cross")
# compute the intersections of all arrays
def intersection(left, right):
    return list(set(left) & set(right))

product["shared_top_tokens"] = product.apply(
    lambda row: intersection(row["top_tokens_x"], row["top_tokens_y"]),
    axis=1,
)
# drop rows that do not intersect
product = product[product["shared_top_tokens"].map(len) > 0]
graph = product[["id_x", "id_y"]]
This is going to be $O(n^2)$ in space, so you will very likely want to do this on a row and column subset of your data.
If you want to do the above over the whole dataset, you'll want to process the left side in chunks, otherwise you will run out of memory. Unfortunately, you're not going to be able to improve the time performance without randomized algorithms or something very esoteric.
import pandas as pd

# compute the intersections of all arrays
def intersection(left, right):
    return list(set(left) & set(right))

step = 1000
combined = pd.DataFrame()
for start in range(0, len(papers), step):
    # compute the cross product for this chunk of the left side
    product = papers.iloc[start:start + step].merge(papers, how="cross")
    product["shared_top_tokens"] = product.apply(
        lambda row: intersection(row["top_tokens_x"], row["top_tokens_y"]),
        axis=1,
    )
    # drop rows that do not intersect
    product = product[product["shared_top_tokens"].map(len) > 0]
    graph = product[["id_x", "id_y"]]
    combined = pd.concat([combined, graph])
Tokens and BFS:
You will now have a graph that relates papers to papers by top tokens.
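To traverse that graph, a plain BFS over an adjacency dict built from the `id_x`/`id_y` edge pairs is enough to find all papers reachable from a seed paper. A minimal stdlib-only sketch on toy edges (the edge list here is illustrative, standing in for the pairs computed above):

```python
from collections import defaultdict, deque

def bfs(edges, start):
    """Breadth-first search over an undirected edge list.

    Returns the set of node ids reachable from `start`,
    including `start` itself.
    """
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# toy edge list standing in for combined[["id_x", "id_y"]].itertuples(index=False)
edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(bfs(edges, "a"))  # {'a', 'b', 'c'}
```

From there you can compare the categories of each connected component's papers to see whether token overlap is a meaningful link.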
Okay, the above should show that token joins are not good enough, so we start looking at embeddings.
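As a sketch of the embedding route: embed each title or abstract with whatever model you choose, then link papers whose vectors exceed a cosine-similarity threshold. The toy vectors and threshold below are illustrative, not outputs from a real model; the similarity step itself needs nothing beyond the stdlib:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy "embeddings" standing in for real model output
embeddings = {
    "paper_1": [0.9, 0.1, 0.0],
    "paper_2": [0.8, 0.2, 0.1],
    "paper_3": [0.0, 0.1, 0.9],
}

# link every pair of papers whose similarity clears the threshold
threshold = 0.8
items = list(embeddings.items())
edges = [
    (a, b)
    for i, (a, u) in enumerate(items)
    for b, v in items[i + 1:]
    if cosine_similarity(u, v) > threshold
]
print(edges)  # [('paper_1', 'paper_2')]
```

This pairwise comparison is still $O(n^2)$, so the same chunking trick applies, but the resulting graph tends to be far less noisy than raw token joins.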
Model