Open l-monninger opened 1 year ago
It took too much time to find intersections (2 hours were not enough), so I may need another way to do the graph work linking papers. I don't know much about knowledge graphs, but instead of linking by sentence, which is what is done now, could we draw a knowledge graph from title tokens or abstract tokens? That could be an alternate way of linking papers. If the graph looks clear, we can check adjacent papers and compare their categories. Please share some ideas on this!
@ahreumcho I believe I understand where you are stuck. You can take a cartesian product (a cross join) of your subset of the papers, intersect the two token arrays, and drop rows where the intersection is empty. In Pandas, this should look something like...
# compute the cross product
product = papers.merge(papers, how="cross")
# compute the intersections of all arrays
def intersection(left, right):
    return list(set(left) & set(right))

product["shared_top_tokens"] = product.apply(
    lambda row: intersection(row["top_tokens_x"], row["top_tokens_y"]),
    axis=1,
)
# drop rows that do not intersect
product = product[product["shared_top_tokens"].map(len) > 0]
graph = product[["id_x", "id_y"]]
This is going to be $O(n^2)$ in space, so you will very likely want to do this on a row and column subset of your data.
If you want to do the above over the whole dataset, you'll want to process the left side in chunks, otherwise you will run out of memory. Unfortunately, you're not going to be able to improve the time performance without randomized algorithms or something very esoteric.
import pandas as pd

# compute the intersections of all arrays
def intersection(left, right):
    return list(set(left) & set(right))

step = 1000
combined = pd.DataFrame()
for start in range(0, len(papers), step):
    # compute the cross product for this chunk of the left side
    product = papers.iloc[start:start + step].merge(papers, how="cross")
    product["shared_top_tokens"] = product.apply(
        lambda row: intersection(row["top_tokens_x"], row["top_tokens_y"]),
        axis=1,
    )
    # drop rows that do not intersect
    product = product[product["shared_top_tokens"].map(len) > 0]
    graph = product[["id_x", "id_y"]]
    combined = pd.concat([combined, graph])
Tokens and BFS:
You will now have a graph that relates papers to papers by top tokens.
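To traverse that graph, a plain BFS over an adjacency dict built from the `id_x`/`id_y` edge pairs is enough to find all papers reachable from a seed paper. A minimal stdlib-only sketch on toy edges (the edge list here is illustrative, standing in for the pairs computed above):

```python
from collections import defaultdict, deque

def bfs(edges, start):
    """Breadth-first search over an undirected edge list.

    Returns the set of node ids reachable from `start`,
    including `start` itself.
    """
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# toy edge list standing in for combined[["id_x", "id_y"]].itertuples(index=False)
edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(bfs(edges, "a"))  # {'a', 'b', 'c'}
```

From there you can compare the categories of each connected component's papers to see whether token overlap is a meaningful link.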
Okay, the above should show that token joins are not good enough, so we start looking at embeddings.
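As a sketch of the embedding route: embed each title or abstract with whatever model you choose, then link papers whose vectors exceed a cosine-similarity threshold. The toy vectors and threshold below are illustrative, not outputs from a real model; the similarity step itself needs nothing beyond the stdlib:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy "embeddings" standing in for real model output
embeddings = {
    "paper_1": [0.9, 0.1, 0.0],
    "paper_2": [0.8, 0.2, 0.1],
    "paper_3": [0.0, 0.1, 0.9],
}

# link every pair of papers whose similarity clears the threshold
threshold = 0.8
items = list(embeddings.items())
edges = [
    (a, b)
    for i, (a, u) in enumerate(items)
    for b, v in items[i + 1:]
    if cosine_similarity(u, v) > threshold
]
print(edges)  # [('paper_1', 'paper_2')]
```

This pairwise comparison is still $O(n^2)$, so the same chunking trick applies, but the resulting graph tends to be far less noisy than raw token joins.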
Model