BridgesUNCC / cs-materials-react-webclient

https://cs-materials.herokuapp.com/

Think about what similarity means #12

Closed esaule closed 3 years ago

esaule commented 4 years ago

This is more of a thinking piece than a precise to-do item. We are going to need to build similarity views eventually. We have a prototype search feature.

But the underlying question is "what does it mean to be similar?" when the data we have are entries in a tree. Is similarity only when we get an exact match of all nodes? Is it when we can get an exact match of all nodes? If I search for a node, should any of the downstream nodes also match, maybe at a lower level? If I search for a node, should a sibling be considered similar to some extent?

Is there any work on something like that? I'll assign this one to the four of us. But Alec or Matthew, if you can search for whether anyone else has had that problem before, that could be useful. I think there is, because Yahoo and AltaVista used to index the web into a tree before Google reinvented search.

krs-world commented 4 years ago

When we were working on the proposal, I do remember comparing our data structures' radial graphs and seeing what I might have missed in correcting the classification. So we should have that capability, at least so we can assess how helpful it is.

Beyond that, we could have comparison views that focus on a subtree of the graph and perhaps provide a match metric. This need not look only at exact matches; it could integrate the matches in that subgraph and give some kind of measure.

I can also imagine having these measures automatically color the subgraphs based on the matches across two sets of assignments/materials, starting from a specific level of the hierarchy (say, Knowledge Area or something). This should be possible to do interactively so that you can quickly assess at a higher level where the major differences are.
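A minimal sketch of the per-subtree match idea, assuming a made-up two-level ontology and two hypothetical sets of tags (none of these names come from the actual CS Materials data). It scores each top-level area separately, which is the kind of number a view could map to a color:

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy ontology: parent -> children.
TREE: Dict[str, List[str]] = {
    "root": ["Algorithms", "Data Structures"],
    "Algorithms": ["Sorting", "Graph Search"],
    "Data Structures": ["Lists", "Trees"],
}


def subtree(node: str) -> Set[str]:
    """Return the node together with all of its descendants."""
    nodes = {node}
    for child in TREE.get(node, []):
        nodes |= subtree(child)
    return nodes


def subtree_match(area: str, a: Set[str], b: Set[str]) -> Optional[float]:
    """Jaccard overlap of two tag sets, restricted to one subtree.

    Returns None when neither set touches the subtree, so a view can
    render that area as "no data" instead of as a mismatch.
    """
    scope = subtree(area)
    a_in, b_in = a & scope, b & scope
    union = a_in | b_in
    if not union:
        return None
    return len(a_in & b_in) / len(union)


# Two hypothetical sets of materials, each tagged with ontology nodes.
course_a = {"Sorting", "Graph Search", "Lists"}
course_b = {"Sorting", "Trees"}

scores = {area: subtree_match(area, course_a, course_b) for area in TREE["root"]}
print(scores)  # {'Algorithms': 0.5, 'Data Structures': 0.0}
```

Starting from a different level of the hierarchy only changes which nodes are passed as `area`; the Jaccard overlap here is a placeholder for whatever match metric ends up being used.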

We should also be thinking of bringing in a temporal dimension (this again was in our proposal). This might be a different visualization, but we need a way for users to get the sequence of topics/concepts covered in the course. If we are getting courses classified, it would be very useful to see in what sequence they are covered as well.

Some of these you will have to implement and test. Not everything will pan out.

I can easily see a paper on this for next year (comparison piece)!

    == krs

-- Kalpathi Subramanian Ph: 704 687 8579 Associate Professor Email: krs@uncc.edu Dept of Computer Science Web:http://webpages.uncc.edu/krs The University of North Carolina Charlotte, NC 28223-0001

esaule commented 4 years ago

> Beyond that, we could have comparison views that focus on a subtree of the graph and perhaps provide a match metric - this can be looking not just at exact match -- it could integrate the matches in that subgraph and give some kind of measure.

That match metric is what we are going to need pretty soon. Exact match is what we are doing now, and what we have been doing in the past. It is actually pretty dumb but I did not know what else to do.

I talked to Siddharth about it to get an idea of whether there is a standard approach to that. His answer was that nothing directly comes to mind, but he'll think about it.

The fundamental problem is that we have a tree representing an ontology and two documents that are each represented by binary hits on that tree.

If the entries were independent features, we would do a simple count, Jaccard coefficient, cosine similarity, or something like that.
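For reference, here is a small sketch of those flat metrics on binary hits, treating each document as a plain set of node names. The two documents are made up for illustration, and nothing here uses the tree:

```python
import math
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of the two 0/1 indicator vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical documents, each a set of ontology nodes that were hit.
doc_a = {"Sorting", "Graph Search", "Lists"}
doc_b = {"Sorting", "Trees"}

print(len(doc_a & doc_b))    # simple count of shared hits: 1
print(jaccard(doc_a, doc_b)) # 0.25
print(cosine(doc_a, doc_b))  # 1 / sqrt(6), about 0.408
```

Note that "Lists" and "Trees" contribute nothing here even if they sit under the same parent in the ontology.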

But there is information encoded in the tree that these basic metrics do not account for. A simple idea (no idea if it is a good idea): hitting a node propagates to its children. If a node has 10 children, a hit on the parent counts for a 0.1 hit on each child, and a hit on a child counts for a 0.1 hit on the parent. Or something like that. Should mapping to siblings account for something?
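One possible reading of that propagation idea, sketched on a made-up toy tree. The assumptions here are mine: a hit spreads one level only, the share is 1/k where k is the number of children of the parent involved, and the resulting weighted vectors are compared with cosine similarity.

```python
import math
from typing import Dict, List, Set

# Hypothetical toy ontology: parent -> children.
TREE: Dict[str, List[str]] = {
    "root": ["Algorithms", "Data Structures"],
    "Algorithms": ["Sorting", "Graph Search"],
    "Data Structures": ["Lists", "Trees"],
}
PARENT: Dict[str, str] = {c: p for p, cs in TREE.items() for c in cs}


def spread(hits: Set[str]) -> Dict[str, float]:
    """Turn binary hits into weights: 1 on the node, 1/k on each of its k
    children, and 1/k on its parent (k = number of children of that parent)."""
    weights: Dict[str, float] = {}
    for node in hits:
        weights[node] = weights.get(node, 0.0) + 1.0
        children = TREE.get(node, [])
        for child in children:
            weights[child] = weights.get(child, 0.0) + 1.0 / len(children)
        parent = PARENT.get(node)
        if parent is not None:
            share = 1.0 / len(TREE[parent])
            weights[parent] = weights.get(parent, 0.0) + share
    return weights


def weighted_cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse weight vectors."""
    dot = sum(w * v.get(node, 0.0) for node, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two documents that hit sibling nodes: exact match gives 0 overlap.
doc_a = {"Sorting"}
doc_b = {"Graph Search"}
similarity = weighted_cosine(spread(doc_a), spread(doc_b))
print(similarity)  # about 0.2, via the shared parent "Algorithms"
```

In this version siblings end up somewhat similar without any explicit sibling rule, because both push weight onto the same parent. Whether that is enough, or whether siblings deserve their own term, is still open.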

Maybe we should run a personalized PageRank based on the hits, and compute similarity based on the PageRank vectors.
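A rough sketch of what that could look like, under several assumptions of mine: the tree is treated as an undirected graph, the restart distribution is uniform over a document's hits, the damping factor is 0.85, and a fixed number of power iterations is used. The toy tree and documents are made up.

```python
import math
from typing import Dict, List, Set

# Hypothetical toy ontology: parent -> children.
TREE: Dict[str, List[str]] = {
    "root": ["Algorithms", "Data Structures"],
    "Algorithms": ["Sorting", "Graph Search"],
    "Data Structures": ["Lists", "Trees"],
}

# Undirected adjacency: a random walk may move up or down the tree.
NEIGHBORS: Dict[str, List[str]] = {}
for parent, children in TREE.items():
    for child in children:
        NEIGHBORS.setdefault(parent, []).append(child)
        NEIGHBORS.setdefault(child, []).append(parent)


def personalized_pagerank(hits: Set[str], damping: float = 0.85,
                          iterations: int = 50) -> Dict[str, float]:
    """Power iteration for PageRank that restarts uniformly on the hit nodes."""
    if not hits:
        raise ValueError("need at least one hit to personalize on")
    nodes = sorted(NEIGHBORS)
    restart = {n: (1.0 / len(hits) if n in hits else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            share = damping * rank[n] / len(NEIGHBORS[n])
            for m in NEIGHBORS[n]:
                nxt[m] += share
        rank = nxt
    return rank


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two rank vectors over the same node set."""
    dot = sum(u[n] * v[n] for n in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


ppr_sorting = personalized_pagerank({"Sorting"})
ppr_graph = personalized_pagerank({"Graph Search"})
ppr_lists = personalized_pagerank({"Lists"})

sim_siblings = cosine(ppr_sorting, ppr_graph)  # same parent
sim_cousins = cosine(ppr_sorting, ppr_lists)   # different knowledge areas
print(sim_siblings > sim_cousins)  # True
```

On this toy tree, documents hitting siblings score higher than documents hitting nodes in different areas, even though neither pair shares an exact hit. The damping factor controls how far a hit diffuses, so it would need tuning against whatever evaluation we settle on.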

Related question: how are we going to evaluate which metric makes sense?

Later on in the project, we should have enough data that we can probably machine learn the metrics. But right now, we don't. Yet we'll need something.

Then there are extensions of that, such as comparing collections or looking at Bloom levels.