clulab / processors

Natural Language Processors
https://clulab.github.io/processors/

Add visualize to the Inst trait #782

Closed kwalcock closed 4 months ago

kwalcock commented 5 months ago

@navalani in this PR are some things you might consider:

The output is below and one can imagine some structure and see the loop back to 3. One of the MatchTokens is probably for B-FOOD and one for I-FOOD and there should be some way to observe this. How good could this be made to look? A draft is below.

There was an extractor: foods-from-lexicon - Inst: SaveStart(--GLOBAL--)
There was an extractor: foods-from-lexicon (Next) - Inst: 2. MatchToken(org.clulab.odin.impl.EntityConstraint@606fc505) -> 3
There was an extractor: foods-from-lexicon (Next) (Next) - Inst: 3. Split.  Check out my LHS and RHS!
There was an extractor: foods-from-lexicon (Next) (Next) (LHS) - Inst: 4. MatchToken(org.clulab.odin.impl.EntityConstraint@5fa05212) -> 3
There was an extractor: foods-from-lexicon (Next) (Next) (RHS) - Inst: Pass
There was an extractor: foods-from-lexicon (Next) (Next) (RHS) (Next) - Inst: SaveEnd(--GLOBAL--)
There was an extractor: foods-from-lexicon (Next) (Next) (RHS) (Next) (Next) - Inst: Done

It might be

TokenExtractor for rule "foods-from-lexicon":

1. SaveStart(--GLOBAL--) -> 2
2. MatchToken(B-FOOD) -> 3
3. Split -> 4, 5
    4. MatchToken(I-FOOD) -> 3
    5. Pass -> 6
    6. SaveEnd(--GLOBAL--) -> 0
    0. Inst: Done

The numbers are fabricated. Maybe Done should be at the top and 1 indicated as entrypoint.
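
One way to get from the first listing to the draft is to let every instruction render itself as a single line: its number, a readable label, and the numbers of its successors. The following is only a toy sketch and not the Odin source. The case classes, the `id` field, and the `label` argument (standing in for a readable form of the constraint instead of `EntityConstraint@606fc505`) are all assumptions made for illustration, and the numbers are the fabricated ones from the draft.

```scala
object InstSketch {
  // Toy stand-ins for the Inst subclasses; all names and fields here are hypothetical.
  // Successors are stored as numbers so the loop 4 -> 3 needs no cyclic object graph.
  sealed trait Inst {
    def id: Int
    def visualize: String
  }
  case class SaveStart(id: Int, name: String, next: Int) extends Inst {
    def visualize: String = s"${id}. SaveStart(${name}) -> ${next}"
  }
  case class MatchToken(id: Int, label: String, next: Int) extends Inst {
    def visualize: String = s"${id}. MatchToken(${label}) -> ${next}"
  }
  case class Split(id: Int, lhs: Int, rhs: Int) extends Inst {
    def visualize: String = s"${id}. Split -> ${lhs}, ${rhs}"
  }
  case class Pass(id: Int, next: Int) extends Inst {
    def visualize: String = s"${id}. Pass -> ${next}"
  }
  case class SaveEnd(id: Int, name: String, next: Int) extends Inst {
    def visualize: String = s"${id}. SaveEnd(${name}) -> ${next}"
  }
  case class Done(id: Int) extends Inst {
    def visualize: String = s"${id}. Done"
  }

  // The foods-from-lexicon program from the draft above.
  val program: Seq[Inst] = Seq(
    SaveStart(1, "--GLOBAL--", 2),
    MatchToken(2, "B-FOOD", 3),
    Split(3, 4, 5),
    MatchToken(4, "I-FOOD", 3),
    Pass(5, 6),
    SaveEnd(6, "--GLOBAL--", 0),
    Done(0)
  )

  // One line per instruction, in program order.
  def listing: String = program.map(_.visualize).mkString("\n")
}
```

Calling `println(InstSketch.listing)` prints one line per instruction. Indentation of the Split branches is left out here on purpose.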

kwalcock commented 5 months ago

@navalani, these couple of additional rules are sufficient to provide simple examples of objects of the other Inst subclasses.

navalani commented 5 months ago

Thanks Keith, I will modify the drawing to include those subclasses

kwalcock commented 5 months ago

@navalani, hopefully the computer can soon make the visualizations itself.

navalani commented 4 months ago

Hey Keith,

I noticed in the code that we are currently visualizing the part of the rule that is matched, i.e., (B-Food), (I-Food), etc. Do we also need to visualize which part of the sentence matches, or is next.getPosId sufficient?

Nick

On Fri, Mar 8, 2024 at 9:50 PM Keith Alcock wrote:

Merged #782 https://github.com/clulab/processors/pull/782 into nick-avalani/odin-debugger.

kwalcock commented 4 months ago

Hi,

The first step is to show the B-Food or I-Food in a "visualization" so that the user can understand which Inst is in question. At this point we are still not matching against any text like in a sentence. The step after that is to show what the B-Food or I-Food turned out to match (or failed to match) in the sentence like the cake, pain au chocolat (or John eats).

Keith

navalani commented 4 months ago

I see. Thanks for the clarification. Will work on that right now.

Nick

navalani commented 4 months ago

Do you have any suggestions on where to look for what matches the rule? For example, for MatchToken, I traced down the classes that extend TokenConstraint and StringMatcher but to no avail. They only hold [B-Food] and/or [I-Food] and not where the rule matches.

Nick

kwalcock commented 4 months ago

@navalani, you will eventually need the tok field of the Thread trait. I believe this keeps track of what token (of a sentence) is being matched against which Inst (of the rule). We might eventually keep a list of tokens that did or didn't match each Inst so that the user can verify that their rule is working as expected. For the moment we're still wanting to make sure the user and computer understand the rule in the same way based on the visualization of the Insts before they even start working. One thing that we're debugging is an incorrectly stated rule that doesn't mean what the rule writer thinks it does. We're more or less showing the assembly language that their code (rule) produced with enough clues so that they can verify it or at least do a sanity check.

navalani commented 4 months ago

Understood. Thanks

navalani commented 4 months ago

I just pushed some code that adds a visualization showing what B-Food and B-Per match in MatchToken. Before I add visualizations for the other Inst classes, I wanted to make sure: the tok field in Thread gives the index of the token in the sentence, right? For example, tok = 0 in "John eats cake" refers to John. The visualization I pushed works with that understanding.

Nick

kwalcock commented 4 months ago

@navalani, it's cool to see the "B-FOOD matches cake". That will give the user some indication of what is happening. The new code in ThompsonVM.stepSingleThread will probably become relevant soon. Things happening in that area are probably more complicated than OdinStarter would lead one to believe. Try the sentence "John eats cake and Jane eats pain au chocolat." for a preview of the complications. That B-FOOD will end up matching pain and the record of cake will have been lost. The val matchTokens = new HashMap[String, Int] will need to be a more complicated mechanism for recordkeeping. It will additionally need to keep track of the sentence, because the one that is being worked on in stepSingleThread is not necessarily the one that is being used to make the visualization in OdinStarter (i.e., mentions.head.sentenceObj). However, that's a worry for later.

In the very short term, can you adjust the recursion in OdinStarter so that the output does not keep repeating the name of the extractor (rule)? Right now it is

There was an extractor: foods-from-lexicon - Inst: 1. SaveStart(--GLOBAL--)
1. SaveStart(--GLOBAL--)
There was an extractor: foods-from-lexicon (Next) - Inst: 2. MatchToken(B-FOOD matches pain) -> 3
2. MatchToken(B-FOOD matches pain) -> 3
There was an extractor: foods-from-lexicon (Next) (Next) - Inst: 3. Split.  Check out my LHS and RHS!
There was an extractor: foods-from-lexicon (Next) (Next) (LHS) - Inst: 4. MatchToken(I-FOOD matches chocolat) -> 3
4. MatchToken(I-FOOD matches chocolat) -> 3
There was an extractor: foods-from-lexicon (Next) (Next) (RHS) - Inst: 5. Pass
There was an extractor: foods-from-lexicon (Next) (Next) (RHS) (Next) - Inst: 6. SaveEnd(--GLOBAL--)
6. SaveEnd(--GLOBAL--)
There was an extractor: foods-from-lexicon (Next) (Next) (RHS) (Next) (Next) - Inst: 0. Done

but it would be much more understandable like

There was an extractor: foods-from-lexicon
1. SaveStart(--GLOBAL--)
2. MatchToken(B-FOOD matches pain)
3. Split.  Check out my LHS and RHS!
    (LHS) 4. MatchToken(I-FOOD matches chocolat) -> 3
    (RHS) 5. Pass
    6. SaveEnd(--GLOBAL--)
    0. Done

There was an extractor: person-from-lexicon
1. SaveStart(--GLOBAL--)
etc.

or something similar. Numbering, spacing, reference to other numbers would all be very useful.
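
A listing of that shape can be sketched with a walk that prints the extractor name once, indents under each Split, and stops when it reaches an instruction it has already listed. This is a minimal sketch under stated assumptions, not the OdinStarter code: `Step`, `Split`, `render`, and the numbering are hypothetical, and successors are referenced by number so the loop from 4 back to 3 cannot recurse forever.

```scala
import scala.collection.mutable

object ListingSketch {
  // Hypothetical, simplified instruction model (not Odin's classes).
  sealed trait Inst { def id: Int }
  case class Step(id: Int, text: String, next: Option[Int]) extends Inst
  case class Split(id: Int, lhs: Int, rhs: Int) extends Inst

  def render(name: String, insts: Seq[Inst], entry: Int): String = {
    val byId = insts.map(inst => inst.id -> inst).toMap
    val visited = mutable.Set.empty[Int]
    // The extractor name is emitted exactly once, as a header.
    val lines = mutable.ArrayBuffer(s"There was an extractor: ${name}")

    def walk(id: Int, depth: Int, tag: String): Unit = {
      visited += id
      val indent = "    " * depth
      byId(id) match {
        case Step(_, text, next) =>
          // Show the target only when it jumps back to an instruction already listed.
          val back = next.filter(visited.contains).map(n => s" -> ${n}").getOrElse("")
          lines += s"${indent}${tag}${id}. ${text}${back}"
          next.filterNot(visited.contains).foreach(walk(_, depth, ""))
        case Split(_, lhs, rhs) =>
          lines += s"${indent}${tag}${id}. Split"
          if (!visited(lhs)) walk(lhs, depth + 1, "(LHS) ")
          if (!visited(rhs)) walk(rhs, depth + 1, "(RHS) ")
      }
    }

    walk(entry, 0, "")
    lines.mkString("\n")
  }

  // The foods-from-lexicon program with illustrative numbers.
  val foods: Seq[Inst] = Seq(
    Step(1, "SaveStart(--GLOBAL--)", Some(2)),
    Step(2, "MatchToken(B-FOOD)", Some(3)),
    Split(3, 4, 5),
    Step(4, "MatchToken(I-FOOD)", Some(3)),
    Step(5, "Pass", Some(6)),
    Step(6, "SaveEnd(--GLOBAL--)", Some(0)),
    Step(0, "Done", None)
  )
}
```

`println(ListingSketch.render("foods-from-lexicon", ListingSketch.foods, 1))` gives one header line followed by seven instruction lines, with the LHS and RHS branches indented under the Split. The visited set is also what keeps every number and the final Done in the output.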

navalani commented 4 months ago

Just pushed some code that has a more understandable visualization. I tried "John eats cake and Jane eats pain au chocolat." and I see what you mean. The record of cake is lost. Let me try using a HashMap with an array of Integers as the value.

Nick
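
The difference between the two kinds of record can be shown in a few lines. This is a small sketch, not the code in ThompsonVM: the `record` helper and the object name are hypothetical, and the token indices are simply the word positions in the example sentence after splitting on spaces.

```scala
import scala.collection.mutable

object MatchRecordSketch {
  // Single-valued record: a later match for the same label overwrites the earlier one.
  val single = mutable.HashMap.empty[String, Int]
  // Multi-valued record: every matched token index is kept, in the order seen.
  val multi = mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]

  def record(label: String, tok: Int): Unit = {
    single(label) = tok
    multi.getOrElseUpdate(label, mutable.ArrayBuffer.empty[Int]) += tok
  }

  val words: Array[String] =
    "John eats cake and Jane eats pain au chocolat .".split(" ")

  // B-FOOD matches cake (index 2) and later pain (index 6).
  record("B-FOOD", 2)
  record("B-FOOD", 6)
}
```

After the two calls, `single("B-FOOD")` holds only the index of pain, while `multi("B-FOOD")` still holds both indices. The sketch does not yet say which sentence each index belongs to, which is the remaining concern raised above.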

kwalcock commented 4 months ago

@navalani, the updated formatting looks much better (IMO). A novice rule writer can probably learn fairly quickly to follow it. I do notice that some consecutive numbers are missing and that the Done has disappeared, so some logic is not yet quite right. If this were less experimental, unit tests would be written to vouch for its correctness, but for now it's just eyeballs.

If you are already looking at the HashMap that is stored in the ThompsonVM, I would suggest moving it to the Debugger class instead. It would also be good to take another look at #769. You might notice that each Inst gets run/processed/visited numerous times. I see debugDoc, debugLoop, debugExtractor, debugSentence, debugTokenInst. These Doc, Loop, Extractor, Sentence, TokenInst form a context in which an Inst might match a token or not. I am leaning towards making something like that essentially the key to your HashMap and performing queries on it. It would be a mini database. After that extractorEngine.extractFrom(document) in OdinStarter, one would say to the Debugger, possibly for the sake of the visualization, but maybe more generally, "You just extracted from a document. I'm interested in that 5th sentence. There's that foods-from-lexicon rule that didn't seem to work right. The B-FOOD Inst is acting up. Which tokens did it match?" This is similar to what happens in

  for (extractor <- extractors)
    visualize(extractor, sentence)

but for a more complex and realistic situation. There's more to say about it, perhaps at a meeting. @MihaiSurdeanu will likely have an opinion based on experience (while I'm just guessing :-)
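
The "mini database" idea might look roughly like the following. It is only a sketch under assumptions and not the Debugger class from #769: `Context`, `Event`, `record`, and `matchedTokens` are hypothetical names, and the Inst is identified by a plain string for simplicity.

```scala
import scala.collection.mutable

object DebuggerSketch {
  // Hypothetical key mirroring the Doc, Loop, Extractor, Sentence, TokenInst nesting.
  case class Context(doc: String, loop: Int, extractor: String, sentence: Int, inst: String)
  // One attempt to match a token in a given context, successful or not.
  case class Event(context: Context, tok: Int, matched: Boolean)

  class Debugger {
    private val events = mutable.ArrayBuffer.empty[Event]

    def record(context: Context, tok: Int, matched: Boolean): Unit =
      events += Event(context, tok, matched)

    // "Which tokens did this Inst of this rule match in this sentence?"
    // Repeated visits of the same Inst (e.g. in later loops) are collapsed.
    def matchedTokens(extractor: String, sentence: Int, inst: String): Seq[Int] =
      events.collect {
        case Event(c, tok, true)
            if c.extractor == extractor && c.sentence == sentence && c.inst == inst => tok
      }.distinct.toList
  }
}
```

With events recorded during extraction, a query such as `debugger.matchedTokens("foods-from-lexicon", 4, "B-FOOD")` would answer the question about the 5th sentence (index 4). Keeping failed matches in the same store leaves room for a corresponding "which tokens did it fail on?" query.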