evt-project / evt-viewer-angular

Edition Visualization Technology version 3
GNU Affero General Public License v3.0

Non-sequential parsing of TEI elements #234

Open RobertoRDT opened 7 months ago

RobertoRDT commented 7 months ago

At the time of writing, EVT 3 parsing of TEI elements is sequential, element by element, which makes it problematic to refer to elements that are at a later position than the currently parsed element. For example, to avoid the problem of data redundancy in TEI encoding for authorial philology, a solution based on the copyof attribute would be very effective:

This would also be a very useful mechanism in many other situations, e.g. to put together text based on portions of other separate documents and other stand-off use cases. However, since EVT 3 proceeds sequentially, when it encounters <del copyof="#MS_12_23"/> it would not yet have parsed the contents of <mod change="strato-0" xml:id="MS_12_23"> (authorial philology encoding proceeds in reverse chronological order, most recent text first) and thus that node would not yet exist. Another currently impossible use case: retrieving the content of a footnote in the <back>, or of a bibliographic entry also in the <back>.

Possible solutions: 1) a second round of parsing that stores only a reference to the external element; this would increase start-up time if the functionality were used intensively, but should not otherwise have a particular impact on performance; 2) on-demand parsing of the requested node when it has not yet been encountered, storing a copy of it in the referring element's own data structure (and thus incurring duplication and memory costs).
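To make option 1 more concrete, here is a minimal sketch of a two-pass approach. It is not EVT code: the `RawNode` and `ParsedNode` shapes and the function names are hypothetical stand-ins for EVT's real XML and model types. The first pass parses sequentially and indexes nodes by their `xml:id`; the second pass attaches a reference (not a copy) to each node that carries a pointer.

```typescript
// Hypothetical simplified shapes; EVT's real XMLElement and parsed types differ.
interface RawNode {
  tag: string;
  id?: string;      // value of xml:id
  copyOf?: string;  // pointer such as "#MS_12_23"
  text?: string;
  children?: RawNode[];
}

interface ParsedNode {
  tag: string;
  id: string | undefined;
  text: string;
  children: ParsedNode[];
  copyOfRef: string | undefined;     // pointer recorded during pass 1
  resolved: ParsedNode | undefined;  // filled in during pass 2 (a reference, not a copy)
}

// Pass 1: ordinary sequential parse that also indexes every node with an id.
function firstPass(raw: RawNode, index: Map<string, ParsedNode>): ParsedNode {
  const node: ParsedNode = {
    tag: raw.tag,
    id: raw.id,
    text: raw.text ?? '',
    children: [],
    copyOfRef: raw.copyOf,
    resolved: undefined,
  };
  if (raw.id) {
    index.set(raw.id, node);
  }
  node.children = (raw.children ?? []).map((child) => firstPass(child, index));
  return node;
}

// Pass 2: walk the parsed tree and resolve pointers against the index.
function secondPass(node: ParsedNode, index: Map<string, ParsedNode>): void {
  if (node.copyOfRef) {
    node.resolved = index.get(node.copyOfRef.replace(/^#/, ''));
  }
  node.children.forEach((child) => secondPass(child, index));
}

function parseTwoPass(root: RawNode): ParsedNode {
  const index = new Map<string, ParsedNode>();
  const parsed = firstPass(root, index);
  secondPass(parsed, index);
  return parsed;
}

// The <del> comes first and points at a <mod> that appears later in the document.
const sampleDoc: RawNode = {
  tag: 'body',
  children: [
    { tag: 'del', copyOf: '#MS_12_23' },
    { tag: 'mod', id: 'MS_12_23', text: 'earlier layer' },
  ],
};
const parsedDoc = parseTwoPass(sampleDoc);
```

In this sketch the extra cost is one additional tree walk, and the forward reference ends up pointing at the same object as the later node, so no content is duplicated in the model.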

laurelled commented 4 months ago

I've had the time to analyse the complexity of implementing a second round of parsing, based on the current situation. For those who lack the time to read thoroughly: in short, my opinion is that implementing it would require a huge effort in terms of refactoring the existing code. We should discuss together which road to take.

I'll be redundant for the sake of clarity. Currently, most of the parsing is done by specific classes that implement the Parser interface, which requires implementing the parse method. I'll paste one of the classes here for reference:

@xmlParser('sic', SicParser)
export class SicParser extends EmptyParser implements Parser<XMLElement> {
    attributeParser = createParser(AttributeParser, this.genericParse);
    parse(xml: XMLElement): Sic {
        const attributes = this.attributeParser.parse(xml);
        const { type } = attributes;

        return {
            type: Sic,
            sicType: type || '',
            class: getClass(xml),
            content: parseChildren(xml, this.genericParse),
            attributes,
        };
    }
}

There's also the @xmlParser('sic', SicParser) decorator, which comes in pretty handy because it maps the tag name to its parser. The combination of the interface implementation and the mapping is great for a generic double-pass parsing. In fact, we would just need to get the tag name, retrieve the corresponding parser class and call the parse method, which all of them have.
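As a rough illustration of that lookup, here is a minimal sketch of a tag-name-to-parser registry. It uses a plain `registerParser` function instead of the real `@xmlParser` decorator, and all names and types in it (`XmlNode`, `Parsed`, `NodeParser`, `parserRegistry`, `fallbackParser`) are hypothetical simplifications, not EVT's actual API.

```typescript
// Hypothetical simplified types, not EVT's real XMLElement or model classes.
interface XmlNode {
  tagName: string;
  attributes: Record<string, string>;
  children: XmlNode[];
}

interface Parsed {
  type: string;
  content: Parsed[];
  attributes: Record<string, string>;
}

interface NodeParser {
  parse(xml: XmlNode): Parsed;
}

// Registry mapping a tag name to the parser responsible for it.
const parserRegistry = new Map<string, NodeParser>();

function registerParser(tagName: string, parser: NodeParser): void {
  parserRegistry.set(tagName, parser);
}

// Fallback for tags without a dedicated parser: adds no specific logic.
const fallbackParser: NodeParser = {
  parse: (xml) => ({
    type: 'Generic',
    content: xml.children.map((child) => genericParse(child)),
    attributes: xml.attributes,
  }),
};

// Generic entry point: look up the parser by tag name and delegate to it.
function genericParse(xml: XmlNode): Parsed {
  const parser = parserRegistry.get(xml.tagName) ?? fallbackParser;
  return parser.parse(xml);
}

registerParser('sic', {
  parse: (xml) => ({
    type: 'Sic',
    content: xml.children.map((child) => genericParse(child)),
    attributes: xml.attributes,
  }),
});

const sicResult = genericParse({ tagName: 'sic', attributes: { type: 'typo' }, children: [] });
const unknownResult = genericParse({ tagName: 'foo', attributes: {}, children: [] });
```

Because every parser is reached through the same entry point, a second pass could in principle call `genericParse` again on any node, which is what makes the uniform cases easy and the non-uniform ones the real obstacle.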

However, some parsing is not done following this method. For example:

Refactoring those cases is not straightforward and would require thorough discussion with the other devs, and from my understanding right now is not really a good time for that. We should still discuss how to proceed, as it's a pretty important feature for the stand-off apparatus implementation.

RobertoRDT commented 4 months ago

[A summary of all subsequent comments so that they aren't lost in Slack]

RobertoRDT As you noticed, this is pretty crucial for the stand-off markup processing needed to support DEPA, so we should try to find a (perhaps not perfect) solution right now.

Lorenzo Bafunno So, the main topics that would start a discussion would be:

@Davide I. Cucurnia got around it by duplicating some elements, the implementation can be found in analogue-parser.ts

RobertoRDT Could you parse specific stand-off elements (<listApp> f.i., but also <div> elements in the <back> with notes, hotspots etc.) before all the other ones, so that the necessary information is already available when you get to the inline markup needing it?

Because, if I got it right, the problem is that when I have a list in the <back> and something connected to that list in the <body>, when EVT parses the <body> it still lacks the relevant info available in the <back>. Sorry if this doesn't make much sense :sweat_smile:

@Davide I. Cucurnia Yes, as briefly addressed weeks ago, the other solution is on-demand parsing of the required (i.e. referred) element when needed, similar to what's done in the analogue and source parsers. Memory-wise, however, a copy of the referred element is needed in order to show it on the page, so we cannot avoid it. The double-round parsing solution would also inflate the boot time of the app...

"Could you parse specific stand-off elements (<listApp> f.i., but also <div> elements in the <back> with notes, hotspots etc.) before all the other ones, so that the necessary information is already available when you get to the inline markup needing it?"

Yes, this solution sounds similar to what is currently developed in the analogue and source parsers, which is on-demand parsing.
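For readers unfamiliar with the idea, on-demand parsing can be sketched roughly as below. This is not the code in analogue-parser.ts, which is not reproduced here; `SourceNode`, `ParsedRef` and `createOnDemandResolver` are hypothetical names, and the "parsing" is reduced to copying two fields. The point is only that the referred element is parsed the first time it is requested and the resulting copy is cached.

```typescript
// Hypothetical shapes: a raw element that can be looked up by xml:id, and its parsed copy.
interface SourceNode {
  id: string;
  tag: string;
  text: string;
}

interface ParsedRef {
  tag: string;
  text: string;
}

function createOnDemandResolver(sourceById: Map<string, SourceNode>) {
  const cache = new Map<string, ParsedRef>();
  let parseCount = 0;

  return {
    // Parse the referred element on first request, then reuse the cached copy.
    resolve(pointer: string): ParsedRef | undefined {
      const id = pointer.startsWith('#') ? pointer.slice(1) : pointer;
      const cached = cache.get(id);
      if (cached) {
        return cached;
      }
      const source = sourceById.get(id);
      if (!source) {
        return undefined; // dangling pointer: nothing to parse
      }
      parseCount += 1;
      const parsed: ParsedRef = { tag: source.tag, text: source.text };
      cache.set(id, parsed);
      return parsed;
    },
    getParseCount: (): number => parseCount,
  };
}

const backNotes = new Map<string, SourceNode>([
  ['n1', { id: 'n1', tag: 'note', text: 'A footnote stored in <back>.' }],
]);
const resolver = createOnDemandResolver(backNotes);
const firstLookup = resolver.resolve('#n1');
const secondLookup = resolver.resolve('#n1');
```

The cache limits the cost to one parsed copy per referred element, but as noted above that copy still lives in memory alongside whatever the regular parser produces for the same element.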

RobertoRDT Would it be safe to say that if I put everything in the <standOff> element before <text> there would be little need to copy the referred elements in memory? As an interim solution to give us time to look for a proper fix.

https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-standOff.html

As you can see in this example <standOff> can precede the main <text>: https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-standOff.html#index-egXML-d54e131382

Lorenzo Bafunno Thank you for the link, I didn't understand at first. Still, I don't think that would change the situation. A copy of the referred element is still needed. The problem is that when a parser class parses the XML/TEI code, it is given no context about what has been parsed previously, nor does it know what is yet to be parsed.

I'll try to provide an example

RobertoRDT OK, that means I got it wrong, because I thought the problem arises when the referred element is further down in the TEI document.

Lorenzo Bafunno So, currently, from my limited understanding (Davide I. Cucurnia, correct me if I'm wrong), every time we want to extract the information needed for a component, we follow these steps:

Ideally, we could retrieve data from both the <body> and the <back> and merge it together in one structure. That's no problem (though easier said than done). But that creates duplication. An example would be this:

<body>
 <w xml:id="w1">hello</w>
 <w xml:id="w2">world</w>
</body>
<back>
 <app from="w1"> <rdg wit="#c">hallo</rdg> </app>
</back>

Let's imagine we store a combination of <w> and the corresponding app and let's call the resulting structure A. That would be great for the DEPA. But there is already a parser that provides information about the <app>, so the information stored in the type A is redundant and takes up "useless" memory.

Sorry if it took me that long but I needed time to make it clear, hope it makes sense :sweat_smile:

Andrew Forsberg

Let’s imagine we store a combination of <w> and the corresponding app and let’s call the resulting structure A. That would be great for the DEPA. But there is already a parser that provides information about the <app>, so the information stored in the type A is redundant and takes up “useless” memory @Lorenzo Bafunno

Apologies in advance, as this might be an overly naive suggestion, but just in case: would an initial parse of <back> to create a unique set of from values help at all? That set could be used with a hasBackRef() helper function. Then, during the full first (and only) parse, a quick and almost free check on that set would identify whether there were elements in <back> that needed to be taken into account. (NB: I haven’t had a chance to check @Davide I. Cucurnia’s analogue-parser.ts yet. That’s next :slightly_smiling_face:)

ajf-ajf commented 4 months ago

Thanks @Lorenzo Bafunno (@laurelled), for clarifying so many details on Tuesday. I’m still not proposing this as a ‘sure fire’ solution, but it might work as an interim patch of sorts. The idea is:

  1. Identify the app node and scan it first.
  2. Create a simple set that contains only the unique from attribute values.
  3. When parsing a node in the document, use a helper check function to test whether the node's identifier exists in the set.
     a. If it does, collect and parse whatever’s needed from the app node.
     b. If it doesn’t, onwards and upwards, we’re done here.

It’s not great, and it means another sort-of global/config type variable hanging around. On the other hand, as an interim solution — it’s only a set containing a few strings. The check against it should be very cheap in terms of resources. Anyhow, that’s the basic idea. Any thoughts?
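The steps above can be sketched as follows, using the `<w>` / `<app>` example from earlier in the thread. Only the name `hasBackRef` comes from the discussion; the types (`WordNode`, `BackApp`, `ParsedWord`) and the other function names are hypothetical, and the real implementation would work on XML nodes rather than plain objects.

```typescript
// Hypothetical simplified shapes for <w> elements in <body> and <app> elements in <back>.
interface WordNode {
  id: string;
  text: string;
}

interface BackApp {
  from: string;       // pointer to the word, with or without a leading '#'
  readings: string[]; // text of the <rdg> children
}

interface ParsedWord {
  id: string;
  text: string;
  readings: string[];
}

const stripHash = (pointer: string): string =>
  pointer.startsWith('#') ? pointer.slice(1) : pointer;

// Steps 1 and 2: scan the apparatus first and keep only the unique `from` targets.
function collectBackRefs(apps: BackApp[]): Set<string> {
  return new Set(apps.map((app) => stripHash(app.from)));
}

// Step 3: cheap membership test used while parsing the body.
function hasBackRef(backRefs: Set<string>, id: string): boolean {
  return backRefs.has(id);
}

function parseBody(words: WordNode[], apps: BackApp[]): ParsedWord[] {
  const backRefs = collectBackRefs(apps);
  return words.map((word) => {
    const readings: string[] = [];
    if (hasBackRef(backRefs, word.id)) {
      // 3a: only now look at the apparatus entries that point to this word.
      for (const app of apps) {
        if (stripHash(app.from) === word.id) {
          readings.push(...app.readings);
        }
      }
    }
    // 3b: otherwise the word is parsed as usual, with no readings attached.
    return { id: word.id, text: word.text, readings };
  });
}

const bodyWords: WordNode[] = [
  { id: 'w1', text: 'hello' },
  { id: 'w2', text: 'world' },
];
const backApps: BackApp[] = [{ from: 'w1', readings: ['hallo'] }];
const parsedWords = parseBody(bodyWords, backApps);
```

In this sketch the set lookup is constant time, so words without an apparatus entry pay almost nothing; the linear scan over `apps` in step 3a is only a placeholder and could be replaced by a map keyed on the target id.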

laurelled commented 3 months ago

As I found out over these past weeks, Andrew's solution is easy to implement. However, another problem has arisen: how the parsing and the subsequent visualization are done.

I'll paste the information found in the parsing wiki:

After reading the source file indicated in the proper configuration parameter, EVT parses the structure of the edition. At the moment, everything is based on pages (this will probably change when we add support for critical editions and pageless editions). A page is identified as the list of XML elements included between a <pb/> and the next one (or between a <pb/> and the end of the node containing the main text, which is the <body> in the case of the last page). Each page is represented in the EVT Model as a Page:

interface Page {
  id: string;
  label: string;
  originalContent: OriginalEncodingNodeType[];
  parsedContent: Array<ParseResult<GenericElement>>;
}

The content of each page is therefore represented as an array of objects retrieved by parsing the original XML nodes. After parsing the structure, for each page identified, we then proceed to parse all the child nodes, by calling the parse method of the GenericParserService. Parsers are defined in a map that associates a parser with each supported tagName. This map is retrieved by the generic parsing function, which chooses the right parser based on the node type and its tagName. If a tag does not match a specific parser, the ElementParser, which does not add any logic to the parsing results, is used. Tags and parsers are grouped by the TEI module they belong to.
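The page-splitting rule quoted above can be illustrated with a small sketch. It assumes, for simplicity, that the main text has already been flattened into a list of sibling nodes, which is not how real TEI documents are necessarily structured; `FlatNode`, `PageSketch` and `splitIntoPages` are hypothetical names, and using the `<pb/>` id as the page label is an assumption made only for this example.

```typescript
// Hypothetical flattened view of the nodes inside the main text.
interface FlatNode {
  tagName: string;
  id?: string;
  text?: string;
}

// Reduced version of the Page model: only the fields needed for the split.
interface PageSketch {
  id: string;
  label: string;
  originalContent: FlatNode[];
}

// A page is everything between one <pb/> and the next (or the end of the list).
function splitIntoPages(nodes: FlatNode[]): PageSketch[] {
  const pages: PageSketch[] = [];
  let current: PageSketch | undefined;

  for (const node of nodes) {
    if (node.tagName === 'pb') {
      const id = node.id ?? `page_${pages.length + 1}`;
      current = { id, label: id, originalContent: [] };
      pages.push(current);
    } else if (current) {
      current.originalContent.push(node);
    }
    // Nodes that precede the first <pb/> are ignored in this sketch.
  }
  return pages;
}

const flatBody: FlatNode[] = [
  { tagName: 'pb', id: 'p1' },
  { tagName: 'w', text: 'hello' },
  { tagName: 'w', text: 'world' },
  { tagName: 'pb', id: 'p2' },
  { tagName: 'w', text: 'again' },
];
const pageSketches = splitIntoPages(flatBody);
```

Each resulting page would then have its `originalContent` handed to the generic parser node by node, which is why a parser sees only its own node and not elements on other pages or in the `<back>`.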

That's great, except for the fact that each resulting type is associated with a component through the ContentViewerComponent in a very general way:

This is a dynamic component that takes a ParsedElement as input and establishes which component to use for displaying this data based on the type indicated in the type property. This type is used to manage the component register, to be accessed for dynamic compilation, and also the type of data that the component in question receives as input

The content viewer takes the Page parsedContent attribute and, for each element of the array, associates it with a component and displays it. But with DEPA this way of handling things causes problems, because there can be apparatuses that overlap each other: (i) they wouldn't be separate entities, and (ii) I can't surround them with a new parent tag without ruining the XML encoding.

The only solution I can think of is creating a fake surrounding tag to handle it in a specific component, "imitating" the way EVT2 handled it.