catalpa-cl / inceptalytics

An easy-to-use API for analyzing INCEpTION annotation projects.

Selection of a multi-sentence tag #21

Open yanirmr opened 2 years ago

yanirmr commented 2 years ago

Hi,

My objective was to select tags that could be applied to more than one sentence. I expected the returned data to cover the entire range of each tag. However, it appears that only the first sentence within the tagged range is returned.

The following code demonstrates the problem:

feature_path = f"{layer_name}>{feature_name}"
annotation_view = project.select(annotation=feature_path)
annotation_df = annotation_view.data_frame
annotation_df.head()
simulacrum6 commented 2 years ago

Hi,

Could you provide an example that demonstrates the use case at the annotation level? That is, two sentences, the span, and the name of the tag. I just want to make sure I understand the problem correctly.

If you don't need the sentence annotations, a workaround would be to read your project documents as "one sentence per line" when importing them in INCEpTION. In this way, your entire document receives a single sentence annotation and you could annotate across sentence boundaries without running into the problem you were describing.

yanirmr commented 2 years ago

Yes, of course. I appreciate your assistance.

There are some linguistic phenomena that extend beyond the scope of a single sentence. As an example, one of the phenomena I am interested in is lists. In this case, if there is a list in the text, I mark the range of the list. As a result, I would like to extract the entire list's text.

Another linguistic phenomenon (which I am not working on at the moment, but it does exist) is rhetorical analysis. If you tag "explanation", "contradiction", and other rhetorical moves, the tagged span may be longer than one sentence.

Regarding your last suggestion, ignoring the sentence split: it is a misleading option that causes a lot of performance issues in INCEpTION. This is something I try to avoid.

simulacrum6 commented 2 years ago

Alright, I see.

We are currently thinking about redesigning the internal representation for annotations. This will be a larger change and extend the types of annotation projects you will be able to handle with INCEpTALYTICS. When we first created the library, we designed it to work well with the primary use cases that we are dealing with. Now that people are using the library and finding value in it, we are trying to accommodate a wider range of use-cases. This will require larger changes and thus development time.

For now, I will add an issue to add a compatibility chart, so that users are aware which type of annotation projects they can use the library for until we address the current shortcomings.

As for your specific problem, there are two additional workarounds that might resolve your issue.

The first one is to use the grouped_by parameter provided by some methods of the View API. If you set it to ['begin', 'end', 'annotator'], source file and sentence boundaries are disregarded and only spans and annotators are used to distinguish annotations. The second one is to create a custom View based on a reindexed annotation DataFrame. The code to achieve this would look something like the following:

from inceptalytics.analytics import View

view = ... # the View of your multi-sentence annotations
dangerous_view = View(
    annotations = view.data_frame.set_index(['begin', 'end', 'annotator']),
    project = view.project,
    layer_name = view.layer_name,
    feature_name = view.feature_name
)

As mentioned, both of these workarounds might not work in your case or might break unexpectedly, so use them at your own risk (if you use them).
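
To make the idea behind both workarounds more concrete, here is a minimal sketch in plain Python rather than against the INCEpTALYTICS API. It assumes hypothetical annotation records with `source_file`, `sentence`, `begin`, `end`, `annotator`, and `tag` fields. Only `begin`, `end`, and `annotator` are named above; the other fields and all of the data are made up for illustration and do not reflect what the library actually produces. The sketch shows how keying on `(begin, end, annotator)` alone ignores file and sentence information.

```python
from collections import defaultdict

# Hypothetical toy records (an assumption, not real library output):
# one span recorded against two sentences, plus an unrelated span.
records = [
    {"source_file": "doc1.txt", "sentence": 0, "begin": 10, "end": 95,
     "annotator": "anna", "tag": "list"},
    {"source_file": "doc1.txt", "sentence": 1, "begin": 10, "end": 95,
     "annotator": "anna", "tag": "list"},
    {"source_file": "doc1.txt", "sentence": 2, "begin": 120, "end": 140,
     "annotator": "anna", "tag": "list"},
]


def group_by_span(rows):
    """Group rows by (begin, end, annotator), ignoring file and sentence."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["begin"], row["end"], row["annotator"])
        grouped[key].append(row)
    return dict(grouped)


spans = group_by_span(records)
for key, rows in spans.items():
    # Each key is one span; the list shows which sentences it was recorded for.
    print(key, [row["sentence"] for row in rows])
```

Because the source file is no longer part of the key, two spans with identical offsets and annotator in different documents would end up in the same group. That is one reason to treat this kind of reindexing with care.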

yanirmr commented 1 year ago

Attached you will find two examples of cross-reference tags. The spans we get when we select them are only those contained within a single sentence, not the ones that cross sentence boundaries.

It would be appreciated if you could assist us in this type of situation.

test.zip

reckart commented 1 year ago

Maybe it is just a matter of looking into how to use DKPro Cassis select to extract data from the CAS objects and put it into a Pandas frame? I guess inceptalytics has a default way of doing that, but I would assume it shouldn't be difficult to write custom code that uses inceptalytics to iterate over the data in a project but does the transformation in a different way?
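
As a rough illustration of such a custom transformation, the sketch below works on plain character offsets rather than on CAS objects. It assumes that the document text, the sentence offsets, and the span offsets have already been read out of the CAS (for instance via DKPro Cassis). The helper names `covered_text` and `overlapping_sentences` are hypothetical and not part of inceptalytics or Cassis, and the example document is made up.

```python
from typing import List, Tuple


def covered_text(document: str, begin: int, end: int) -> str:
    """Return the text of a span from its character offsets."""
    return document[begin:end]


def overlapping_sentences(
    sentences: List[Tuple[int, int]], begin: int, end: int
) -> List[int]:
    """Return indices of sentences whose [begin, end) range overlaps the span."""
    return [
        index
        for index, (s_begin, s_end) in enumerate(sentences)
        if s_begin < end and begin < s_end
    ]


# Toy document with five sentences; offsets are (begin, end) character pairs.
document = "We need three things. Eggs. Flour. Sugar. Then we bake."
sentences = [(0, 21), (22, 27), (28, 34), (35, 41), (42, 55)]

# A "list" annotation that starts at "Eggs." and ends after "Sugar.".
list_begin = document.index("Eggs.")
list_end = document.index("Sugar.") + len("Sugar.")

span_text = covered_text(document, list_begin, list_end)
touched = overlapping_sentences(sentences, list_begin, list_end)
print(span_text)
print(touched)
```

The point of the sketch is that the span text comes from the annotation's own offsets, so it is not cut off at the first sentence, while the sentence indices remain available for analyses that need them.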

zesch commented 1 year ago

There currently is a non-optimal (read: somewhat buggy ;) way in which we transform the CAS into the inceptalytics data frame. It is on the agenda to fix it, but as usual it takes more time than expected.

reckart commented 1 year ago

@zesch what I'm trying to say is that there should probably be more than one way to do the transformation depending on the type of analysis one wants to make. A transformation supporting POS-tag analysis may look different from a transformation that wants to look at a co-reference network or at cross-sentence phenomena.

zesch commented 1 year ago

@reckart good point. Have to think about this a bit, but this might be a way around some conceptual problems we were facing.