Count how many sentences have specific layers and values in them

yanirmr commented 2 years ago

Hello,

While I find your project very useful, there is one task I couldn't accomplish: counting the number of sentences that have been annotated with a specific layer, or with a specific label value within that layer.

An example would be:

There are 100 sentences, but only half have been tagged with labels from the POS layer. How can I count how many sentences have been tagged? The information about the POS layer includes only the total number of POS labels, not how they are distributed across the sentences of the document.

The second example is more specific. In the same setup as above, how can I calculate the distribution of a specific POS tag across the document at the sentence level? For example, to answer the question "How many sentences contain an interrogative?"

Thank you!

simulacrum6 commented 2 years ago

Hi,

We are happy that you find the tool useful. To calculate the statistics you are looking for, you need to work with the pandas DataFrame that the views are based on. I hope the following examples help.

I assume you have a situation that looks similar to the following.

project = ... # your Project object

pos_layer = 'de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS'
feature = 'coarseValue'
feature_path = f'{pos_layer}>{feature}'
pos_annos = project.select(annotation=feature_path)

Question 1: How many sentences have been tagged with ANY POS label?

Actually, there is no convenient way to get this information with the current API. You would have to compare the number of sentences in the entire project against the number of sentences in the given POS View.

# First, select the sentence annotation layer in the project. You can find it under project.layers.
sentence_layer = 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence'
feature = 'id'
feature_path = f'{sentence_layer}>{feature}'
sentence_annos = project.select(annotation=feature_path)

Then we select the column with Sentence IDs in the underlying pandas.DataFrame and count the unique Sentence IDs. We do this for both Sentence annotations and POS Annotations, so we can compare the counts.

num_total_sentences = len(sentence_annos.data_frame['sentence'].unique())
num_pos_tagged_sentences = len(pos_annos.data_frame['sentence'].unique())
ratio = num_pos_tagged_sentences / num_total_sentences
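
The comparison above can be tried out without a project at hand. The following is only a minimal sketch with plain pandas: the two toy DataFrames stand in for sentence_annos.data_frame and pos_annos.data_frame, and the 'annotation' column name is an assumption made for illustration (only the 'sentence' column is taken from the snippet above). It also lists which sentence IDs have no POS annotation at all.

```python
import pandas as pd

# Toy stand-ins for sentence_annos.data_frame and pos_annos.data_frame.
# The 'annotation' column name is hypothetical; only 'sentence' is used below.
sentence_df = pd.DataFrame({"sentence": [0, 1, 2, 3]})
pos_df = pd.DataFrame({
    "sentence": [0, 0, 2],
    "annotation": ["NOUN", "VERB", "ADJ"],
})

# Count distinct sentence IDs in each frame.
num_total_sentences = sentence_df["sentence"].nunique()
num_pos_tagged_sentences = pos_df["sentence"].nunique()
ratio = num_pos_tagged_sentences / num_total_sentences

# Sentence IDs that appear in the project but carry no POS annotation.
untagged_ids = sorted(set(sentence_df["sentence"]) - set(pos_df["sentence"]))

print(num_pos_tagged_sentences, num_total_sentences, ratio, untagged_ids)
```

Series.nunique() gives the same count as len(Series.unique()) here, since the toy data contains no missing values.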

Question 2: How many sentences have been tagged with a SPECIFIC POS label?

To get this number, you can use the View.filter_sentences_by_label function. It will discard all sentences that do not have an annotation with the given labels from the View.

num_tagged_sentences = {}  # maps each POS label to the number of sentences containing it
for label in pos_annos.labels:
    filtered_annos = pos_annos.filter_sentences_by_label(label)
    num_tagged_sentences[label] = len(filtered_annos.data_frame['sentence'].unique())

print(num_tagged_sentences['ADJ'])

You can use pandas to calculate the ratios more conveniently.

import pandas as pd
ratio_tagged_sentences = pd.Series(num_tagged_sentences) / num_total_sentences
print(ratio_tagged_sentences)
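
If you prefer to stay in pandas entirely, the per-label counts can also be computed in one pass with a groupby. This is a sketch under assumptions, not part of the library's API: the toy DataFrame stands in for pos_annos.data_frame, and the 'annotation' column holding the label is a hypothetical name, so check the actual column names of your DataFrame first.

```python
import pandas as pd

# Toy stand-in for pos_annos.data_frame; the 'annotation' column name is assumed.
pos_df = pd.DataFrame({
    "sentence": [0, 0, 1, 1, 2],
    "annotation": ["ADJ", "NOUN", "ADJ", "ADJ", "VERB"],
})
num_total_sentences = 4  # e.g. taken from the Sentence layer, as in Question 1

# For each label, count the distinct sentences in which it occurs at least once.
sentences_per_label = pos_df.groupby("annotation")["sentence"].nunique()

# Share of all sentences that contain each label.
ratio_per_label = sentences_per_label / num_total_sentences

print(sentences_per_label)
print(ratio_per_label)
```

Using nunique() rather than a plain count matters here: sentence 1 contains two ADJ annotations but should only be counted once.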

I hope that answers your question.

yanirmr commented 2 years ago

Thank you very much for this detailed response!

yanirmr commented 2 years ago

BTW, I think this information should be included in the documentation or a tutorial. It is very helpful.

catalpa-cl / inceptalytics

Count how many sentences have specific layers and values in them #13
