coastal-science / HALLO-annotation

codebase for the HALLO annotation tool
GNU General Public License v3.0

Add type check or validation to the Call Type of annotation. #11

Closed yue-su closed 1 year ago

yue-su commented 2 years ago

@fsfrazao Fabio, I need some input from you regarding how we would implement this feature. A few examples would be really helpful! Thanks!

fsfrazao commented 2 years ago

@yue-su I think for most fields we could:

- exclude any trailing spaces
- exclude the character used as a separator when exporting annotations to a text file (';' for csv, tab for tsv, etc.)
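A minimal sketch of that clean-up step, assuming it runs on each free-text value before it is saved or exported. The function name `sanitize_field` and the separator set are hypothetical, not taken from the HALLO codebase. Note that `str.strip()` removes leading whitespace as well as trailing.

```python
# Separator characters to drop from field values (assumed set: the ';' and
# tab separators mentioned above; extend it to match the real export formats).
SEPARATORS = frozenset({";", "\t"})


def sanitize_field(value: str, separators=SEPARATORS) -> str:
    """Remove separator characters, then strip surrounding whitespace."""
    without_separators = "".join(ch for ch in value if ch not in separators)
    return without_separators.strip()
```

Removing the separators before stripping means a value such as `"S17;  "` does not keep the spaces that sat in front of the separator.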

Here are some examples of values for call type. Basically, one capital letter and two numbers. It can also have multiple values separated by "/" or a "?" at the end.

call type:
- S17
- S17?
- S16/S17
- S01
- N04
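The format described here (one capital letter plus two digits, optionally repeated with "/" and optionally ending in "?") could be checked with a regular expression. This is only a sketch of that rule as stated; the names `CALL_TYPE_PATTERN` and `is_valid_call_type` are hypothetical, and the pattern would need to change to accept other nomenclatures.

```python
import re

# One capital letter followed by two digits, optionally more such codes
# separated by "/", optionally ending with a single "?".
CALL_TYPE_PATTERN = re.compile(r"[A-Z][0-9]{2}(?:/[A-Z][0-9]{2})*\??")


def is_valid_call_type(value: str) -> bool:
    """Return True if the whole string follows the call type format."""
    return CALL_TYPE_PATTERN.fullmatch(value) is not None
```

With this pattern the five examples listed here are accepted, while variants such as "s17", "S 17" or "S-17" are rejected.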

Have a look at this document for examples in the other fields: https://docs.google.com/document/d/1VbERHwoyIjx73wj_aP6ta1ngWbCiHb4wpC_0OD5z_ZY/edit

yue-su commented 2 years ago

@fsfrazao There are 20 possible values in the call type field. Would it be reasonable just to make these values a drop-down selection input field?

fsfrazao commented 2 years ago

@Yue, I think that would be a fine solution, especially with the newly implemented feature that automatically fills the form with the values of the previous annotations.

Perhaps a dropdown that can also be filled by typing would be a good option. Something like https://react-select.com (see the first example with the "RTL" option enabled)

fsfrazao commented 2 years ago

@oliskir and @scottveirs, I'd like your opinion on this issue.

Our intention is to reduce the variability of values in the annotations. If all fields accept free text, the exported annotations may have several variations of the same value ("s17", "S 17", "S-17", etc.) or even typos ("S!7"). So I think some level of enforced standardization would be good, but the question is to what extent.

These are the three options we are considering:

1) Make every field (except comments) a drop-down with pre-defined values.
2) Same as (1), but also allow the user to add new values.
3) Same as (2), but format new values according to very specific rules.

Option (1) would be more strict and, if the user really doesn't find the value they need for a field, they could select "other" and use the comments field to specify. This option would be hard for fields like call type, which can have combinations (we would have to pre-define every combination)

Option (2) would be more flexible but would leave more room for mistakes. For example, if we only pre-define the individual call types and the user needs to create a new value for S16/S17, they might make a typo.

Option (3) forces any new values into a pre-defined format, according to some rules. For example, the call type field can have values that are made of a single capital letter followed by two digits, and these can be separated by "/" or followed by "?". So if someone tries to enter "s-16/s-17", the form would not accept it. This is my favorite option, but it would require that we think carefully (and collectively) about the rules. For instance, the rules I used above for the call type field would not allow the Humpback nomenclature used by Emily, so the rules would need to be changed if we want to accommodate those (and I think we should!).

If we go with option (3), we could define these rules in the google doc that defines the annotation guidelines.

What are your thoughts?

oliskir commented 2 years ago

This is not an easy one.

What do you mean when you say that a call type can "have combinations"?

In a situation where two or more calls are overlapping, I think the annotator should always draw separate boxes around each call. With this approach, multiple call types would only be assigned to the same box in cases of doubt.

Instead of having the user type "?" for uncertainty, would it be possible to create a separate field called confidence with only two values, certain or uncertain?

Instead of having the user type "/" to separate multiple possible assignments, would it be possible to add a button to achieve the same result? (e.g. there would be a button called 'add call-type assignment' or something like that. Of course with the understanding that this should be used only in cases of doubt and not to capture multiple calls within the same box.)

I agree that it would be desirable to include Emily's call names!

fsfrazao commented 2 years ago

@oliskir, when I said "combinations" I meant the uncertain cases you mentioned. According to the annotation guidelines, the "value1/value2" form can appear in several fields (species, ecotype, call type and pod).

There is a field called confidence, with three options to choose from (low, medium, high). We could just stop using "?" and use this field instead. The annotation guidelines currently say that both are acceptable; maybe the "?" could be eliminated and uncertainty could be indicated exclusively by the confidence field.

Could you elaborate more on your idea of the "/" button? Is the idea that the options for the species, ecotype, pod, and call type fields are fixed and, if you click the button, you're allowed to select a second value for that field? And two would be the maximum number of values, right? Or would you be allowed to say "KW/HW/PWSD", for example?

oliskir commented 2 years ago

Thanks for clarifying, @fsfrazao.

I think it would make sense to stop using "?" and instead use the 'confidence' field with options low, medium, high. Such a field would then be needed at all levels of labelling resolution, i.e., there would be fields 'species_confidence', 'ecotype_confidence', 'call_type_confidence', etc. I think this would be my preferred design.

As for the "/" button: yes, my idea was that the options for species, ecotype, pod and call type should be fixed, and if you click the button you are allowed to select a second value. We could also allow more than two choices, but I don't think this would be too common a need, so if it makes the implementation more complicated, we can limit ourselves to two choices.

An underlying assumption here is that there be a fairly quick line of communication between the annotator and the admin, so that new options can be added with not too large a delay.

fsfrazao commented 1 year ago

@oliskir Thanks for expanding on your suggestions.

The confidence fields are quite easy to implement technically speaking, but I wonder if the interface will get a little cluttered. We can try and see. It would give a greater level of detail than the current guidelines, which associate the confidence level with the "highest" annotation level. I guess that's often the case, but your suggestion would allow the annotator to say they are sure about the call type but unsure about the pod, for example.

The "/" button requires a little more work to implement, especially depending on how the values end up in the annotation table. If two values are entered, should they appear as A/B in the annotations? It's easier if we limit ourselves to 2 choices.
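One way the two ideas might fit together in the annotation table, sketched under the assumptions that at most two values are allowed per field, that two values are exported as "A/B", and that each field carries its own confidence level. The class name `LabelledField` and its attributes are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

CONFIDENCE_LEVELS = ("low", "medium", "high")
MAX_VALUES = 2  # assumption: the "/" button allows one extra value at most


@dataclass(frozen=True)
class LabelledField:
    """One form field (e.g. species) with its candidate values and confidence."""

    values: Tuple[str, ...]
    confidence: str

    def __post_init__(self):
        if not 1 <= len(self.values) <= MAX_VALUES:
            raise ValueError(f"expected 1 to {MAX_VALUES} values, got {len(self.values)}")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence!r}")

    def export(self) -> str:
        """Render the values the way they would appear in the exported table."""
        return "/".join(self.values)
```

For example, `LabelledField(("KW", "HW"), "low").export()` gives `"KW/HW"`, while passing three values raises a `ValueError`.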

When it comes to your underlying assumption, we do have a problem: there's no support for an admin to edit these values. It needs to be done by the developer, and the instance needs to be redeployed. Although I agree it's a good feature and we are already building it into MAIPL, it would require some substantial changes to the HALLO annotation tool.

Preferably, the list of values for each field would be well thought-through and defined. The annotation guidelines already have a good part of these values, so my suggestion would be to expand it to actually include all of the values we want to allow, and then the annotation tool will follow the same specifications. Occasional updates would of course still be possible, but shouldn't be frequent.

We can make the changes to allow an admin to edit these values, but it will take considerably longer.

scottveirs commented 1 year ago

A few thoughts after a bit of beta-testing:

  1. I don't think there's a use case where call type should have more than one value. An annotator should either select a signal type for the active bounding box and indicate their confidence in the classification, or they should mark it unknown (or variable?) and maybe add some thoughts/guesses in the comment field. For example, I found a clear S01 call, labelled it as S01 and indicated high confidence, but added a comment like: "Either last part of S01 is missing or SNR is such that I can't hear it well. This makes this example of an S01 seem a little shorter than expected from the catalog samples."

  2. For the Ecotype and Pod fields, I'm wondering if we could avoid the need for a "/" button by just providing all combinations of likely tags, i.e. for Pod the dropdown could contain J, K, L, JK, JL, KL, and JKL. As an aside, the confidence in the pod is likely to stay the same if there's context from a sighting network, but when there's no visual context it could change for a whole batch based on a final assessment of all call types present. I might suggest that there be an option to set Pod type at the batch level when high-confidence visual context (i.e. photo ID) is available and inference of pod from call types isn't necessary for each annotation.

  3. I'm starting to like the idea of populating the drop-downs for species, ecotype, and signal type dynamically. For example, if I select Species = humpback, then the Ecotype field goes away (or is replaced by a DPS = Distinct Population Segment field?), and Call Type changes to Signal Type and is pre-populated with the 12 non-song vocalization types in Emily's humpback catalog. Alternatively, if I select Species = KW and Ecotype = SRKW, then the Call Type menu gets all the SRKW call types that are in the current version of the Ford online catalog. In the short term, an expanded set of tags could be hard-coded once we've all agreed on transboundary Salish Sea or NE Pacific labels, e.g. via this Google sheet of signal dictionary candidates and/or the OrcaHello tag cloud, which has a lot of noise and non-call signal labels.
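The dynamic drop-downs in point 3 could be driven by a small lookup table. The sketch below is illustrative only: the function `form_fields` and both dictionaries are hypothetical, the call type codes are just the ones mentioned in this thread, and the humpback entries are placeholders rather than the real catalog names.

```python
from typing import Dict, List, Optional, Tuple

# Ecotype options per species; an empty list means the field is hidden.
ECOTYPES: Dict[str, List[str]] = {"KW": ["SRKW"], "humpback": []}

# Call/signal type options keyed by (species, ecotype). Placeholder values.
CALL_TYPES: Dict[Tuple[str, Optional[str]], List[str]] = {
    ("KW", "SRKW"): ["S01", "S04", "S16", "S17"],
    ("humpback", None): ["placeholder-1", "placeholder-2"],
}

# Species for which the "Call Type" field is relabelled.
LABEL_OVERRIDES: Dict[str, str] = {"humpback": "Signal Type"}


def form_fields(species: str, ecotype: Optional[str] = None) -> dict:
    """Return the options the form should show for the current selection."""
    ecotype_options = ECOTYPES.get(species, [])
    # Ignore any stale ecotype selection when the species has no ecotypes.
    key = (species, ecotype if ecotype_options else None)
    return {
        "ecotype_options": ecotype_options,
        "call_type_label": LABEL_OVERRIDES.get(species, "Call Type"),
        "call_type_options": CALL_TYPES.get(key, []),
    }
```

Keeping the options in plain data like this would also make it easier to load them from an admin-editable source later, instead of hard-coding them.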

oliskir commented 1 year ago

I like your suggestions, Scott.

Restricting call type to only one value would certainly simplify things on the ML development side. In my view, ambiguous call type annotations (e.g. could be S01 or S04) are hard to use for model training, and are mainly helpful for model evaluation.

fsfrazao commented 1 year ago

I like Scott's suggestions as well.

I'll try to summarize the changes to the annotation form here and separate them into short-term and long-term goals.

Short-term changes (to be implemented in the next few weeks)

1) The Species, Ecotype, Pod and Call type fields should have a predefined list of values that can be picked in a drop-down menu.

2) The Species field will accept more than one value from the list. The Pod and Ecotype fields will have every plausible combination as an option (J, K, L, JK, JL, KL, and JKL). The Call type field will only accept one single call value per annotation box.

3) Each of the Species, Ecotype, Pod and Call type fields will have a confidence selector associated with it.
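The three short-term changes above could be captured in one form schema plus a validation step. This is a sketch under assumptions: `FORM_SCHEMA`, `pod_combinations` and `validate` are hypothetical names, the option lists are short placeholders using codes from this thread, and an annotation is assumed to be a dict mapping each field to a list of selected values plus a `<field>_confidence` entry.

```python
from itertools import combinations

CONFIDENCE_LEVELS = ("low", "medium", "high")


def pod_combinations(pods=("J", "K", "L")):
    """Every non-empty combination of pods, shortest first."""
    return [
        "".join(combo)
        for size in range(1, len(pods) + 1)
        for combo in combinations(pods, size)
    ]


# Placeholder option lists; only Species accepts more than one value.
FORM_SCHEMA = {
    "species": {"options": ["KW", "HW"], "multiple": True},
    "ecotype": {"options": ["SRKW"], "multiple": False},
    "pod": {"options": pod_combinations(), "multiple": False},
    "call_type": {"options": ["S01", "S16", "S17"], "multiple": False},
}


def validate(annotation: dict) -> list:
    """Return a list of error messages; an empty list means the form is valid."""
    errors = []
    for field, spec in FORM_SCHEMA.items():
        values = annotation.get(field, [])
        if not values:
            errors.append(f"{field}: no value selected")
        if len(values) > 1 and not spec["multiple"]:
            errors.append(f"{field}: only one value allowed")
        for value in values:
            if value not in spec["options"]:
                errors.append(f"{field}: {value!r} is not a predefined option")
        if annotation.get(f"{field}_confidence") not in CONFIDENCE_LEVELS:
            errors.append(f"{field}: confidence must be low, medium or high")
    return errors
```

Here `pod_combinations()` yields J, K, L, JK, JL, KL and JKL, matching the list in item 2.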

Long-term goals

1) Dynamically populate the list of possible values for other fields depending on the value of the Species field

2) Add an option to set Pod (and pod confidence level) for the whole batch

3) Add an option to allow an admin to edit the list of values for the Species, Ecotype, Pod and Call type fields.

Is this a good summary of the desired changes?