Toloka / crowd-kit

Control the quality of your labeled data with the Python tools you already know.
https://crowd-kit.readthedocs.io/
Other
213 stars 16 forks source link

feat(ds): add init_error for worker init error matrix #114

Closed shenxiangzhuang closed 3 months ago

shenxiangzhuang commented 3 months ago

Close #112

Checklist

shenxiangzhuang commented 3 months ago

This PR contains major changes to implement the feature in #112 . If you have any suggestions about the implementation, I'd love to make further improvements. Then, I will add the required docs and tests in this PR or a new one.

dustalov commented 3 months ago

I believe it would be useful to have it in Crowd-Kit, but I have two comments due to the significance of the proposed changes.

  1. If I remember other implementations correctly, the initial estimate of worker errors is usually applied once at the beginning, effectively replacing the first M-step. The current approach adjusts the error matrices at each iteration.
  2. Would it be possible to cover this new functionality with tests?
shenxiangzhuang commented 3 months ago

Thanks for your timely reply!

  1. The implementaion is applied only once, and just as you said in the first M-step. Here the new defination of _m_step take initial_error as a optional parameter, and we only pass the initial_error at the first M-step(errors = self._m_step(data, probas, initial_error)), in the inner iterations we just ignore it(pass None) to _m_step method(errors = self._m_step(data, probas)), so it works just as before.

  2. Sure, I would like to add tests for this functionality. In addition, I will also add some usage examples in docs to make it easier to use.

dustalov commented 3 months ago

It would be great to have docs and tests. Thank you for the clarifications. This new parameter is only applied at the very first iteration. I am wondering why this code performs an addition instead of an assignment. We already estimate these matrices from data, and right after that, we adjust them with initial_error.

shenxiangzhuang commented 3 months ago

It would be great to have docs and tests. Thank you for the clarifications. This new parameter is only applied at the very first iteration. I am wondering why this code performs an addition instead of an assignment. We already estimate these matrices from data, and right after that, we adjust them with initial_error.

Great observation! The addition serves to address the issue of incomplete initialization error matrices, particularly when conducting truth inference in a production environment. Essentially, in such environments, we might not have all the workers' initialization error matrices available for the truth inference process.

To clarify, the worker's initialization error matrix is more akin to a "count matrix" rather than a "probability matrix" (I will provide additional documentation on this later). It resembles the errors in the M-step before the averaging process.

Therefore, the addition operation on the error matrix functions as an accumulation. We combine the historical error matrix counts of the worker with the current task's error matrix counts to obtain an accumulated error matrix. In cases where we lack the initialization error matrices for certain workers, their error matrices remain unchanged. However, when available, we obtain an accumulated error matrix, which will later be averaged to derive the final "probability matrix."

shenxiangzhuang commented 3 months ago

The docs preview: image

dustalov commented 3 months ago

I'd still prefer to have the option to assign the passed initial_error instead of adding it to the pre-computed error matrices. It might be a boolean or string literal option adjusting the approach.

For example, it might be useful for reproducibility to run the new aggregation with the same matrices as estimated before without making the new initialization of them.

Could you please describe your scenario so I can understand it well?

shenxiangzhuang commented 3 months ago

Sorry for the unclear illustration about the motivation of this implementation.

Ours scenario is pretty simple actually: all the T tasks are annotated by N workers(every one only annotates part of the tasks, not all), while M (<N) workers have initial_error because they have history annotation records data on this kind task. The others N-M workers don't have initial_error, because this is their first time to do these tasks.

In this case, if we simply assign the passed initial_error, we only got the error matrices for the M workers, which leads to an uncompleted errors_.(I think the errors_ should contains all the workers' error matrices to make the EM algorithm work. If I was wrong, correct me please.)

So, what we can do to use the N-M workers' initial error matrices and want a completed errors_? We choose to use the addition to do this which works more like a weighted sum of the initial error matrix with priori error matrix(evaluated from the task data). For the M workers, their error matrices(after first m-step) are weighted sum of the initial error matrix and priori error matrix, the weights come from the annotation task count; for the N-M workers, their error matrices is just priori error matrix, which is set the weight of priori error matrix to 1.

shenxiangzhuang commented 3 months ago

I'd still prefer to have the option to assign the passed initial_error instead of adding it to the pre-computed error matrices. It might be a boolean or string literal option adjusting the approach.

I think the idea of ​​an extra method parameter is a good one. When using the assign method, in addition to making sure initial_error is "complete", there should be some data checking logic, as discussed earlier.

dustalov commented 3 months ago

I see, thank you. Let's add it and I'll be happy to merge it.

shenxiangzhuang commented 3 months ago

I have added initial_error_strategy for the difference usages of initial_error with examples, docs and tests. Please review again.

PS: The ci still fails due to some mypy checking errors, maybe we can fix them in another pr after figure out the reasons.( #115 #116)

dustalov commented 3 months ago

These issues have been fixed in main.

shenxiangzhuang commented 3 months ago

These issues have been fixed in main.

Thanks for your quick fix! The ci works well now. Please review again.