Closed shenxiangzhuang closed 3 months ago
This PR contains major changes to implement the feature in #112 . If you have any suggestions about the implementation, I'd love to make further improvements. Then, I will add the required docs and tests in this PR or a new one.
I believe it would be useful to have it in Crowd-Kit, but I have two comments due to the significance of the proposed changes.
Thanks for your timely reply!
The implementaion is applied only once, and just as you said in the first M-step. Here the new defination of _m_step
take initial_error
as a optional parameter, and we only pass the initial_error
at the first M-step(errors = self._m_step(data, probas, initial_error)
), in the inner iterations we just ignore it(pass None
) to _m_step
method(errors = self._m_step(data, probas)
), so it works just as before.
Sure, I would like to add tests for this functionality. In addition, I will also add some usage examples in docs to make it easier to use.
It would be great to have docs and tests. Thank you for the clarifications. This new parameter is only applied at the very first iteration. I am wondering why this code performs an addition instead of an assignment. We already estimate these matrices from data, and right after that, we adjust them with initial_error
.
It would be great to have docs and tests. Thank you for the clarifications. This new parameter is only applied at the very first iteration. I am wondering why this code performs an addition instead of an assignment. We already estimate these matrices from data, and right after that, we adjust them with
initial_error
.
Great observation! The addition serves to address the issue of incomplete initialization error matrices, particularly when conducting truth inference in a production environment. Essentially, in such environments, we might not have all the workers' initialization error matrices available for the truth inference process.
To clarify, the worker's initialization error matrix is more akin to a "count matrix" rather than a "probability matrix" (I will provide additional documentation on this later). It resembles the errors in the M-step before the averaging process.
Therefore, the addition operation on the error matrix functions as an accumulation. We combine the historical error matrix counts of the worker with the current task's error matrix counts to obtain an accumulated error matrix. In cases where we lack the initialization error matrices for certain workers, their error matrices remain unchanged. However, when available, we obtain an accumulated error matrix, which will later be averaged to derive the final "probability matrix."
The docs preview:
I'd still prefer to have the option to assign the passed initial_error
instead of adding it to the pre-computed error matrices. It might be a boolean or string literal option adjusting the approach.
For example, it might be useful for reproducibility to run the new aggregation with the same matrices as estimated before without making the new initialization of them.
Could you please describe your scenario so I can understand it well?
Sorry for the unclear illustration about the motivation of this implementation.
Ours scenario is pretty simple actually: all the T
tasks are annotated by N
workers(every one only annotates part of the tasks, not all), while M
(<N
) workers have initial_error
because they have history annotation records data on this kind task. The others N-M
workers don't have initial_error
, because this is their first time to do these tasks.
In this case, if we simply assign the passed initial_error
, we only got the error matrices for the M
workers, which leads to an uncompleted errors_
.(I think the errors_
should contains all the workers' error matrices to make the EM algorithm work. If I was wrong, correct me please.)
So, what we can do to use the N-M
workers' initial error matrices and want a completed errors_
? We choose to use the addition to do this which works more like a weighted sum of the initial error matrix with priori error matrix(evaluated from the task data). For the M
workers, their error matrices(after first m-step) are weighted sum of the initial error matrix and priori error matrix, the weights come from the annotation task count; for the N-M
workers, their error matrices is just priori error matrix, which is set the weight of priori error matrix to 1.
I'd still prefer to have the option to assign the passed
initial_error
instead of adding it to the pre-computed error matrices. It might be a boolean or string literal option adjusting the approach.
I think the idea of an extra method parameter is a good one. When using the assign method, in addition to making sure initial_error
is "complete", there should be some data checking logic, as discussed earlier.
I see, thank you. Let's add it and I'll be happy to merge it.
I have added initial_error_strategy
for the difference usages of initial_error
with examples, docs and tests. Please review again.
PS: The ci still fails due to some mypy checking errors, maybe we can fix them in another pr after figure out the reasons.( #115 #116)
These issues have been fixed in main
.
These issues have been fixed in
main
.
Thanks for your quick fix! The ci works well now. Please review again.
Close #112
Checklist