Closed by gerasy1987 3 years ago
Adding vs Multiplying probabilities:
Currently sample_tls constructs sampling weights as the dot product between the indicator (unit i observed at location j) and the sampling probability of location j, i.e. given three locations we calculate the weight for unit i as p(i1) + p(i2) + p(i3). This assumes that the events of observing unit i at each location are mutually exclusive. Is this a reasonable assumption?
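To make the concern concrete, here is a quick numeric sketch (a Python stand-in, since the package itself is R; the location probabilities are made up). If the observation events are independent rather than mutually exclusive, adding the probabilities overstates the chance of observing the unit at least once:

```python
# Illustrative location-level probabilities for one unit i
p = [0.2, 0.3, 0.5]  # p(i1), p(i2), p(i3)

# Current approach: treat events as mutually exclusive and add
additive = sum(p)  # 0.2 + 0.3 + 0.5 = 1.0

# If the events are independent, the probability of being
# observed at least once is the complement of never being observed:
not_observed = 1.0
for pj in p:
    not_observed *= (1.0 - pj)
at_least_once = 1.0 - not_observed  # 1 - (0.8)(0.7)(0.5) = 0.72

print(additive, at_least_once)
```

The gap between the two grows with the number of locations a unit visits, which is exactly where the additive weights go wrong.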
Current Method:
@till-tietz Here is my sense of why this is producing wrong weights right now.
@gerasy1987
point 1 makes a lot of sense. This issue does not arise in the method of sampling a set number of units per cluster (implemented in the pull request relating to issue #37). For the original sampling method this should be an easy fix. We essentially just want each unit-location pair to be a row in a data.frame and sample from that, right? (Or rather, create a sampling weight vector based on the number of time locations a unit is present at.)
I think I see what you mean with point 2. We essentially just have the first piece of the weight equation (i.e. the probability of being present at a location). In order to calculate the weight, do you then propose something like: p(unit present at time location) * number of time location sampling rounds (probably just 1) * location size?
pushed a proposed solution to point 1 (accounting for increased sampling probability due to presence in multiple locations) in branch sample-tls_sampling_probability
@till-tietz thanks. I'm reviewing the new branch now.
On your response to point 1, I'm not sure I follow why we don't encounter that issue in the case of fixed sampling per location, since in that case we have a chance to sample the same subject in multiple locations. So we need a fix for that case as well.
On your response to the second issue, yes, I think that's correct; we only need to account for this in the case of a fixed number of units per location, since the probability of being sampled will vary for each hidden population member in each area (due to possible location size differences).
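A quick numeric illustration of the location-size point (a Python sketch with made-up numbers; the package itself is R): sampling a fixed number of units per location implies a per-location inclusion probability of roughly n / location_size, which differs across locations of different sizes.

```python
# Illustrative only: a fixed number sampled per location means the
# inclusion probability varies with location size.
n_per_location = 10
location_sizes = [20, 50, 100]  # hypothetical sizes of three locations
incl_probs = [n_per_location / size for size in location_sizes]
print(incl_probs)  # smaller locations -> higher inclusion probability
```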
Hi both, I'm a bit behind on this, but are we taking account of the fact that people are only sampled once (I presume)?
If q_i is the probability that location i will be sampled and r_i is the probability that a unit is sampled in location i, and i indexes the places a person visits, then: 1 - prod_i(1 - q_i * r_i)
However, this assumes that locations are sampled independently, and also that the likelihood of being present is independent across locations, which may not be true?
For such tricky sampling probabilities we can verify directly via repeated sampling that individuals are indeed sampled according to calculated probabilities.
@macartan
Yes, we only sample each subject once, but we do not add duplicate subjects (i.e., if we encounter the same subject in two different locations, we count them as one, and thus the resulting sample size can be slightly smaller).
As for the weight, yes, we should actually use this formula instead of what we do now.
On the independence of location sampling/subject presence, I agree the way we implement it now is not very flexible, but I think we need to make this simple structure work first, mainly because it seems very hard to specify location sampling and/or subject presence interdependence systematically. We can probably implement this later.
To follow up, I prefer to rewrite the whole sampling procedure using the stacking approach where we stack all sampled locations on top of each other and then sample from this dataset. This actually allows us to directly implement visibility into this sampling that is currently largely omitted. Then we can collapse across repeated subjects, avoiding the messy way we currently record multi-location presence. Finally, we can calculate the weights using the approach @macartan proposed. I will take a stab at this.
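The stacking idea above can be sketched roughly as follows (a Python stand-in for the proposed R rewrite; all names and numbers are hypothetical): stack one row per unit-location pair, collapse across repeated subjects, and weight each unit by the inverse of its overall inclusion probability using the formula proposed above.

```python
from itertools import groupby

# Stacked data: one row per (unit, location) presence
stacked = [
    {"unit": 1, "location": "A"},
    {"unit": 1, "location": "B"},  # unit 1 is present at two locations
    {"unit": 2, "location": "B"},
]
q = {"A": 0.8, "B": 0.5}  # per-location sampling probabilities
r = {"A": 0.4, "B": 0.6}  # within-location unit sampling probabilities

def inclusion_weight(rows):
    """Inverse of the unit's overall inclusion probability,
    1 / (1 - prod_j(1 - q_j * r_j)) over the locations it visits."""
    p_not = 1.0
    for row in rows:
        loc = row["location"]
        p_not *= 1.0 - q[loc] * r[loc]
    return 1.0 / (1.0 - p_not)

# Collapse across repeated subjects and compute one weight per unit
weights = {}
for unit, rows in groupby(sorted(stacked, key=lambda x: x["unit"]),
                          key=lambda x: x["unit"]):
    weights[unit] = inclusion_weight(list(rows))

# unit 1: 1 / (1 - (1 - 0.32)(1 - 0.30)); unit 2: 1 / 0.30
print(weights)
```

One attraction of this layout is that visibility can enter directly as a row-level sampling probability before the collapse step.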
@gerasy1987
I probably misunderstood the issue, but I thought the problem was that sampling subjects with equal probability and then recording multi-location presence didn't properly reflect the actual sampling procedure / the increased probability of sampling a unit that, e.g., is present at all locations. What I meant to say is that the method currently implemented for fixed sampling does account for this, as it samples in each location independently. Maybe we can discuss all this via Zoom tomorrow.
@till-tietz: big rewrite of TLS sampling in 13be661 and 9ef58e6 attempting to solve the weighting issue. Also, we now allow sampling all units, not just the hidden group, if hidden_var = NULL is specified. It would be great if you could check whether this looks OK.
@gerasy1987 Will review everything tonight + tomorrow.
The revised sample_tls function looks good to me. The purrr::when function was a really handy addition. I'll run some tests locally just to double-check things.
One small thing: in target_cluster_type <- match.arg(target_cluster_type) we might have to specify NULL as the first argument as match.arg has no default value for its arg argument I think.
My sense is that the weights constructed by the sample_tls() function are incorrect and need to be checked, since the estimates coming from TLS samples are off even in fairly straightforward cases.