Closed by gerasy1987 3 years ago
Adding vs Multiplying probabilities:
Currently sample_tls constructs sampling weights as the dot product between the indicator (unit i observed at location j) and the sampling probability of location j, i.e. given three locations we calculate the weight for unit i as p(i1) + p(i2) + p(i3). This assumes that the events of observing unit i at each location are mutually exclusive. Is this a reasonable assumption?
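To make the concern concrete, here is a quick numeric sketch (a Python stand-in, since the package itself is R; the location probabilities are made up). If the observation events are independent rather than mutually exclusive, adding the probabilities overstates the chance of observing the unit at least once:

```python
# Illustrative location-level probabilities for one unit i
p = [0.2, 0.3, 0.5]  # p(i1), p(i2), p(i3)

# Current approach: treat events as mutually exclusive and add
additive = sum(p)  # 0.2 + 0.3 + 0.5 = 1.0

# If the events are independent, the probability of being
# observed at least once is the complement of never being observed:
not_observed = 1.0
for pj in p:
    not_observed *= (1.0 - pj)
at_least_once = 1.0 - not_observed  # 1 - (0.8)(0.7)(0.5) = 0.72

print(additive, at_least_once)
```

The gap between the two grows with the number of locations a unit visits, which is exactly where the additive weights go wrong.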
Current Method:
@till-tietz Here is my sense of why this is producing wrong weights right now.
@gerasy1987
point 1 makes a lot of sense. This issue does not arise in the method of sampling a set number of units per cluster (implemented in the pull request relating to issue #37). For the original sampling method this should be an easy fix. We essentially just want each unit-location pair to be a row in a data.frame and sample from that, right? (Or rather, create a sampling weight vector based on the number of time locations a unit is present at.)
I think I see what you mean with point 2. We essentially just have the first piece of the weight equation (i.e. the probability of being present at a location). In order to calculate the weight, do you then propose something like: p(unit present at time location) * number of time location sampling rounds (probably just 1) * location size?
pushed a proposed solution to point 1 (accounting for increased sampling probability due to presence in multiple locations) in branch sample-tls_sampling_probability
@till-tietz thanks. I'm reviewing the new branch now.
On your response to point 1, I'm not sure I follow why we don't encounter that issue in the case of fixed sampling per location, since in that case we have a chance to sample the same subject in multiple locations. So we need a fix for that case as well.
On your response to the second issue, yes, I think that's correct; we only need to account for this in the case of a fixed number of units per location, since the probability of being sampled will vary for each hidden population member in each area (due to possible location size differences).
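A quick numeric illustration of the location-size point (a Python sketch with made-up numbers; the package itself is R): sampling a fixed number of units per location implies a per-location inclusion probability of roughly n / location_size, which differs across locations of different sizes.

```python
# Illustrative only: a fixed number sampled per location means the
# inclusion probability varies with location size.
n_per_location = 10
location_sizes = [20, 50, 100]  # hypothetical sizes of three locations
incl_probs = [n_per_location / size for size in location_sizes]
print(incl_probs)  # smaller locations -> higher inclusion probability
```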
Hi both, I'm a bit behind on this, but are we taking account of the fact that people are only sampled once (I presume)?
If q_i is the probability that location i will be sampled and r_i is the probability that a unit is sampled in location i, and i indexes the places a person visits, then: 1 - prod_i(1 - q_i * r_i)
However, this assumes that locations are sampled independently, and also that the likelihood of being present is independent across locations, which may not be true?
For such tricky sampling probabilities we can verify directly via repeated sampling that individuals are indeed sampled according to calculated probabilities.
@macartan
Yes, we only sample each subject once, but we do not add duplicate subjects (i.e., if we encounter the same subject in two different locations, we count them as one, and thus the resulting sample size can be slightly smaller).
As for the weight, yes, we should actually use this formula instead of what we do now.
On the independence of location sampling/subject presence, I agree the way we implement it now is not very flexible, but I think we need to make this simple structure work first, mainly because it seems very hard to specify location sampling and/or subject presence interdependence systematically. We can probably implement this later.
To follow up, I prefer to rewrite the whole sampling procedure using the stacking approach where we stack all sampled locations on top of each other and then sample from this dataset. This actually allows us to directly implement visibility into this sampling that is currently largely omitted. Then we can collapse across repeated subjects, avoiding the messy way we currently record multi-location presence. Finally, we can calculate the weights using the approach @macartan proposed. I will take a stab at this.
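The stacking idea above can be sketched roughly as follows (a Python stand-in for the proposed R rewrite; all names and numbers are hypothetical): stack one row per unit-location pair, collapse across repeated subjects, and weight each unit by the inverse of its overall inclusion probability using the formula proposed above.

```python
from itertools import groupby

# Stacked data: one row per (unit, location) presence
stacked = [
    {"unit": 1, "location": "A"},
    {"unit": 1, "location": "B"},  # unit 1 is present at two locations
    {"unit": 2, "location": "B"},
]
q = {"A": 0.8, "B": 0.5}  # per-location sampling probabilities
r = {"A": 0.4, "B": 0.6}  # within-location unit sampling probabilities

def inclusion_weight(rows):
    """Inverse of the unit's overall inclusion probability,
    1 / (1 - prod_j(1 - q_j * r_j)) over the locations it visits."""
    p_not = 1.0
    for row in rows:
        loc = row["location"]
        p_not *= 1.0 - q[loc] * r[loc]
    return 1.0 / (1.0 - p_not)

# Collapse across repeated subjects and compute one weight per unit
weights = {}
for unit, rows in groupby(sorted(stacked, key=lambda x: x["unit"]),
                          key=lambda x: x["unit"]):
    weights[unit] = inclusion_weight(list(rows))

# unit 1: 1 / (1 - (1 - 0.32)(1 - 0.30)); unit 2: 1 / 0.30
print(weights)
```

One attraction of this layout is that visibility can enter directly as a row-level sampling probability before the collapse step.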
@gerasy1987
I probably misunderstood the issue, but I thought the problem was that sampling subjects with equal probability and then recording multi-location presence didn't properly reflect the actual sampling procedure / the increased probability of sampling a unit that, e.g., is present at all locations. What I meant to say is that the method currently implemented for fixed sampling does account for this, as it samples in each location independently. Maybe we can discuss all this via Zoom tomorrow.
@till-tietz: big rewrite of TLS sampling in 13be661 and 9ef58e6 attempting to solve the weighting issue. Also, we now allow sampling all units, not just the hidden group, if hidden_var = NULL is specified. It would be great if you could check whether this looks OK.
@gerasy1987 Will review everything tonight + tomorrow.
The revised sample_tls function looks good to me. The purrr::when function was a really handy addition. I'll run some tests locally just to double-check things.
One small thing: in target_cluster_type <- match.arg(target_cluster_type) we might have to specify NULL as the first argument as match.arg has no default value for its arg argument I think.
My sense is that the weights constructed by the sample_tls() function are incorrect and need to be checked, since the estimates coming from TLS samples are off even in fairly straightforward cases.