OHDSI / Aphrodite


feature mismatch between different sites #17

Closed SSMK-wq closed 4 years ago

SSMK-wq commented 4 years ago

Hello Team,

I read your paper on phenotype validation and portability. When you ran the model at your site, let's say you had more distinct measurement and condition records in your dataset, leading to a lot of features, and the model ultimately picks the best ones. However, when you send your model to another site that may not have all of those binary features (conditions and measurements), how did you address this issue?

I understand we can preprocess the target site's features (if they have a lot of unnecessary features), but what if they don't have the features we expect (because our source model was built using those features, and for validation we expect to see them)? Did you just create those features and fill them with 0 since they aren't present?

For example, site A has f1, f2, f3, f4, f5, f6, ..., f10 (10 features), and I built a model using site A's data. Now I want to validate this model at site B, which has only f1, f2, f3 (3 features). Should I create dummy f4, f5, ..., f10 with value 0?

Your input on how this is addressed would be helpful.

jmbanda commented 4 years ago

When you send the binary model, you also send it with a list of features. The other site extracts their features, the missing features get added zeroed out, and the extra features get removed. This is because the feature vectors need to match exactly for the 'transferred' model to make any predictions. If you share only the steps to build the model (keyword lists), then you are comparing the model-building process, not the model itself.
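A minimal sketch in R of that alignment step, assuming the transferred model ships with a character vector of its feature names and the local site has a feature matrix with named columns; the object names here (`local_features`, `model_feature_names`) are illustrative, not Aphrodite's actual internals:

```r
# Align a local site's feature matrix to the feature list that travels
# with the transferred model: zero-fill missing features, drop extras,
# and reorder columns to match the original training data exactly.
align_features <- function(local_features, model_feature_names) {
  # Features the model expects but the local site does not have
  missing_cols <- setdiff(model_feature_names, colnames(local_features))
  if (length(missing_cols) > 0) {
    zeros <- matrix(0, nrow = nrow(local_features), ncol = length(missing_cols),
                    dimnames = list(NULL, missing_cols))
    local_features <- cbind(local_features, zeros)
  }
  # Keep only the model's features, in the model's column order
  local_features[, model_feature_names, drop = FALSE]
}

# Usage (site B has only f1-f3, the model was trained on f1-f10):
# aligned     <- align_features(siteB_matrix, model_feature_names)
# predictions <- predict(transferred_model, newdata = aligned)
```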

SSMK-wq commented 4 years ago

Hi @jmbanda,

One follow-up question on this.

a) We generate a keyword list (of concept ids) to label the datasets. But during model building, do we have to remove the features (concepts) that were used to label the datasets, so we can find out whether there are any other useful features that help distinguish positive from negative cases? If we don't exclude them, we might get the same concept_ids we used for labeling (when present in the patient records) as the top important features; I tried it and that's what I see. So am I right to understand that we have to remove them?

b) Or, since the validation dataset is going to be a totally different dataset (from other sites), do we not have to remove them? But in that case, I could just apply the same imperfect heuristic (because the keywords used to label come up as the important features, and I wouldn't need a model at all).

I am a bit confused. May I kindly request your help with this?

jmbanda commented 4 years ago

Hello,

a) Yes, this is to avoid these being the top predictive features, forcing the model to instead be built on 'everything else' those patients have in common.
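A short sketch of that exclusion in R, assuming the feature columns are named by concept id and the labeling keywords are available as a character vector; `label_concept_ids` is a hypothetical name, not an Aphrodite object:

```r
# Drop the concepts used to label the cohort before model building,
# so the model is forced to learn from the remaining features.
drop_label_concepts <- function(feature_matrix, label_concept_ids) {
  keep <- setdiff(colnames(feature_matrix), label_concept_ids)
  feature_matrix[, keep, drop = FALSE]
}

# train_features <- drop_label_concepts(train_features, label_concept_ids)
```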

b) No, this is not the case. If you are testing model portability, you should remove all features not present in the original model. If you are just using the same 'heuristic', you should use... the same heuristic; otherwise you won't be able to have comparable results.