Qs Doris - Githubissues

Dear Dr. Meinfelder and Ms. Stingl,

Our group (2a. Tree-based and pmm-based Multiple Imputation methods) would be grateful for the possibility of meeting with you on Monday, 29 January, between 10 am and 2 pm to discuss our presentation topic. In particular, we would like to talk about the following questions:

Code:

Implementation of missing values in the data: is it enough to make 5 to 6 variables partially missing according to a certain scheme, or do we have to make every variable partially missing so that the data look more realistic?
According to our topic description, we need to use data with 10 to 50 variables. Does this mean that we have to run multiple simulation studies with different data sets, or is one simulation study with one data set enough?
MI output is commonly stored in either a long form or a wide form. With 10 imputations, for example, the long form copies the whole data set 10 times, which can be time-consuming but is the traditional way to impute. The wide form only copies the variables that are being imputed into additional columns, which saves time. Is this something we should reconsider in the code and keep an eye on (SHOULD we use the long or the wide form)?
Since we are using tree-based methods, which do not assume anything about the data (no relationship), do we still have to consider this in our analysis and look for variables that might be related? Or should we try categorical data and mix it with other data types?
Should we mix MCAR and MAR as in your example in the lecture, or should we also consider MNAR? Can tree-based models overcome this very difficult task and give unbiased coefficients?
Should we run MCAR and MAR separately and then look at the results, or should we mix them as shown in the example in the lecture?
Does the type of a variable make a difference in tree-based imputation? E.g. should a binary variable be kept as binary or converted with as.factor()?
To what extent should the code be commented?
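To make the MCAR/MAR questions above concrete, here is a minimal sketch of how the two mechanisms differ when values are deleted from a complete data set. It is plain Python on toy data, not our R code; the function names, the columns `x` and `y`, the 30% share and the logistic slope are all assumptions chosen for illustration.

```python
import math
import random

def ampute_mcar(rows, target, p, rng):
    """MCAR: delete `target` with the same probability p in every row."""
    return [
        {**row, target: None} if rng.random() < p else dict(row)
        for row in rows
    ]

def ampute_mar(rows, target, driver, rng, slope=1.5):
    """MAR: the deletion probability depends only on the fully
    observed `driver` column (logistic in the centred driver)."""
    mean = sum(row[driver] for row in rows) / len(rows)
    out = []
    for row in rows:
        p = 1.0 / (1.0 + math.exp(-slope * (row[driver] - mean)))
        out.append({**row, target: None} if rng.random() < p else dict(row))
    return out

def mean_x_where_y_missing(rows):
    """Average of x over the rows in which y was deleted."""
    xs = [r["x"] for r in rows if r["y"] is None]
    return sum(xs) / len(xs)

rng = random.Random(42)
# Toy data: two independent standard-normal columns (an assumption).
data = [{"x": rng.gauss(0, 1), "y": rng.gauss(0, 1)} for _ in range(2000)]

mcar = ampute_mcar(data, "y", 0.3, rng)
mar = ampute_mar(data, "y", "x", rng)

# Under MCAR the rows with missing y look like all other rows in x;
# under MAR they are shifted towards large x.
print(round(mean_x_where_y_missing(mcar), 2))
print(round(mean_x_where_y_missing(mar), 2))
```

Running the two mechanisms separately, as in this sketch, keeps their effects apart; mixing them would mean applying both functions to different target columns of the same data set.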

Tree-based methods:

Do we have to describe the MICE approach in some detail, or can we assume it as background knowledge and skip it completely?
Is a more extended description of tree-based methods needed?
Is the difference between mice and miceRanger sufficiently explained, or is there a need to dive deeper?

References:

Is there a citation and reference style we should follow?
Approximately how many references are expected?
Do R package manuals have to be referenced?

Attached you can find our preliminary results: the R code, the data set used and the poster.

Research question
Is randomForest faster?
Simulation study aimed at a speed comparison
- maximum run time of 2 days
Do not focus on the missingness mechanisms
"+": bias, coverage rate; "-": MSE
MNAR: drop it
Reason for deviations: variance-driven or bias-driven? Drop coverage? Categorical variables are time sinks
5-6 metric variables and other categorical ones (one scenario)

Scenarios: different categorical variables (with different numbers of levels)

Code: already well commented

mice counts as background knowledge; do not show much on tree-based methods and random forests. References: be consistent. Number of references: around 10. Cite R packages.

Time slot of 30 minutes: 20 minutes of presentation

Research question
- What is the difference between mice and miceRanger?
- The run time. This is therefore where the focus should be!

To find out
- What are the "brakes" that account for miceRanger being faster than mice?
- On what theoretical basis can this be explained?

Approach:
1) First: compare a high share of missing values (50%) with a smaller share of missing values.
2) Do not play around much with the imputation parameters. Quality of the imputations -> that is not the point of the presentation.

Coverage of miceRanger
- We know very little about miceRanger.
- Why do we have a coverage rate of 60%?
- Is the reason for the deviation variance-driven or bias-driven?
- If it is mainly due to bias, then leave out the coverage.
- Coverage and bias would be of interest here. MSE is only secondary.
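One way to check whether a low coverage rate is variance-driven or bias-driven is to compute bias, variance, MSE and coverage from the simulation replications and use the identity MSE = bias² + variance. The sketch below is plain Python with made-up numbers, not our results; `evaluate`, the true value 2.0 and both toy cases are assumptions for illustration.

```python
import random
import statistics

def evaluate(estimates, std_errors, truth, z=1.96):
    """Summarise the replications of one coefficient."""
    bias = statistics.fmean(estimates) - truth
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    # Share of normal-approximation 95% intervals that contain the truth.
    covered = sum(
        1 for e, se in zip(estimates, std_errors)
        if e - z * se <= truth <= e + z * se
    )
    return {
        "bias": bias,
        "variance": variance,
        "mse": mse,
        "coverage": covered / len(estimates),
    }

rng = random.Random(1)
truth = 2.0

# Case A (assumed): unbiased estimates, reported standard errors too small.
a = evaluate([rng.gauss(truth, 0.5) for _ in range(1000)], [0.2] * 1000, truth)

# Case B (assumed): honest standard errors, but shifted estimates.
b = evaluate([rng.gauss(truth + 0.4, 0.2) for _ in range(1000)], [0.2] * 1000, truth)

for name, res in (("A", a), ("B", b)):
    print(name, {k: round(v, 3) for k, v in res.items()})
```

Both toy cases under-cover, but only case B shows a bias that dominates its MSE; in case A the MSE is almost entirely variance.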

Further: should MNAR, MCAR and MAR also be compared in terms of run time?
- DO NOT DO MNAR!

Run-time comparisons: where should we cut back?

Simulation cycles
- Go down to 100. Enough for stable estimates of the bias and the root mean squared error.
- This also makes the run time easier to handle.

Data set
- Number of variables with missing values: 5 to 6, and then up to 15 to 20.
- 43 variables with missing values is too many.
- Categorical variables are the biggest time sink.
- Split: 5-6 metric variables with missing values, and then additionally categorical variables with missing values.
- 4 scenarios remain: 2 different sets of variables with missing values and two different shares of missing values.
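The 2 x 2 design and the run-time budget can be written down as a small grid. This is a planning sketch in plain Python, not our R code. The 100 cycles, the 50% share and the 2-day limit come from the notes; the concrete counts 6 and 20, the lower share of 10% and the assumption that all runs execute one after another are placeholders.

```python
from itertools import product

N_CYCLES = 100                      # simulation cycles per scenario
N_MISSING_VARS = (6, 20)            # assumed picks from "5 to 6" and "15 to 20"
MISSING_SHARES = (0.1, 0.5)         # 0.5 from the notes; 0.1 is a placeholder
METHODS = ("mice", "miceRanger")    # the two packages being compared
BUDGET_SECONDS = 2 * 24 * 60 * 60   # upper limit of 2 days

# Cross the two factors to obtain the four scenarios.
scenarios = [
    {"n_missing_vars": k, "missing_share": p}
    for k, p in product(N_MISSING_VARS, MISSING_SHARES)
]

total_runs = len(scenarios) * N_CYCLES * len(METHODS)
seconds_per_run = BUDGET_SECONDS / total_runs  # assumes sequential runs

for s in scenarios:
    print(s)
print(total_runs, "runs,", seconds_per_run, "seconds available per run")
```

If a single imputation run takes longer than the per-run figure, the grid has to shrink or the runs have to be parallelised.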

Variables
- Remove Plan as a categorical variable: too many levels.
- Use only one number of missing values: 10 values.
- Take the presumed time sinks: factors with very many levels.
- Variables with very many levels eat a lot of time, because many combinations are tried out.
- Converting with as.factor -> not necessary. -> rf can do that itself. -> We do not need to do anything there.
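A rough way to see why factors with many levels are time sinks: if a tree tries every way of splitting an unordered factor with k levels into two non-empty groups, there are 2^(k-1) - 1 candidate splits. The sketch below only counts them. It assumes such an exhaustive search; actual implementations may use shortcuts, and we have not checked what mice or miceRanger do internally.

```python
def candidate_splits(n_levels):
    """Number of ways to split an unordered factor with n_levels levels
    into two non-empty groups: 2^(n_levels - 1) - 1."""
    if n_levels < 2:
        return 0
    return 2 ** (n_levels - 1) - 1

# The count roughly doubles with every additional level.
for k in (2, 5, 10, 20, 40):
    print(k, candidate_splits(k))
```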

On data storage: long or wide format does not matter either. -> Do not make a big issue of it.
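For reference, the difference between the two layouts can be sketched on a toy data set. This is plain Python, not our R code; the helper names, the index columns `.imp` and `.id`, the suffix scheme `b.1`, `b.2` and the toy values are illustrative assumptions.

```python
def to_long(data, imputations):
    """Long form: one completed copy of the whole data set per imputation."""
    long_rows = []
    for m, filled in enumerate(imputations, start=1):
        for i, row in enumerate(data):
            completed = {
                col: (val if val is not None else filled[col][i])
                for col, val in row.items()
            }
            long_rows.append({".imp": m, ".id": i, **completed})
    return long_rows

def to_wide(data, imputations):
    """Wide form: keep the data once, add one extra column per
    imputed variable and imputation."""
    wide_rows = [dict(row) for row in data]
    for m, filled in enumerate(imputations, start=1):
        for col, values in filled.items():
            for i, row in enumerate(data):
                wide_rows[i][f"{col}.{m}"] = (
                    row[col] if row[col] is not None else values[i]
                )
    return wide_rows

# Toy data: only column b has a missing value (row 1).
data = [
    {"a": 1, "b": 10.0, "c": "u"},
    {"a": 2, "b": None, "c": "v"},
    {"a": 3, "b": 30.0, "c": "w"},
]
# Two imputations; only the entry for the missing row is ever used.
imputations = [{"b": [None, 19.0, None]}, {"b": [None, 21.0, None]}]

long_form = to_long(data, imputations)
wide_form = to_wide(data, imputations)
print(len(long_form), "rows in long form,", len(wide_form), "rows in wide form")
```

The long form repeats the complete columns `a` and `c` in every copy, while the wide form stores them once.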

Rest: the code is well commented! That is fine as it is.
- One should be able to follow the code.
- We can keep CART in: it is on par with the other rf methods.
- As a conclusion: for predictions it is inferior, but otherwise on par.

Presentation

Presentation no longer than 22 minutes.
For our group: 1 pm on the second day.
Go down to 100 iterations and leave out the time-consuming predictors. -> Upper limit of 2 days. -> And no missing values in 34 variables.

Different specifications: which ones are included?
- Write the data sets and then run through them.
- 10 predictors with missing values
- Bring in different mechanisms
- Fixed framework: MCAR and MAR
- And then we vary the environment. How many variables may it use to implement it?
- Which factors play a role in making it faster?
- Which package has its strengths for which kind of variables?
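The idea of writing the data sets first and then running every method over them can be sketched as a small timing loop. This is plain Python, not our R code, and the two imputers are hypothetical stand-ins, not mice or miceRanger; only the pattern of the comparison is the point.

```python
import time

def time_call(fn, *args, repeats=3):
    """Best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical stand-ins for two imputation routines.
def impute_mean(column):
    """Fill every gap with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def impute_nearest(column):
    """Fill every gap with the observed value closest in position."""
    observed = [(i, v) for i, v in enumerate(column) if v is not None]
    return [
        v if v is not None else min(observed, key=lambda iv: abs(iv[0] - i))[1]
        for i, v in enumerate(column)
    ]

# Prepared data sets: same length, different shares of missing values.
datasets = {
    "10% missing": [None if i % 10 == 0 else float(i) for i in range(1000)],
    "50% missing": [None if i % 2 == 0 else float(i) for i in range(1000)],
}

timings = {
    (name, fn.__name__): time_call(fn, column)
    for name, column in datasets.items()
    for fn in (impute_mean, impute_nearest)
}
for key, seconds in sorted(timings.items()):
    print(key, f"{seconds:.4f}s")
```

Taking the best of several repeats reduces noise from other processes; for runs that take minutes, a single timed call per data set is probably enough.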

Categorical variables with different numbers of levels, and whether or not to include them.
- Missingness mechanism: controlled environment

Open points: what is meant by this?

Use different variables that have missing values, crossed with the share of missing values.

asluchych / tree-based-mi

Qs Doris #6