gravesti opened this issue 3 years ago · Status: Open
Would it be feasible to reuse switch_data.csv when testing? The data preparation step seems to work fine, but I run out of memory in the model fitting part. Would be nice to skip the data step while testing different configurations.
Sorry, I don't quite follow. What do you mean by testing different configurations?
Like different numbers of cores or different options for bigmemory. Do those things affect the model fitting?
Also I have different options on the HPC with memory/number of CPUs. Currently the process is getting killed because of out of memory.
Different numbers of cores only affect the running time. I have to check the different options of bigmemory, but I guess they only affect running time/memory. How much memory are you using? For the model fitting, parglm uses so much memory that I can't do anything about that.
On my first try on the cluster I had 64 GB. I will request more next time.
For bigmemory, I suppose the effect is not on how much memory the analysis uses, but how much memory it has access to, based on the system configuration. That was the first problem I had yesterday because there was not much shared memory available.
The case-control sampling seemed to work well though. :)
I think the data after case-control sampling are saved in the working directory ('temp.csv') so they can be reused for model fitting. It may be useful to add a separate function for the model fitting step in addition to the all-in-one function 'Initiators'.
I managed to create a dataset, but it was 200 GB and it took all night. It crashed once it started the model fitting. I'll try again with even more memory.
I think I'll add an option to skip the data expansion. It looks like I can just jump over this line: https://github.com/RoonakR/RandomisedTrialsEmulation/blob/5b603ed8d6c73edbe7769ee42ed9d9def86e2241/R/data_manipulation.R#L174
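Something like this, roughly (untested sketch; `prepare_data` and `skip_expansion` are made-up names, not the actual code at that line):

```r
# Hypothetical sketch: prepare_data() and skip_expansion are placeholders,
# not names from the package.
prepare_data <- function(skip_expansion = FALSE) {
  if (!skip_expansion) {
    # ... run the existing expansion code and write switch_data.csv ...
  }
  # reuse the file written by this or a previous run
  data.table::fread("switch_data.csv")
}
```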
Did you use case-control sampling? You can adjust the number of controls to further reduce the data size. The default is 5.
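Just to illustrate why fewer controls helps (a generic sketch of case-control down-sampling, not the package's code; it ignores any matching by trial/period, and the `outcome` column name is assumed):

```r
library(data.table)

# Generic illustration: keep all cases and sample at most k controls per case,
# so the sampled data has roughly (k + 1) * n_cases rows instead of the full
# expanded dataset.
sample_controls <- function(dt, k = 5) {
  cases    <- dt[outcome == 1]
  controls <- dt[outcome == 0][sample(.N, min(.N, k * nrow(cases)))]
  rbind(cases, controls)
}
```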
Do you mean to use the saved data if it's available instead of doing it again? I can add a variable so the user can ask to skip the data manipulation and only do the model fitting.
Yeah, if it's useful I can add this feature.
Thanks a lot. That would be very helpful.
So, I did a rough attempt at this (in the test_ig branch) and I could run the analysis with case control sampling!
I saw that you changed the type to integer, but I am wondering whether that would work for different datasets.
Yes, it might not be a good choice generally. From the documentation of bigmemory::big.matrix I understood that integers will only use 4 bytes instead of 8 for double. I need to test if this really does change the memory usage.
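A quick way to sanity-check that outside the package (just a sketch using bigmemory directly):

```r
library(bigmemory)

# Same dimensions, different storage types: "integer" cells take 4 bytes,
# "double" cells take 8, so the integer-backed matrix should need roughly
# half the (shared) memory.
n <- 1e6
m_int <- big.matrix(nrow = n, ncol = 10, type = "integer", init = 0L)
m_dbl <- big.matrix(nrow = n, ncol = 10, type = "double",  init = 0)

# Expected raw sizes in bytes (the actual segments add a small overhead):
4 * n * 10  # ~40 MB for the integer matrix
8 * n * 10  # ~80 MB for the double matrix
```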
Could you please let me know the main outcome of this issue and what exactly I have to modify? Thank you so much.
Concretely, I think the data preparation and the modelling parts of initiators should be put into their own functions. Then initiators could still call both of them, but the user would have the chance to run different parts of the code. This is especially useful if the data expansion is working but there are problems with the modelling.
Once #2 is fixed the dataset should be much smaller, so we can forget about doubles/integers.
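Roughly this shape, just to sketch the split (the argument lists are placeholders, not the real signatures):

```r
# Hypothetical sketch of the split -- argument names are placeholders.
data_preparation <- function(data, ...) {
  # data expansion + case-control sampling; writes the expanded data
  # (e.g. switch_data.csv / temp.csv) to the working directory
}

data_modelling <- function(...) {
  # reads the prepared data back in and fits the model (parglm etc.)
}

# The existing all-in-one function could then just call the two steps.
initiators <- function(data, ...) {
  data_preparation(data, ...)
  data_modelling(...)
}
```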
Sure, I will separate the preparation and modelling. For #2, I updated the code, so if you could let me know whether it's fixed or not, that would be appreciated.
I updated the code and added data_preparation and data_modelling functions. Please let me know which variables/parameters you think should be removed, added, or left unchanged. Thanks.
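The workflow this enables then looks roughly like the following (a sketch; the argument names `my_data` and `numCores` are placeholders, not confirmed parameters):

```r
# Hypothetical usage -- argument names are placeholders.

# Run the expensive data preparation step once...
data_preparation(data = my_data)

# ...then rerun only the model fitting with different settings,
# without re-expanding the data.
data_modelling(numCores = 4)
```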