RoonakR / RandomisedTrialsEmulation

https://gravesti.github.io/RandomisedTrialsEmulation

skip data preparation for testing? #1

Open gravesti opened 3 years ago

gravesti commented 3 years ago

Would it be feasible to reuse switch_data.csv when testing? The data preparation step seems to work fine, but I run out of memory in the model fitting part. It would be nice to skip the data step while testing different configurations.

RoonakR commented 3 years ago

> Would it be feasible to reuse switch_data.csv when testing? The data preparation step seems to work fine, but I run out of memory in the model fitting part. It would be nice to skip the data step while testing different configurations.

Sorry, I don't understand. What do you mean by testing different configurations?

gravesti commented 3 years ago

Like different numbers of cores or different options for bigmemory. Do those things affect the model fitting?

Also I have different options on the HPC with memory/number of CPUs. Currently the process is getting killed because of out of memory.

RoonakR commented 3 years ago

> Like different numbers of cores or different options for bigmemory. Do those things affect the model fitting?
>
> Also I have different options on the HPC with memory/number of CPUs. Currently the process is getting killed because of out of memory.

Different numbers of cores only affect the running time. I have to check the different options for bigmemory, but I'd guess they only affect running time/memory. How much memory are you using? For model fitting, parglm itself uses a lot of memory, so there isn't much I can do about that.

gravesti commented 3 years ago

On my first try on the cluster I had 64 GB. I will request more next time.

For bigmemory, I suppose the effect is not on how much memory the analysis uses, but how much memory it has access to, based on the system configuration. That was the first problem I had yesterday because there was not much shared memory available.

The case-control sampling seemed to work well though. :)

lisu-stats commented 3 years ago

I think the data after case-control sampling are saved in the working directory ('temp.csv') so they can be reused for model fitting. It may be useful to add a separate function for the model fitting step in addition to the all-in-one function 'Initiators'.
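To illustrate the suggestion, here is a minimal R sketch of what reusing the saved sample for model fitting could look like. The file name follows the comment above; the formula, column names, and thread count are placeholders, not the package's actual model specification:

```r
# Sketch only: re-fit the model from the sampled data saved by the data
# preparation step, skipping the expensive expansion.
# The formula and column names below are hypothetical.
library(parglm)

sampled <- read.csv("temp.csv")

fit <- parglm(
  outcome ~ assigned_treatment + followup_time,  # placeholder formula
  data    = sampled,
  family  = binomial(),
  control = parglm.control(nthreads = 4)         # parallel fitting
)
summary(fit)
```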

gravesti commented 3 years ago

I managed to create a dataset, but it was 200 GB and took all night. It crashed once the model fitting started. I'll try again with even more memory.

I think I'll add an option to skip the data expansion. It looks like I can just jump over this line: https://github.com/RoonakR/RandomisedTrialsEmulation/blob/5b603ed8d6c73edbe7769ee42ed9d9def86e2241/R/data_manipulation.R#L174

lisu-stats commented 3 years ago

Did you use case-control sampling? You can adjust the number of controls to further reduce the data size. The default is 5.

> I managed to create a dataset, but it was 200 GB and took all night. It crashed once the model fitting started. I'll try again with even more memory.
>
> I think I'll add an option to skip the data expansion. It looks like I can just jump over this line: https://github.com/RoonakR/RandomisedTrialsEmulation/blob/5b603ed8d6c73edbe7769ee42ed9d9def86e2241/R/data_manipulation.R#L174

RoonakR commented 3 years ago

> I managed to create a dataset, but it was 200 GB and took all night. It crashed once the model fitting started. I'll try again with even more memory.
>
> I think I'll add an option to skip the data expansion. It looks like I can just jump over this line: https://github.com/RoonakR/RandomisedTrialsEmulation/blob/5b603ed8d6c73edbe7769ee42ed9d9def86e2241/R/data_manipulation.R#L174

Do you mean to use the saved data when it's available instead of generating it again? I can add a parameter so the user can choose to skip the data manipulation and only run the model fitting.

RoonakR commented 3 years ago

> I think the data after case-control sampling are saved in the working directory ('temp.csv') so they can be reused for model fitting. It may be useful to add a separate function for the model fitting step in addition to the all-in-one function 'Initiators'.

Yeah, if it's useful I can add this feature.

lisu-stats commented 3 years ago

> I think the data after case-control sampling are saved in the working directory ('temp.csv') so they can be reused for model fitting. It may be useful to add a separate function for the model fitting step in addition to the all-in-one function 'Initiators'.
>
> Yeah, if it's useful I can add this feature.

Thanks a lot. That would be very helpful.

gravesti commented 3 years ago

So, I did a rough attempt at this (in the test_ig branch) and I could run the analysis with case control sampling!

RoonakR commented 3 years ago

> So, I did a rough attempt at this (in the test_ig branch) and I could run the analysis with case control sampling!

I saw that you changed the type to integer, but I am wondering whether that would hold for different datasets.

gravesti commented 3 years ago

> I saw that you changed the type to integer, but I am wondering whether that would hold for different datasets.

Yes, it might not be a good choice in general. From the documentation of bigmemory::big.matrix I understood that integers use only 4 bytes per element instead of 8 for doubles. I need to test whether this really changes the memory usage.
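A quick way to check the expected saving in base R (the ~2x factor is exact for raw storage, though copies made during fitting may dominate overall usage):

```r
# Integers use 4 bytes per element, doubles 8, so integer storage
# roughly halves the footprint of integer-valued columns.
print(object.size(integer(1e6)))  # about 4 MB
print(object.size(double(1e6)))   # about 8 MB

# The same choice exists for bigmemory-backed matrices:
library(bigmemory)
m_int <- big.matrix(1e6, 10, type = "integer")  # 4 bytes per element
m_dbl <- big.matrix(1e6, 10, type = "double")   # 8 bytes per element
```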

RoonakR commented 3 years ago

> I saw that you changed the type to integer, but I am wondering whether that would hold for different datasets.
>
> Yes, it might not be a good choice in general. From the documentation of bigmemory::big.matrix I understood that integers use only 4 bytes per element instead of 8 for doubles. I need to test whether this really changes the memory usage.

Could you please let me know the main outcome of this issue and what exactly I have to modify? Thank you so much.

gravesti commented 3 years ago

Concretely, I think the data preparation and the modelling parts of initiators should be put into their own functions. Then initiators could still call both of them, but the user would have the chance to run different parts of the code. This is especially useful when the data expansion works but there are problems with the modelling.

Once #2 is fixed the dataset should be much smaller, so we can forget about double/integers.
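A sketch of the proposed split (the two stage names match the functions later added to the package; the argument lists and bodies here are placeholders, not the real implementation):

```r
# Sketch: expose the two stages separately, while a wrapper preserves the
# original all-in-one behaviour. Argument lists are placeholders.
data_preparation <- function(data, ...) {
  # sequence-of-trials expansion + case-control sampling;
  # returns the sampled, expanded dataset
}

data_modelling <- function(prepared_data, ...) {
  # model fitting only (e.g. via parglm) on already-prepared data
}

initiators <- function(data, ...) {
  prepared <- data_preparation(data, ...)
  data_modelling(prepared, ...)
}
```

With this layout, a user whose expansion succeeds but whose fit runs out of memory can retry data_modelling alone with different settings, without repeating the overnight data step.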

RoonakR commented 3 years ago

> Concretely, I think the data preparation and the modelling parts of initiators should be put into their own functions. Then initiators could still call both of them, but the user would have the chance to run different parts of the code. This is especially useful when the data expansion works but there are problems with the modelling.
>
> Once #2 is fixed the dataset should be much smaller, so we can forget about double/integers.

Sure, I will separate the preparation and modelling. For #2, I updated the code; please let me know whether it's fixed.

RoonakR commented 3 years ago

> Concretely, I think the data preparation and the modelling parts of initiators should be put into their own functions. Then initiators could still call both of them, but the user would have the chance to run different parts of the code. This is especially useful when the data expansion works but there are problems with the modelling.
>
> Once #2 is fixed the dataset should be much smaller, so we can forget about double/integers.

I updated the code and added data_preparation and data_modelling functions. Please let me know which variables/parameters you think should be removed, added, or left unchanged. Thanks.