StuntsPT / Structure_threader

A wrapper program to parallelize and automate runs of "Structure", "fastStructure" and "MavericK".
GNU General Public License v3.0
24 stars 11 forks source link

Issue with seed_generator function when thread count is a multiple of replications #93

Closed heitor-mendonca closed 1 year ago

heitor-mendonca commented 1 year ago

Hello,

I encountered an issue when running structure_harvester on my output due to the standard deviation of the est Ln Pr(Data) being 0 for some K values. I observed that the seed used for all of the reps of some K values was identical, which seems unintended. I am not a programmer, but upon examining the seed_generator function in the source code, it appears that this behavior shouldn't occur.

The problem seems to occur when every rep of a given K fits perfectly into a single processing "chunk." For instance, if I run 12 reps with 24 threads, the program maps all reps from two K values onto 24 threads simultaneously. This might cause the seed_generator function to produce the same seed for each run, possibly because both random.seed and random.randrange use the same system parameter as an initial seed, resulting in the same seed for every tuple.

As a workaround, I can ensure the thread count (-t) isn't a multiple of the replication count (-R), so the program doesn't calculate seeds for every rep of a K simultaneously. However, this doesn't address the underlying issue of multiple replications running with the same seed, which reduces the effective replication size since some reps are essentially "clones."

I don't have formal programming training, but having been able to reproduce and temporarily resolve the issue, I thought it would be prudent to inform you. Can you please look into this issue and confirm if my observations are valid, or if I'm misunderstanding the situation?

Thank you for your attention to this matter.

heitor-mendonca commented 1 year ago

PS: I found that another developer has already addressed a similar issue using a patch. The repository containing the patch is located at https://github.com/jacqueslagnel/ParaStructure/. They address Structure's flawed seed generating process with the "ran.jl.patch" file. Unfortunately, I was unable to apply the patch to my build of Structure, and since structure_threader handles seed generation internally, I suspect it wouldn't resolve the issue either.

StuntsPT commented 1 year ago

Dear @heitor-mendonca , Thank you for your feedback. Apologies for the delayed response, somehow I did not get an email notification when the issue was raised. I can confirm the issue you mention, as well as the workaround. I am currently working on a fix. Expect more information soon(ish).

StuntsPT commented 1 year ago

Humm... the issue seems trickier than I first thought. Structure_threader is correctly providing different seed values for each run using the -D switch from STRUCUTRE. This can be highlighted in the logs with grep run in the results directory:

grep "arguments" *_f

It seems, however that STRUCTURE is ignoring this argument and generating it's own seed based on system time, in seconds... This is a bug in STRUCTURE, rather than Structure_threader but I will try to find a way to work around it. The patch from ParaStructure simply mitigates the issue by modifying structure to generate the seed from the wall clock, but based on miliseconds. Although this is just another workaround, compiling a new STRUCTURE binary with this patch might suffice to make things work in the corner case you pointed out. I will see what I can do.

heitor-mendonca commented 1 year ago

Thank you for looking into this issue and providing a detailed explanation. I appreciate your attention to this matter. Even without formal programming training, I can understand that this is a tricky problem to solve. I observed that the issue doesn't recur for other reps, likely because they're not assigned to threads simultaneously. I wonder if introducing a delay between each thread's initialization at the beginning of the run might help to resolve the issue. However, I'm not sure how to implement such a solution, and I understand that it might not be the most elegant approach.

Regardless, I'm glad that you're working on a potential workaround, and I'm eager to see the results. Thank you again for your efforts in addressing this issue.

StuntsPT commented 1 year ago

Found it! STRUCTURE will happily ignore the -D switch (to set the seed from the command line) if the parameter "RANDOMIZE" which gets read from the extraparams file is set to 1. Unfortunately, the default value of this param is 1 in the extraparams file. Even worse, if no extraparams file is read, the default value for this parameter is also 1 internally (which makes no sense, as it should automatically be set to 0 if the -D switch is passed, but I digress). The solution is to use an extraparams file, and set the parameter "RANDOMIZE" to 0. I will consider patching the STRUCTURE binary to default to 0, but this is not a decision I wish to take alone, and might take a while to take action on it (or not). Once again, thank you for pointing this out, and at least now you have the "official" workaround. =-) Please let me know if it works for you too.

heitor-mendonca commented 1 year ago

I really appreciate your thorough investigation and the effort you've put into finding a solution for this issue. It's great to hear that you've identified the cause of the problem and suggested a workaround by using the extraparams file.

I will give it a try and see if it resolves the issue for my use case. I'll let you know if it works for me as well.

Once again, thank you for addressing this issue, and I look forward to any future developments regarding a potential patch to the STRUCTURE binary.

heitor-mendonca commented 1 year ago

I tried the fix and completely forgot to give you feedback, sorry about that.

It seems to be working as intended, thanks again for the help!