bodkan / slendr

Population genetic simulations in R 🌍
https://bodkan.net/slendr
Other
54 stars 5 forks source link

Add support for SLiM simulations on Windows #149

Closed bodkan closed 8 months ago

bodkan commented 9 months ago

This is inspired by the changes suggested by @GKresearch here.

I manageds to setup Windows 11 in a virtual machine on my Mac. It's not a great development experience but it's something!

The goal is to check if we can support the default means of installation described in Section 2.3.1 of the SLiM manual. Conda-installed SLiM should work as well, except that doesn't support SLiMgui which makes it not the best default target. That said, Conda users will be able specify the path to their slim binary directly via the slim_path= argument of the slim() function.

Speaking about paths -- I still haven't figured out how path management is supposed to work in Windows shell. In fact, it appears that the SLiM installation on Windows described in the SLiM manual is using a non-native non-Windows shell? That's why I'm considering making specification of path to a SLiM binary to be used (be it CLI SLiM or SLiMgui) mandatory on Windows.

Hey @GKresearch, if you're still interested in helping, it would be great if you could test the new version of slendr with Windows support on your computer, after I manage to put everything together.

Thanks for pushing me to do this and sorry again for the huge delay!

bhaller commented 9 months ago

In fact, it appears that the SLiM installation on Windows described in the SLiM manual is using a non-native non-Windows shell?

The main way for people to install on Windows nowadays, I think, is the pacman installer, I think? I don't know anything about Windows, but I think that installation route is pretty standard there...?

bodkan commented 9 months ago

The main way for people to install on Windows nowadays, I think, is the pacman installer, I think? I don't know anything about Windows, but I think that installation route is pretty standard there...?

Yeah, that's what I understood as well (looks to me that pacman is something like Homebrew on macOS but Β―\(ツ)/Β― I haven't used Windows in 15 years).

What I meant is that I it looked to me that CLI programs installed via pacman appear to be using some sort of bash ported for Windows but not the actual Windows "Command Prompt" (I think that's how it used to be called) or the "newer" Powershell. This is what the colored terminal screenshots in SLiM manual are from.

So this wasn't actually about the installation itself but about running those programs. In any case, it's all good -- this wasn't meant to be a criticism of any kind. Or it wasn't meant to express surprise or anything. Again, I don't use Windows, and have no prior expectations about any of this. :)

bodkan commented 9 months ago

Ah, hold on. I can run slim.exe through the good old Windows System Prompt that I remember from my teenage years. :) What threw me off is that the shell spawned by msys2 is a bash shell (not the Windows prompt), and same was true for git, etc. So I assumed something more special is going on with the compilation of programs ported from unix to Windows.

Also, when I said

In fact, it appears that the SLiM installation on Windows described in the SLiM manual is using a non-native non-Windows shell?

I actually meant pacman / msys2 compiled programs in general. But the above reads as I'm talking specifically about SLiM, which wasn't my intention, sorry.

bhaller commented 9 months ago

No worries, just checking that you were using pacman not some other thing like the WSL installation procedure, which (I think?) is more fringe.

GKresearch commented 9 months ago

Hi @bodkan,

Thank you for looking into this! Yeah, I'm happy to try running the new version when it is ready!

bodkan commented 8 months ago

Wow. This is the longest I've worked on a PR, and I don't think I've faced a more frustrating problem in slendr's history. :D The Windows/shell/R world is like a completely different planet. Things are sort of similar but not really the same. (I'm looking at you, Windows file paths with forward slashes, backward slashes, double backward slashes, quadruple backward slashes, ugh.)


Status update -- I think after almost a week of fighting with this (setting up everything on GitHub Actions was SUCH A PAIN...), I can declare partial victory:

Error in py_get_attr_impl(x, name, silent) : 
    SystemError: <functools._lru_cache_wrapper object at 0x000001DCD7EF2090> returned
    a result with an exception set

What's puzzling is that this only happens sometimes. On some SLiM tree sequences (not msprime tree sequences), there's no error and they are loaded from disk correctly. In other cases, this error happens. This smells like a weird memory-issue to me which is consistent with this being a LRU cache error which must be happening waaaay down on some low level.

I will now focus on generating some SLiM tree sequences without slendr and loading them on Windows with pure Python, just to confirm the problem is general, and not slendr or R related. It shouldn't be -- after all, slendr's slim() function just runs a SLiM script and saves a .trees file, and ts_load() then calls tskit.load() in Python under the hood -- but I need to narrow down where the problem is. Maybe it's a reticulate-related error? This could potentially also explain a memory error. Not sure just yet. Β―\(ツ)/Β―

[EDIT: Wait, it shouldn't be a reticulate error because this doesn't happen while loading msprime-generated tree sequences from disk.]

@bhaller: Do you use anything for running Windows in a virtual machine on your Mac? Do you have a Windows laptop lying around somewhere? Or do you perhaps rely on GitHub Actions Windows servers for testing? I'm asking whether I should try another way of getting Windows to run and if you have any tips on trying a different means of getting SLiMgui.exe to run (I use UTM which makes it possible to virtualize Windows for free but I'm not an expert on this).


@GKresearch, if you're interested in helping out, could you please install this WIP version of slendr and try it out? Perhaps running some SLiM-based spatial vignettes. You could get it by devtools::install_github("bodkan/slendr", ref = "slim-on-windows"). You mentioned that you've been playing around with slendr for your work, so perhaps even just testing whether you can still do the things you've been doing on this version of slendr that you could do with your own edited version would be great, nothing more is needed now. If things break, you can always go back to loading your own edited version for the time being.

I think you mentioned you have SLiM from conda, which excludes the possibility of running SLiMgui but that's fine for now. I can't seem to be able to run it either. You could either run things via the slim(..., slim_path = <path to slim.exe>) or, there's a new possibility by setting the path by editing the ~/.Renviron file like it's done on Linux/macOS. One easy way to do this is to open you R console, call Sys.getenv("PATH") to get your current default $PATH contents, and then create a file under C:\Users\<username>\Documents\.Renviron with something like this:

PATH="<the contents of the Sys.getenv("PATH") above>;<directory where the slim binary can be found>"

For instance, my virtualized Windows R gives me this (a lot of noisy garbage with C:\\msys64\\mingw64\\bin in the middle, where the slim.exe binary is sitting):

> Sys.getenv("PATH")
[1] "C:\\rtools43\\x86_64-w64-mingw32.static.posix\\bin;C:\\rtools43\\usr\\bin;
C:\\\\rtools43\\\\x86_64-w64-mingw32.static.posix\\\\bin;C:\\\\rtools43\\\\usr\\\\bin;
C:\\\\Program Files (x86)\\\\R\\\\R-4.3.2\\\\bin\\\\x64;C:\\\\Windows\\\\system32;C:\\\\Windows;
C:\\\\Windows\\\\System32\\\\Wbem;C:\\\\Windows\\\\System32\\\\WindowsPowerShell\\\\v1.0\\\\;
C:\\\\Windows\\\\System32\\\\OpenSSH\\\\;C:\\\\Program Files\\\\Git\\\\cmd;C:\\\\Users\\\\mp\\\\AppData\\\\Local\\\\Microsoft\\\\WindowsApps;
C:\\\\Program Files\\\\RStudio\\\\resources\\\\app\\\\bin\\\\quarto\\\\bin;
C:\\\\Program Files\\\\RStudio\\\\resources\\\\app\\\\bin\\\\postback;C:\\msys64\\mingw64\\bin;
C:\\Program Files\\RStudio\\resources\\app\\bin\\quarto\\bin;C:\\Program Files\\RStudio\\resources\\app\\bin\\postback"

If everything goes well, when you next start R and run Sys.which("slim") or Sys.which("slim.exe") (I don't know what conda installs), you should get the full path. The slim() function should then be able to find SLiM automatically and run simulations without you having to specify slim_path= manually.

Many thanks for your help!



Unit testing log so far (I disabled unit tests depending on SLiM or loading SLiM-made tree sequences to make sure everything else works first):

==> devtools::test()

β„Ή Testing slendr
βœ” | F W  S  OK | Context
βœ” |          4 | abstract-vs-real-landscapes                           
βœ” |      2  22 | compilation [36.8s]                                   
βœ” |         18 | coordinate-conversions [1.4s]                         
βœ” |          9 | dynamics [7.5s]                                       
βœ” |      1  12 | engines [12.8s]                                       
βœ” |      1  15 | geneflow [11.7s]                                      
βœ” |         25 | geometrical-operations [1.3s]
βœ” |      4  20 | ibd-squashing [13.4s]                                 
βœ” |         23 | ibd [385.6s]                                          
βœ” |      1   0 | interaction-changes                                   
βœ” |          2 | interactive-elements                                  
βœ” |         43 | manual-ts [3.6s]                                      
βœ” |      1   0 | msprime-geneflow                                      
βœ” |          8 | msprime-metadata [9.4s]                               
βœ” |         11 | msprime [456.8s]                                      
βœ” |          6 | naturalearth [5.2s]                                   
βœ” |      1   0 | nonserialized [2.4s]                                  
βœ” |      3  56 | population [17.8s]                                    
βœ” |          7 | pure-msprime-vs-slendr [16.4s]                        
βœ” |      1   0 | pure-slim-vs-slendr [2.2s]                            
βœ” |      1   0 | resizes-locked [2.2s]                                 
βœ” |      1   0 | resizes [2.2s]                                        
βœ” |      3   4 | runners [11.9s]                                       
βœ” |      4  27 | sampling [14.8s]                                      
βœ” |         17 | serialization [59.1s]                                 
βœ” |      1   0 | simulation-runs [2.2s]                                
βœ” |      2  44 | time-direction [9.0s]                                 
βœ” |      1   0 | trees [2.2s]                                          
βœ” |      1   0 | ts-ancestors-descendants [2.2s]                       
βœ” |      1   0 | ts-pure-nonspatial [2.3s]                             
βœ” |      1   0 | ts-pure-spatial [2.3s]                                
βœ” |      1   0 | ts [2.4s]                                             
βœ” |          9 | tspop [20.2s]                                         
βœ” |          7 | utililites                                            

══ Results ════════════════════════════════════════════════════════════

[ FAIL 0 | WARN 0 | SKIP 32 | PASS 389 ]
bhaller commented 8 months ago

Hi @bodkan. I've been watching your commits, since I get emailed automatically, and have been amazed by the mighty battle you have been fighting. Reminiscent of Gandalf and the Balrog.

Unfortunately, I've been running into very strange, unpredictable issues when loading slendr slim()-generated tree sequences on Windows. As far as I can tell, the errors are not coming from slendr, they are not even coming from SLiM. They happen during loading of SLiM-produced (but not msprime-produced, curiously -- see the previous item) tree sequences at what appears to be a very very low level of the tskit/pyslim (not sure which yet) Python layer. Specifically, for the same simulation I sometimes (well, quite often, but not always), get this puzzling error:

Note that there is a very large memory leak associated with loading tree sequences in SLiM 4.1; see the announcement of this bug here: https://groups.google.com/g/slim-discuss/c/bJP5d1dx848. It has been fixed, so the GitHub head version of SLiM should be good. SLiM 4.2 will be released as soon as I'm back from Mexico – about a month and a half, maybe a bit less. I don't know whether this is related to what you have been seeing or not.

@bhaller: Do you use anything for running Windows in a virtual machine on your Mac? Do you have a Windows laptop lying around somewhere? Or do you perhaps rely on GitHub Actions Windows servers for testing? I'm asking whether I should try another way of getting Windows to run and if you have any tips on trying a different means of getting SLiMgui.exe to run (I use UTM which makes it possible to virtualize Windows for free but I'm not an expert on this).

Sorry, I'm no help with this at all. I do not run Windows, either virtually or otherwise; I have no Windows machine. I rely on the SLiM community, almost entirely @rdinnager, to handle SLiM's Windows stuff, and GitHub Actions CI tells me if there is a problem with the Windows build. Good luck!

rdinnager commented 8 months ago

Hi @bodkan, as @bhaller mentioned, I am generally responsible for keeping the Windows version of SLiM working, and I am happy to try and help with slendr on Windows if I can. I have started by running the ''Spatially annotated tree sequences" vignette on my local Windows machine (after installing via devtools::install_github("bodkan/slendr", ref = "slim-on-windows")). The good news is that slendr worked fine right out of the box, seemingly picking up my already installed SLiM binary automatically (I didn't have to set slim_path). I have run the vignette twice now and have not seen any errors yet, though the vignette is quite slow to run so it will take awhile to test an appreciable number of reruns. Do you have any suggestions for other code to test? Perhaps something a bit faster or where you have explicitly seen this error? I can try and help debug in Python if I can reproduce the error locally.

Are you getting this issue only on your Windows virtual machine, or also on the github actions setup? Perhaps it is something specific to the virtual machine? SLiMGui runs fine on my Windows machine, so if there is anything you'd like me to test with that, let me know.

On a side note, I have been meaning to connect with you at some point about slendr because I am also the primary developer and maintainer of slimr (https://github.com/rdinnager/slimr), another R package that uses SLiM under the hood, but which I think has a fairly different goal from slendr. BTW, I think slendr looks like a really nice package an is very cool! Anyway, I feel like there could be some useful ways we could combine our efforts in future in this SLiM <-> R 'space' so to speak, perhaps. That is somewhat vague, but I thought it would be good to introduce myself and float the idea of a possible conversation in the near-ish future, since I am here ;).

codecov[bot] commented 8 months ago

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (0bc83ab) 89.95% compared to head (89f5a1f) 89.87%.

Files Patch % Lines
R/engines.R 88.00% 3 Missing :warning:
R/utils.R 62.50% 3 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #149 +/- ## ========================================== - Coverage 89.95% 89.87% -0.09% ========================================== Files 8 8 Lines 3037 3042 +5 ========================================== + Hits 2732 2734 +2 - Misses 305 308 +3 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

bodkan commented 8 months ago

Hi @bodkan. I've been watching your commits, since I get emailed automatically, and have been amazed by the mighty battle you have been fighting. Reminiscent of Gandalf and the Balrog.

Haha, oh man, that's amazing (I'm a massive Tolkien nerd, so this comparison made me strangely proud in a really funny way). I'm afraid this is the one situation where being obsessive perfectionist can be quite destructive. :) Thank god I had fun things to do/read on the side as I've been battling this. I need to take a break now though and pick it up in a few days again. I did learn useful things along the way... I'm astonished by the black magic the R devs must be doing to make everything run on Windows.

Note that there is a very large memory leak associated with loading tree sequences in SLiM 4.1; see the announcement of this bug here: https://groups.google.com/g/slim-discuss/c/bJP5d1dx848. It has been fixed, so the GitHub head version of SLiM should be good. SLiM 4.2 will be released as soon as I'm back from Mexico – about a month and a half, maybe a bit less. I don't know whether this is related to what you have been seeing or not.

Oh, interesting. I saw the initial post on the mailing list but I missed your follow up update. Somehow I don't think this is what's going on though. The simulations are not crashing. The exception is thrown when a tree sequence is loaded into Python, after SLiM finishes running (I suspect successfully, although I haven't had a chance to dig into this yet).

I wonder if this could be some low level issue caused by the Windows virtualization. Because the same exact error happens on both Windows virtualized on my Mac, and also on the Windows machines that GitHub uses. Or their own virtualized machines (I actually don't know if G.A. runs some containers or true VMs or if they have physical servers with different OS).

In any case, thank you for reminding me of the SLiM memory leak, I'll keep it in mind. Thank you!

bodkan commented 8 months ago

Hello @rdinnager! Nice to meet you. Big fan of your work on slimr! Your package is a perfect example of why I love R: I love metaprogramming, designing custom DSLs, that sort of thing. Can't imagine the magic you had to do to make slimr happen, awesome stuff.

Also, wow -- thank you so much for jumping to help with the Windows issues. I really appreciate it! You're right, the vignettes are very slow to fully render, as they are supposed to contain "realistic" examples, and a lot of them. That said, I'm really glad that the one vignette that you tried running using code from this branch works OK. That particular vignette uses everything that (spatial) slendr and its tskit interface have to offer -- that it worked for you is amazing news, suggesting that I'm on the right track here.

Regarding further tests you could do, I think that a more efficient use of your time -- if you really had time to help with this more -- would be to clone the slendr repo locally on Windows, switch to the slim-on-windows branch and try running devtools::test() to check the unit tests on native, non-VM Windows. These tests still take a while, but they include dozens of different smaller simulations, so the chance for a crash (as I see it in my VM Windows) is much more frequent.

However, I don't think it makes sense for you to spend your time on this before I reconnect the SLiM bits that I temporarily removed in the unit tests when I was checking if the pure tskit and msprime related tests work (and that it's really the SLiM simulations that are causing issues). It might be that if you cloned the repo in the current state, you'd be getting errors which have nothing with the weird tskit-based LRU cache errors, but that I haven't reconnected all the bits and pieces properly.

Once I do this, perhaps I could select one particular unit test that causes those occasional crashes and ask you to run this repeatedly. Unfortunately (well, fortunately for my well being :D) I have to run now but I'll jump on this again after the weekend. I'll ping you once I think the tests are ready for testing on your computer.

Thank you very very much!

Are you getting this issue only on your Windows virtual machine, or also on the github actions setup?

Interestingly, the same exact error happens on both Windows virtualized on my Mac, and also on the Windows machines that GitHub uses. Or their own virtualized machines (I actually don't know if G.A. runs some containers or true VMs or if they have physical servers with different OS). So one question I've been actually wondering about is whether this could be somehow related to things being ran in a virtualized way... like, a strange binary/architecture incompatibility at the Python level. Then again, it seems that it's not like a particular run crashes every time. Sometime a SLiM tree sequence loads OK, sometime the process crashes. I'll need to do more investigation because so far I haven't been able to figure out a seed that always causes a crash -- and who knows, if it's more of a memory issue, maybe there isn't a seed I can even find.


On a side note, I have been meaning to connect with you at some point about slendr because I am also the primary developer and maintainer of slimr (https://github.com/rdinnager/slimr), another R package that uses SLiM under the hood, but which I think has a fairly different goal from slendr. BTW, I think slendr looks like a really nice package an is very cool! Anyway, I feel like there could be some useful ways we could combine our efforts in future in this SLiM <-> R 'space' so to speak, perhaps. That is somewhat vague, but I thought it would be good to introduce myself and float the idea of a possible conversation in the near-ish future, since I am here ;).

Yes! As I said, I'm a big fan of your work on slimr. Unlike you, I took a very different path -- I see the (spatial) functionality of slendr almost as an "app" implemented in SLiM rather than an interface to SLiM -- but I definitely agree we should chat at some point in the future. At some point we decided to focus on fleshing out the features of slendr within the "niche" and limits we explicitly set out for it, rather than expanding towards support for more and more SLiM features in future releases... but that doesn't mean people wouldn't be interested in using additional functionality of SLiM inside slendr, which I don't/won't explicitly support. I did wonder already some time ago that slimr might be useful for this, but haven't thought about it too deeply. I will take a look at what's new in slimr and get in touch later. :)


Totally unrelated fun fact: YEARS ago as a PhD student I needed to automate huge SLiM simulations (I think it was relatively shortly after SLiM 2 came out)... and I implemented an R package called... slimr! :D Never released, and never will be: https://github.com/bodkan/slimr. As far as I remember, the main point was data analysis of SLiM output files (mutations, individuals, genomes, etc.), not actually running SLiM code.

I should update the GitHub page stating that this has nothing to do with your slimr, that my stuff shouldn't be used by anyone at all, and that at this point it's nothing more than a historical curiosity and experiment. :)

rdinnager commented 8 months ago

Hello @rdinnager! Nice to meet you. Big fan of your work on slimr! Your package is a perfect example of why I love R: I love metaprogramming, designing custom DSLs, that sort of thing. Can't imagine the magic you had to do to make slimr happen, awesome stuff.

Thanks for the kind words about slimr. I really appreciate them! I agree, I love this aspect of R, and so writing slimr was a lot of fun (and I probably spent way too much time on it).

Also, wow -- thank you so much for jumping to help with the Windows issues. I really appreciate it! You're right, the vignettes are very slow to fully render, as they are supposed to contain "realistic" examples, and a lot of them. That said, I'm really glad that the one vignette that you tried running using code from this branch works OK. That particular vignette uses everything that (spatial) slendr and its tskit interface have to offer -- that it worked for you is amazing news, suggesting that I'm on the right track here.

Regarding further tests you could do, I think that a more efficient use of your time -- if you really had time to help with this more -- would be to clone the slendr repo locally on Windows, switch to the slim-on-windows branch and try running devtools::test() to check the unit tests on native, non-VM Windows. These tests still take a while, but they include dozens of different smaller simulations, so the chance for a crash (as I see it in my VM Windows) is much more frequent.

However, I don't think it makes sense for you to spend your time on this before I reconnect the SLiM bits that I temporarily removed in the unit tests when I was checking if the pure tskit and msprime related tests work (and that it's really the SLiM simulations that are causing issues). It might be that if you cloned the repo in the current state, you'd be getting errors which have nothing with the weird tskit-based LRU cache errors, but that I haven't reconnected all the bits and pieces properly.

Once I do this, perhaps I could select one particular unit test that causes those occasional crashes and ask you to run this repeatedly. Unfortunately (well, fortunately for my well being :D) I have to run now but I'll jump on this again after the weekend. I'll ping you once I think the tests are ready for testing on your computer.

Thank you very very much!

No problem! Just let me know when the repo is ready and I'm happy to do some local tests on my machine.

Are you getting this issue only on your Windows virtual machine, or also on the github actions setup?

Interestingly, the same exact error happens on both Windows virtualized on my Mac, and also on the Windows machines that GitHub uses. Or their own virtualized machines (I actually don't know if G.A. runs some containers or true VMs or if they have physical servers with different OS). So one question I've been actually wondering about is whether this could be somehow related to things being ran in a virtualized way... like, a strange binary/architecture incompatibility at the Python level. Then again, it seems that it's not like a particular run crashes every time. Sometime a SLiM tree sequence loads OK, sometime the process crashes. I'll need to do more investigation because so far I haven't been able to figure out a seed that always causes a crash -- and who knows, if it's more of a memory issue, maybe there isn't a seed I can even find.

Yeah, sounds strange. I believe G.A. does use virtual machines for their runners, but I don't know much about the details...

On a side note, I have been meaning to connect with you at some point about slendr because I am also the primary developer and maintainer of slimr (https://github.com/rdinnager/slimr), another R package that uses SLiM under the hood, but which I think has a fairly different goal from slendr. BTW, I think slendr looks like a really nice package an is very cool! Anyway, I feel like there could be some useful ways we could combine our efforts in future in this SLiM <-> R 'space' so to speak, perhaps. That is somewhat vague, but I thought it would be good to introduce myself and float the idea of a possible conversation in the near-ish future, since I am here ;).

Yes! As I said, I'm a big fan of your work on slimr. Unlike you, I took a very different path -- I see the (spatial) functionality of slendr almost as an "app" implemented in SLiM rather than an interface to SLiM -- but I definitely agree we should chat at some point in the future. At some point we decided to focus on fleshing out the features of slendr within the "niche" and limits we explicitly set out for it, rather than expanding towards support for more and more SLiM features in future releases... but that doesn't mean people wouldn't be interested in using additional functionality of SLiM inside slendr, which I don't/won't explicitly support. I did wonder already some time ago that slimr might be useful for this, but haven't thought about it too deeply. I will take a look at what's new in slimr and get in touch later. :)

Yes, at the very least I plan to point slimr users to slendr for its tree-sequence processing ability, for anyone who wants to use tree sequences in slimr. Planning to start working on a vignette for that soon.

Totally unrelated fun fact: YEARS ago as a PhD student I needed to automate huge SLiM simulations (I think it was relatively shortly after SLiM 2 came out)... and I implemented an R package called... slimr! :D Never released, and never will be: https://github.com/bodkan/slimr. As far as I remember, the main point was data analysis of SLiM output files (mutations, individuals, genomes, etc.), not actually running SLiM code.

I should update the GitHub page stating that this has nothing to do with your slimr, that my stuff shouldn't be used by anyone at all, and that at this point it's nothing more than a historical curiosity and experiment. :)

I think I did see your slimr way back when, well after I had already decided on slimr myself (I've been working on the other slimr for a surprisingly long time). Apologies that I nicked the name. I thought I had used available::available() at the time, so I don't know why it didn't come up as taken (or perhaps I am remembering wrong). The same thing happened to me recently created a package called tidysdm (https://github.com/rdinnager/tidysdm) to do species distribution modeling using the tidymodels framework. Just a few months ago a package with the same name came out from another group: https://cran.r-project.org/web/packages/tidysdm/index.html (much more comprehensive and polished than mine). I wrote the package for a course I was teaching, and it wasn't really meant to be released, so I didn't mind too much, but it did cause some confusion for my students last semester! I should probably also put a disclaimer on my tidysdm that it might not be what people are looking for.

bodkan commented 8 months ago

Hello @rdinnager,

I think I managed to reconnect all the internal bits and the slim-on-windows branch is in a state where all the tests pass except those that involve running slendr models through SLiM and loading a tree sequence out of those simulations.

Curiously, the error is always the same, regardless of the context / type of model / etc (note that this is a Python error, suggesting the crash happens someplace during loading / accessing simulated tree-sequence file, not SLiM itself):

<python.builtin.SystemError/python.builtin.Exception/python.builtin.BaseException/python.builtin.object/error/condition>
Error in `py_get_attr_impl(x, name, silent)`:
SystemError: <functools._lru_cache_wrapper object at 0x000001C43361E610> returned a result with an exception set

And the backtrace:

Backtrace:
< ... skipping internal testthat trace which is not important here ... >
  4. └─slendr::slim(model, sequence_length = 100, recombination_rate = 0)
  5.   └─slendr::ts_load(model, file = output) at slendr/R/engines.R:401:5
  6.     β”œβ”€ts$metadata at slendr/R/tree-sequences.R:74:3
  7.     └─reticulate:::`$.python.builtin.object`(ts, "metadata") at slendr/R/tree-sequences.R:74:3
  8.       └─reticulate:::py_get_attr_or_item(x, name, TRUE)
  9.         └─reticulate::py_get_attr(x, name)
 10.           └─reticulate:::py_get_attr_impl(x, name, silent)

The trace suggests that it always happens when the Python metadata member (class variable? not sure how Python calls this) of a ts Python tree-sequence object is accessed. The slendr/R/tree-sequences.R:74:3 points to R code ts$metadata which is what's executed as ts.metadata at a Python level through reticulate. So this isn't even something slendr-related per se, I would think. And it doesn't seem to be deterministic (some tests which crash at one point don't seem to be crashing later).

If you have a bit of time to help @rdinnager:

Could you please clone the latest slim-on-windows branch of slendr locally (git clone https://github.com/bodkan/slendr/ && git checkout slim-on-windows), open R in the repo directory, run devtools::test() and report back what errors do you get? I hope I fixed everything else except this one type of crash error, so ideally you shoud

a) not get any errors at all (which would suggest an issue related to virtualization/architecture, perhaps); b) only get the weird LRU cache errors involving the ts.metadata access in Python.

If it's a) then I'll just assume it's related to the virtualization/containers/something and merge this PR... because this has already taken up way more time that I should be allowed to put in it. πŸ˜¬πŸ˜…

If it's b) then it must be some error at a Python level... and I'll be tempted to merge the PR anyway :D because it already contains valuable commits for the rest of the non-Windows msprime/tskit part of the codebase so the code is overall much more robust... and I'll just open a new issue purely focused on the Python error above.

Many, many thanks for your help!

(I'll reply to the other slimr-related topics over email!)


What I'm also going to try sometimes this week is to isolate this with pure SLiM and Python code. Maybe run a tiny SLiM simulation many times and load it with pure Python many times, and see if it also breaks sometimes? This might take a while because I have no idea how to use Windows for coding. :) RStudio in the Windows VM is extremely slow but at least it's something.

bhaller commented 8 months ago

Hey @bodkan, do you think it makes sense to ping Ben Jeffery (tskit) about this? Or do you want to do more legwork first? It might ring a bell for him, you never know.

bodkan commented 8 months ago

Hey @bodkan, do you think it makes sense to ping Ben Jeffery (tskit) about this? Or do you want to do more legwork first? It might ring a bell for him, you never know.

I was thinking about doing that but I think I'll first wait for what @rdinnager finds out on his non-VM'd Windows setup.

I'm generally very hesitant pinging tskit folks on slendr issues because of the slendr -> reticulate -> R -> Python -> tskit indirection... and I feel like this is even worse because it involves running things on a virtualized system and not natively (yet). πŸ™„

But yes, if @rdinnager finds the same error even on his Windows system (ideally even without involving slendr) then I would probably involve tskit people directly, because this would indicate a lower level problem.

Thanks!

GKresearch commented 8 months ago

Hi @bodkan,

It seems I've missed quite a bit, but I am installing the current version of slendr now and will let you know how it plays out.

bodkan commented 8 months ago

Hello @GKresearch,

what you missed are walls of text formed entirely by my very deep confusion about what's going on.

If you would like to help with testing (thank you!), here's the only relevant bit from my last message. Ignore everything else.

Could you please clone the latest slim-on-windows branch of slendr locally (git clone https://github.com/bodkan/slendr/ && git checkout slim-on-windows), open R in the repo directory, run devtools::test() and report back what errors do you get? I hope I fixed everything else except this one type of crash error, so ideally you shoud

a) not get any errors at all (which would suggest an issue related to virtualization/architecture, perhaps); b) only get the weird LRU cache errors involving the ts.metadata access in Python.

If it's a) then I'll just assume it's related to the virtualization/containers/something and merge this PR... because this has already taken up way more time that I should be allowed to put in it. πŸ˜¬πŸ˜…

If it's b) then it must be some error at a Python level... and I'll be tempted to merge the PR anyway :D because it already contains valuable commits for the rest of the non-Windows msprime/tskit part of the codebase so the code is overall much more robust... and I'll just open a new issue purely focused on the Python error above.

Many, many thanks for your help!

GKresearch commented 8 months ago

@bodkan,

Will do, I'll let you know what I get!

GKresearch commented 8 months ago

Got a bunch of errors associated with sf and rnaturalearth, I opened in a new project without the library, I'll retry with them installed.

GKresearch commented 8 months ago

No errors, I'll try running the main vignette and let you know what happens

bodkan commented 8 months ago

Ah, yes. Sorry. I assumed since you said you played around with the slendr tutorials that you have sf and rnaturalearth present. You will need the stars as well. All three are required for spatial slim() simulations, and there are many unit tests which run those.

bodkan commented 8 months ago

No errors, I'll try running the main vignette and let you know what happens

Wait. Hold on. Did you run the full devtools::test() and got no errors? Would you be so kind and paste the whole output of that command? I.e., the output that begins with this:

==> devtools::test()

β„Ή Testing slendr
βœ” | F W  S  OK | Context
<...>

All the way to the final line which looks like this:

[ FAIL ... | WARN ... | SKIP ... | PASS ... ]
GKresearch commented 8 months ago

I do have them installed. I cloned the repository into a new project in R, and forgot to account for the fact that the dependencies didn't transfer.

GKresearch commented 8 months ago

Yeah, looks like it still couldn't find slim, now that I'm looking at it closer, I'm wondering if the "git checkout slim-on-windows" command didn't work for me, though I'm not sure why.

β„Ή Testing slendr The 'slim.exe' binary could not be found in your $PATH. Most of the functionality of slendr will work without any issues but you will not be able to simulate data with the slim() function.

If you want to run SLiM simulations, make sure to modify the $PATH variable so that it points to the directory containing the slim command-line program. One easy way to do this is to add this:

PATH="C:\path\to\directory\with\slim.exe\binary"

to your C:\Users\\Documents.Renviron file.

Alternatively, use the slim_path argument of the slim() function.

A slendr Python (3.12) environment with the necessary versions of msprime (1.3.0), tskit (0.5.6), pyslim (1.0.4), and tspop (0.0.2) has not been found.

You can setup a pre-configured environment with all of slendr's Python dependencies automatically by running the function setup_env(). βœ” | F W S OK | Context βœ” | 4 | abstract-vs-real-landscapes
βœ” | 2 11 | compilation [8.2s]
βœ” | 18 | coordinate-conversions
βœ” | 9 | dynamics [3.7s]
βœ” | 1 0 | engines [2.8s]
βœ” | 1 8 | geneflow [5.0s]
βœ” | 25 | geometrical-operations [1.4s]
βœ” | 1 0 | ibd-squashing [2.8s]
βœ” | 1 0 | ibd [2.6s]
βœ” | 1 15 | interaction-changes [3.6s]
βœ” | 2 | interactive-elements
βœ” | 1 0 | manual-ts [2.6s]
βœ” | 1 0 | msprime-geneflow [2.7s]
βœ” | 1 0 | msprime-metadata [2.7s]
βœ” | 1 0 | msprime [2.8s]
βœ” | 6 | naturalearth [3.1s]
βœ” | 1 0 | nonserialized [2.8s]
βœ” | 3 40 | population [8.7s]
βœ” | 1 0 | pure-msprime-vs-slendr [2.6s]
βœ” | 1 0 | pure-slim-vs-slendr [2.5s]
βœ” | 1 0 | resizes-locked [2.5s]
βœ” | 1 0 | resizes [2.6s]
βœ” | 1 0 | runners [2.4s]
βœ” | 5 15 | sampling [14.4s]
βœ” | 17 | serialization [34.1s]
βœ” | 1 0 | simulation-runs [3.0s]
βœ” | 1 0 | time-direction [2.8s]
βœ” | 1 0 | trees [2.9s]
βœ” | 1 0 | ts-ancestors-descendants [2.6s]
βœ” | 1 0 | ts-pure-nonspatial [2.7s]
βœ” | 1 0 | ts-pure-spatial [2.6s]
βœ” | 1 0 | ts [2.6s]
βœ” | 1 0 | tspop [2.7s]
βœ” | 7 | utililites

══ Results ════════════════════════════════════════════════════════════════════════════════════ Duration: 142.3 s

── Skipped tests (33) ───────────────────────────────────────────────────────────────────────── β€’ !is_slendr_env_present() is TRUE (32): test-compilation.R:65:1, test-engines.R:1:1, test-geneflow.R:50:3, test-ibd-squashing.R:1:1, test-ibd.R:1:1, test-interaction-changes.R:51:3, test-manual-ts.R:1:1, test-msprime-geneflow.R:7:1, test-msprime-metadata.R:1:1, test-msprime.R:23:1, test-nonserialized.R:1:1, test-population.R:43:3, test-population.R:61:3, test-population.R:144:3, test-pure-msprime-vs-slendr.R:4:1, test-pure-slim-vs-slendr.R:4:1, test-resizes-locked.R:1:1, test-resizes.R:1:1, test-runners.R:1:1, test-sampling.R:70:3, test-sampling.R:128:3, test-sampling.R:230:3, test-sampling.R:257:3, test-sampling.R:290:3, test-simulation-runs.R:1:1, test-time-direction.R:1:1, test-trees.R:1:1, test-ts-ancestors-descendants.R:1:1, test-ts-pure-nonspatial.R:1:1, test-ts-pure-spatial.R:1:1, test-ts.R:1:1, test-tspop.R:1:1 β€’ interactive() is TRUE (1): test-compilation.R:55:3

[ FAIL 0 | WARN 0 | SKIP 33 | PASS 177 ]

bodkan commented 8 months ago

Ah, yeah. OK. It looks like your slendr environment isn't set up to run the most recent development slendr version. This makes sense because until now you were running a previous slendr version, and since then at least msprime had a new release, which isn't reflected in the Python environment you currently have:

A slendr Python (3.12) environment with the necessary versions of
msprime (1.3.0), tskit (0.5.6), pyslim (1.0.4), and tspop (0.0.2)
has not been found.

(That's a standard greeting message that always prompts the user to re-run setup_env() if any of the msprime/tskit/pyslim have been updated.)

And also, you don't have the slim.exe binary in your $PATH. I forgot you said you've been running everything by setting the slim_path= argument of slim() manually. That's why the unit tests can't even find slim.exe because they assume SLiM is in the $PATH -- it always is in my development environments. So the SLiM-based unit tests are skipped because of this, and even the msprime/tskit unit tests are skipped because you don't have the Python environment present.

I think having you jump through the hoops to set it all up would be wasting your time at this point, so maybe I'll ping you once this PR is finished and installable with normal devtools::install_github() or perhaps even from CRAN using install.packages().

Thanks for jumping in though!

rdinnager commented 8 months ago

I cloned the repo and ran devtools::test(). I got the same errors as you were getting @bodkan, but I think I've tracked down the source of the error. After running the test I took one of the failing tests and put it into a script I could rerun a bunch of times. I can confirm that a single example run over and over sometimes works and sometimes failed (failing about a third of the time in this case). After a failure I was able to use reticulate::py_last_error() to get the lower level error message from Python:

reticulate::py_last_error()

── Python Exception Message ──────────────────────────────────────────────────────────────
OverflowError: Python int too large to convert to C long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\rdinn\AppData\Local\R-MINI~1\envs\PYTHON~1.2\Lib\site-packages\tskit\trees.py", line 4421, in metadata
    return self.metadata_schema.decode_row(self._ll_tree_sequence.get_metadata())
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rdinn\AppData\Local\R-MINI~1\envs\PYTHON~1.2\Lib\site-packages\tskit\trees.py", line 4428, in metadata_schema
    return metadata_module.parse_metadata_schema(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SystemError: <functools._lru_cache_wrapper object at 0x00000155F40E66C0> returned a result with an exception set

The culprit is: OverflowError: Python int too large to convert to C long

This is likely related to the use of a type of 'int' or 'long' somewhere in tskit. The likely reason it is only failing on Windows is because 'int' and 'long' types on Windows C are always 32bit (well, at least on MINGW, don't ask me why), but on linux it is 64bit (https://stackoverflow.com/questions/384502/what-is-the-bit-size-of-long-on-64-bit-windows). So on Windows sometimes this integer overflows in some simulations, but not in linux. The solution is to use a type that is 64bit on both Windows and linux instead (e.g. 'int64_t'), I think.

According to the traceback, the issue is caused by calling metadata_module.parse_metadata_schema(), so that would be the place to start looking in tskit I guess or whatever C code gets called by this function. I will leave this for you to raise with the tskit developers. I'm up against the limits of my understanding of C and Python with this, so I could be wrong about the exact cause or solution, but this seems like a promising area to look (it is something to do with integers I'm pretty sure).

bodkan commented 8 months ago

Whoah! So cool! You got already one step ahead of me, @rdinnager . :) So far I focused only on running things in testthat batch mode until I make sure all the other issues are fixed… and, somewhat frustratingly, for some reason, testthat and/or reticulate truncate the traceback, so the OverflowError never got propagated into the testhat log, only the downstream SystemError. πŸ™„ The fact that one has to run an extra function just to get the full traceback seems like a problem, especially for unit tests which are usually run in non-interactive mode...

I’m not deeply familiar with tskit C internals either, but I agree with your assessment. This would also fit the observation that some simulations ran from the same script fail and some work β€” maybe the numeric IDs of some elements of a SLiM simulation (individual IDs maybe?) in some runs (but not others) don’t fit a 32bit integer?

Thank you so so much! Debugging this in a slow and tiny VM’d Windows virtual screen was a huge pain and I simply had to take a break from it. :D I am really glad you jumped in and did this so quickly.


Hello @benjeffery, I have been working on porting slendr (and its tskit R interface) to Windows and encountered strange non-deterministic crashes causing the following error upon loading SLiM tree sequences / accessing their metadata in Python using tskit (through the R reticulate package):

SystemError: <functools._lru_cache_wrapper object at 0x00000155F40E66C0> returned a result with an exception set

After inspecting the full Python traceback (please see the comment by @rdinnager just above this message) it turns out that the primary issue is this exception:

OverflowError: Python int too large to convert to C long

The full traceback is this:

OverflowError: Python int too large to convert to C long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\rdinn\AppData\Local\R-MINI~1\envs\PYTHON~1.2\Lib\site-packages\tskit\trees.py", line 4421, in metadata
    return self.metadata_schema.decode_row(self._ll_tree_sequence.get_metadata())
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rdinn\AppData\Local\R-MINI~1\envs\PYTHON~1.2\Lib\site-packages\tskit\trees.py", line 4428, in metadata_schema
    return metadata_module.parse_metadata_schema(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SystemError: <functools._lru_cache_wrapper object at 0x00000155F40E66C0> returned a result with an exception set

So the exception seems to be thrown by the method metadata_module.parse_metadata_schema().

@rdinnager who is an expert on working with Windows suggests that the root of the problem might be the fact that 'int' and 'long' types on Windows C are always 32bit (again, please se his message above).

Do you think this could be the cause of the problem I am seeing here and is this something that warrants a closer look on the tskit side of things?

For completeness, this only happens with some SLiM tree sequences when they are processed on Windows, it never happens (even with the exact same slendr simulation code) on Linux or macOS.

Or at least as far as I can test this by running slendr-based msprime and SLiM simulations across the three platforms.

Many thanks for any comments or help with this!

bodkan commented 8 months ago

Interestingly, now that the primary exception has been found, I found similar issues from other R packages using some Python modules underneath, like this one from the R interface to the Google Earth Engine Python package: https://github.com/r-spatial/rgee/issues/79

I'm not sure if this could help identify the exact issue. I looked at the tskit codebase a bit, checking if there's some C or Cython code that might be responsible but it's a lot to digest and it's a bit hard to navigate for me.

benjeffery commented 8 months ago

@bodkan Hi! I assume you're using the latest tskit as some time ago we increased the possible lengths of ragged columns such as metadata to a 64bit int. It may be that on windows the type may still be 32 bit. What length of metadata column causes this issue? You can tell by with len(ts.tables.metadata).

bhaller commented 8 months ago

@benjeffery Might it be a good idea to add a bit of code in tskit, at startup or some convenient spot like that, that just checks sizeof(X) for various types and prints an error and exits if they are not what you expect/need? Trivial to do, and would catch a problem like this on an unusual platform that doesn't use the C type sizes you expect...?

bodkan commented 8 months ago

Hello @benjeffery, thanks for jumping in.

I assume you're using the latest tskit [...]

I release a new slendr version whenever a new version of tskit or msprime or pyslim is released, so currently it includes tskit 0.5.6.

What length of metadata column causes this issue? You can tell by with len(ts.tables.metadata).

I'll try to get that as soon as I manage to isolate a broken simulation run in a fully reproducible way. I don't have Windows access and debugging this over GitHub actions or inside a super slow virtualized Windows machine has been what I imagine using punchcards in the 70s must've been like... The fact that this doesn't happen (for any given simulation) in every run has made investigating this even more challenging.

I'll be in touch as I get more info on this, thank you.

benjeffery commented 8 months ago

@benjeffery Might it be a good idea to add a bit of code in tskit, at startup or some convenient spot like that, that just checks sizeof(X) for various types and prints an error and exits if they are not what you expect/need? Trivial to do, and would catch a problem like this on an unusual platform that doesn't use the C type sizes you expect...?

I haven't had time to dig into this yet - but am surprised we don't have a test for it. Hopefully we can replicate in CI.

bodkan commented 8 months ago

So, I have to say I’m not yet 100% convinced it’s really a tskit-on-Windows issue rather than reticulate-on-Windows issue. In fact, so far I'm leaning towards the latter.

I did finally managed to create a tiny test case which produces a tree sequence which reproducibly causes the crash every time. This will hopefully make it possible to really dig deep to determine just what is it that’s causing the problem! From a casual glance I managed yesterday evening before I had to run home it’s something… utterly bizarre.

bodkan commented 8 months ago

OK, I don't remember dealing with such an obscure weird issue.

@rdinnager Would you mind testing one more thing for me? I managed to create a tree sequence which always breaks on my Windows virtual machine but I'd like to make sure it also happens on a native Windows system. Interestingly, this happens regardless of whether the tree sequence is simulated with SLiM or msprime.

I attach two tree sequences, output_slim.trees and output_msprime.trees (output_trees.zip). Accessing the metadata element of a ts tree-sequence object crashes for both (so it must be something that they have in common!). Could it be the seed maybe? Have to do more digging. Already getting to an always breaking simulation run was quite a struggle.

# restart R to get a clean environment (and no slendr)

# use Python installed by slendr (but this doesn't really have an effect)
library(reticulate)
use_condaenv("Python-3.12_msprime-1.3.0_tskit-0.5.6_pyslim-1.0.4_tspop-0.0.2") 

repl_python() # open a Python REPL inside R

# paste the following into the Python REPL
import tskit
ts = tskit.load("output_msprime.trees")

ts.metadata          # this works without any issues
ts.metadata["SLiM"]  # this works without any issues
# restart R to get a clean environment (and no slendr)

# use Python installed by slendr (but this doesn't really have an effect)
library(reticulate)
use_condaenv("Python-3.12_msprime-1.3.0_tskit-0.5.6_pyslim-1.0.4_tspop-0.0.2") 

# load a tree sequence directly using the Python/R reticulate interface
tskit <- import("tskit")
ts <- tskit$load("output_msprime.trees")

ts$metadata # this works
ts$metadata # but this breaks!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
# ... rinse and repeat

So accessing the metadata breaks but only on every other access?!?! What the hell? This must be some weird, super low-level memory problem related to the Python embedding / caching the metadata method call. But I'm totally out of my depth.

@rdinnager Could you please check that you observe the same behavior? (Note that in my VM it happens for both the SLiM and msprime tree sequence). At least then I'll know I'm not chasing ghosts but that the attached tree sequences break also for you in this way. Thank you very much.


Again, for completeness, here's the error in full:

> ts$metadata # but this breaks! WTF?!?!
Error in py_get_attr_impl(x, name, silent) :
  SystemError: <functools._lru_cache_wrapper object at 0x000001D8D7D5BE20> returned a result with an exception set
Run `reticulate::py_last_error()` for details.
> reticulate::py_last_error()

── Python Exception Message ────────────────────────────────────────────
OverflowError: Python int too large to convert to C long

The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
  File "C:\Users\mp\AppData\Local\R-MINI~1\envs\PYTHON~1.2\Lib\site-packages\tskit\trees.py", line 4421, in metadata
return self.metadata_schema.decode_row(self._ll_tree_sequence.get_metadata())
^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mp\AppData\Local\R-MINI~1\envs\PYTHON~1.2\Lib\site-packages\tskit\trees.py", line 4428, in metadata_schema
return metadata_module.parse_metadata_schema(
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    SystemError: <functools._lru_cache_wrapper object at 0x000001D8D7D5BE20> returned a result with an exception set

  ── R Traceback ─────────────────────────────────────────────────────────
  β–†
  1. β”œβ”€ts$metadata
  2. └─reticulate:::`$.python.builtin.object`(ts, metadata)
  3.   └─reticulate:::py_get_attr_or_item(x, name, TRUE)
  4.     └─reticulate::py_get_attr(x, name)
  5.       └─reticulate:::py_get_attr_impl(x, name, silent)

Here's a slendr script which generates a broken SLiM and msprime tree sequence. No need to run this, both tree sequences are attached to this message:

# simulate a "broken" tree sequence ---------------------------------------

devtools::load_all() # must be in a checked out slim-on-windows branch of slendr
init_env()

SEED <- 4106790818

p <- population(name = "pop1", N = 700, time = 1)
model <- compile_model(populations = p, generation_time = 30, simulation_length = 1000)

output_slim <- "output_slim.trees"
output_msprime <- "output_msprime.trees"

slim(model, sequence_length = 1e6, recombination_rate = 0, random_seed = SEED, output = output_slim, load = FALSE)
msprime(model, sequence_length = 1e6, recombination_rate = 0, random_seed = SEED, output = output_msprime, load = FALSE)
bodkan commented 8 months ago

@bodkan wrote: Accessing the metadata element of a ts tree-sequence object crashes for both SLiM and msprime tree sequences (so it must be something that they have in common!). Could it be the seed maybe? Have to do more digging. Already getting to an always breaking simulation run was quite a struggle.

Holy crap. It is the seed?!?

devtools::load_all()
init_env(quiet = TRUE)
MAX_INT <-  2147483647 # https://en.wikipedia.org/wiki/2,147,483,647#In_computing
SEED <- MAX_INT

p <- population(name = "pop1", N = 100, time = 1)
model <- compile_model(populations = p, generation_time = 30, simulation_length = 1000)

while (TRUE) { # infinite loop, careful
  cat("*")
  ts <- msprime(model, sequence_length = 1e3, recombination_rate = 0, random_seed = SEED)
}
MAX_INT <-  2147483647
SEED <- MAX_INT + 1       # <==== NOTE THE +1!

p <- population(name = "pop1", N = 100, time = 1)
model <- compile_model(populations = p, generation_time = 30, simulation_length = 1000)

while (TRUE) { # infinite loop except on Windows
  cat("*")
  ts <- msprime(model, sequence_length = 1e3, recombination_rate = 0, random_seed = SEED)
}

This is the weirdest, craziest bug I've ever encountered, ever. How hilarious that all this frustrating debugging and digging boils down to... a huge-ass number being a little bigger than allowed!

I still have no idea why ts.metadata crashes in reticulated Python on Windows only once every second access (seriously, WTF?!). If it's int overflow, I would expect it to happen every time? But I really have no idea. Clearly this has something to do with the caching of the metadata access method (the primary error I was getting was related to method caching).

@rdinnager, I think you can ignore my request for help in the message above. I'll adjust the range of integers that slendr picks a random seed from and also put a constraint on the range of user-provided random seeds -- for both slim() and msprime() functions.

Then I'll run the unit tests again and see what happens. 🀞

bodkan commented 8 months ago

Oh! It doesn't look like slendr is generating a random seed in case the user doesn't provide any. If no random seed is explicitly specified, then slendr lets SLiM or msprime pick their own.

This is interesting because if SLiM tends to provide a wider range of integers as random seeds than msprime, it would explain why it's been only slim()-based unit tests failing this whole time, never msprime unit tests -- which is what threw me off originally as I assumed it's something specific to SLiM tree sequences. Fun!

Given that the error happens someplace slendr has no control over (low-level interface between R/embedded-Python), maybe a compromise (at least intermediate one) is for slendr to actually generate a random seed of its own, restricted to be less than .Machine$integer.max.

bhaller commented 8 months ago

Wow, craziness. Nice debugging!

bodkan commented 8 months ago

OK, so all the slendr unit tests are passing across macOS / Linux / Windows with no exceptions. I also verified that all the vignettes (which contain much more ambitious simulations runs) are building on all three systems.

It really was the random seed being bigger than a 32-bit integer causing the intermittent crashes on Windows during parsing of the metadata (in which slendr saves dictionary with some custom metadata, including the seed used). Some very obscure interaction at the level of R-Python interface that's impossible for me to explain because it would require digging into that interface at a C level. I'm sure it would be fun to figure out but I can't put more time into this.

As far as slendr is concerned, this is fixed by making sure that random seeds are less than 2,147,483,647 which makes Windows happy in the current state of things.

Once this is merged, I will put out a new slendr version to release the Windows support to the wild. I'm really happy to have this in because porting the entire unit test suite to Windows required changes to make the code more robust overall.

I'm pretty sure there are still smaller issues here and there but discovering those would require using slendr and it's tskit interface on Windows for more serious work. Not something I can manage on a slow-oh-my-god-why-is-it-so-slow virtualized Windows system with a hilariously low screen resolution (I can't even get a proper fullscreen RStudio view! :)). Those bugs can be taken care of with individual GitHub issues.

What a journey this has been. Thank you everyone for helping out!

image