EricArcher / strataG

strataG is a toolkit for haploid sequence and multilocus genetic data summaries, and analyses of population structure.
25 stars 12 forks source link

Converting fsc to gtypes #57

Closed konopinski closed 2 years ago

konopinski commented 2 years ago

Dear Eric, I have simulated some sequences with the following code:

demes <- fscSettingsDemes(
  fscDeme(1000,100),
  fscDeme(10000,100),
  fscDeme(100000,100),
  ploidy = 1
)

events <- fscSettingsEvents(
  fscEvent(10000,0,2,1,1,0,0),
  fscEvent(10000,1,2,1,1,0,0)
)
nloci = 666L
seqlength = 150L
genetics <- data.frame(chromosome = rep(1,nloci), 
           actual.type = rep("DNA",nloci), 
           fsc.type = rep("DNA",nloci), 
           num.markers = rep(seqlength,nloci), 
           recomb.rate = rep(0,nloci), 
           mut.rate = round(exp(-rnorm(nloci,mean = 19,sd = 1)),11), 
           param.5 = rep(1/3,nloci), 
           param.6 = rep(NA,nloci))
attributes(genetics)$class <- c("fscSettingsGenetics", "fscBlock", "data.frame"  )
attributes(genetics)$num.chrom <- 1
attributes(genetics)$chrom.diff <- FALSE

p <- fscWrite(demes,genetics,migration = NULL,events = events,
              est = NULL,def = NULL,label = "strataG.fsc",use.wd = TRUE)

num.sims <- 1000
p <- fscRun(p,num.sims = num.sims,num.cores = 12,seed = 545125)

I cannot convert the results to gtypes format. I constantly get an error:

2022-07-22 01:46:50 reading ./strataG.fsc/strataG.fsc_1_1.arp 
2022-07-22 01:46:52 parsing genetic data...
xxx <- fsc2gtypes(p,marker = "dna", sim = 1)
Error: the number of genes in 'sequences' is not equal to the number of loci

I have traced it down to the final definition of the gtypes object in df2gtypes function (lines 91-101 - the function new) but cannot go deeper as I'm not too familiar with S4 object.

Anyway, I have to emphasize again that I LOVE THE PACKAGE ;) I managed to analyse the data using other methods but I think you might find this issue interesting. Or did I do something wrong? I needed this strange definition of genetics because I wanted to simulate different mutation rates at different loci. Maybe the problem is in this part? I'm using the latest version from Github (2.5.01).

Cheers, Maciek

EricArcher commented 2 years ago

Maciek,

I can duplicate your error and can see where it occurs, but I haven't been able to fully work out why it is occurring, and I'm a bit slammed with other things at the moment. Nonetheless, when I recode your block specification as below, everything works. That leads me to believe there is something about your manual creation of the fscSettingsGenetics object that is off.

dna.blocks <- lapply(1:nloci, function(i) {
  fscBlock_dna(
    sequence.length = seqlength, 
    mut.rate = round(exp(-rnorm(1,mean = 19,sd = 1)), 11)
  )
})
genetics <- do.call(fscSettingsGenetics, dna.blocks)

Cheers, Eric

konopinski commented 2 years ago

Unfortunately, your solution did not work in my R. I forgot to add I'm using Ubuntu and R 4.2.1 in Rstudio. Do not worry about this as I found a working solution so this is just to let you know that there might be one more thing to improve. I would be happy to help you with it but I have to read more about defining S4 objects.

I am wondering why did you chose creating those objects in this form instead of just simple data.frame. That would make things easier sometimes ;)

EricArcher commented 2 years ago

Hmmm... Can you describe more clearly what didn't work? What errors do you see? There shouldn't be a difference in the code for R in Ubuntu and Mac (on which I develop). It would help to know if this is truly an error in the functions or an error in the way you constructed the object.

I use S4 explicitly because it gives me control in other functions of the form and values of these objects. It means the functions don't (shouldn't) have to check that all the pieces are there and formatted correctly because they may have been constructed by a user rather than the proper functions in the package ;)

konopinski commented 2 years ago

Sorry - I wasn't clear. I ment the blocks and settings for the fsc settup. They could be plain data.frames without the special attributes. I guess this would work until the user knows exactly what does she/he wants, but you'd probably have many more conversations like this one. Well I think I have found some cue. fsc2gtypes works perfectly with only a single DNA block, while when there are more blocks provided it crushes with the same error (Error: the number of genes in 'sequences' is not equal to the number of loci). I got the same error using the method suggested by you (lapply with dna_block and so on), the default definition and my 'manual' method. The parameter file is fine as the simulation in fsc works fine (actually, this is why I started using strataG - I found it so difficult to write parameter file without mistakes). I thought it might be p but df2gtypes does not touch p. I wish I could solve it myself.

EricArcher commented 2 years ago

Unfortunately, it is harder for me as a developer to make them plain data.frames. There is internal checking that happens and the other functions need to know that this checking has been done and that they conform to the expected structure. You're right - people would be more tempted to make their own and I'd have a lot more debugging conversations.

Could you send me the code that generates the error you're seeing (with my 'lapply' code to define the genetic structure) and a copy of the screen output. I tested the code I sent above on my system and although I can duplicate your initial error, my code does not generate an error and works as designed. There may be a bug in package code, but this suggests that if there is, it is due to some system differences. I would really like to track it down and to do so, I want to make sure I'm running exactly what you're running.

konopinski commented 2 years ago

obraz

konopinski commented 2 years ago

With:

genetics <- data.frame(chromosome = rep(1,nloci), 
                       actual.type = rep("DNA",nloci), 
                       fsc.type = rep("DNA",nloci), 
                       num.markers = rep(seqlength,nloci), 
                       recomb.rate = rep(0,nloci), 
                       mut.rate = round(exp(-rnorm(nloci,mean = 19,sd = 1)),11), 
                       param.5 = rep(1/3,nloci), 
                       param.6 = rep(NA,nloci))
attributes(genetics)$class <- c("fscSettingsGenetics", "fscBlock", "data.frame"  )
attributes(genetics)$num.chrom <- 1
attributes(genetics)$chrom.diff <- FALSE

and

genetics <- fscSettingsGenetics(
  fscBlock_dna(sequence.length = seqlength,
               mut.rate = round(exp(-rnorm(1,mean = 19,sd = 1)), 11)
  ),
  fscBlock_dna(sequence.length = seqlength,
               mut.rate = round(exp(-rnorm(1,mean = 19,sd = 1)), 11)
  ),
  fscBlock_dna(sequence.length = seqlength,
               mut.rate = round(exp(-rnorm(1,mean = 19,sd = 1)), 11)
  )
)

it is exactly the same. It is basically the code from the original post. But when I change nloci to 1 it works with all the methods.

konopinski commented 2 years ago

and the expected outcome just for one dna block:


<<< strataG.fsc >>>

Contents: 300 samples, 1 locus, 3 strata

Strata summary:
  stratum num.ind num.missing num.haplotypes
1  Deme.1     100           0              1
2  Deme.2     100           0              1
3  Deme.3     100           0              2

Sequence summary:
     locus num.seqs mean.length mean.num.ns mean.num.indels
1 C1B1_DNA        2         150           0               0

I used this code in the publication yet to be accepted and it gave me whatever I needed. This time I wanted to simulate shotgun sequencing (such as RAD on I;lumina) thus I need to make mutation rates variable and get many small unlinked fragments (actually I should put the fragments on different chromosomes to make them unlinked).

EricArcher commented 2 years ago

Thanks for this. I'll dig into the code to see if there is some reason the multi-block specification works as desired on my system, but not on yours and will let you know what I find...

konopinski commented 2 years ago

But please, do not spend the weekend on it :) It's 1:30AM in Poland so it's time for me too. Thanks a lot for everything you do here. Cheers, Maciek

konopinski commented 2 years ago

Hi, stupid thing... it works. The only thing I've changed in the genetics definition was that I placed the sequences in separate chromosomes to unlink them completely. Like that:

dna.blocks <- lapply(1:nloci, function(i) {
  fscBlock_dna(
    chromosome = i,
    sequence.length = seqlength,
    mut.rate = round(exp(-rnorm(1,mean = 18,sd = 1)), 11)
  )
})
genetics <- do.call(fscSettingsGenetics, dna.blocks)

Maybe that will shed some light on the issue?

EricArcher commented 2 years ago

Would you send me the .rdata file produced by the following code on your machine?

rm(list = ls())
library(strataG)

demes <- fscSettingsDemes(
  fscDeme(1000, 100),
  fscDeme(10000, 100),
  fscDeme(100000, 100),
  ploidy = 1
)

events <- fscSettingsEvents(
  fscEvent(10000, 0, 2, 1, 1, 0, 0),
  fscEvent(10000, 1, 2, 1, 1, 0, 0)
)

dna.blocks <- replicate(
  5, 
  fscBlock_dna(
    sequence.length = 150, 
    mut.rate = round(exp(-rnorm(1, mean = 19, sd = 1)), 11)
  ),
  simplify = FALSE
)
genetics <- do.call(fscSettingsGenetics, dna.blocks)

p <- fscWrite(
  demes, genetics, events = events,
  label = "konopinski.test", 
  use.wd = TRUE
)

p <- fscRun(p, num.sims = 1000, num.cores = 6, seed = 545125)

df <- fscReadArp(p)

save.image("konopinski test.rdata")
konopinski commented 2 years ago

konopinski test.zip Hope this helps. It seems that is is not only my problem - in June there was a similar issue with the same error (reported by akoontz11). I do not know how that one ended because there is no answer to your comment (https://githubhelp.com/EricArcher/strataG/issues/55).

EricArcher commented 2 years ago

Although that was the same error message, that users' issue was different. The code you're using should have the fix for that that I pushed.

Although I still don't see why it worked on my system and not yours previously, I think I've located the source of this issue.

It has to do with the fact that you are simulating multiple blocks of haploid sequences. In developing the code, I'd not tested the fsc2gtypes() conversion for this scenario. I'd only dealt with a single haploid block. gtypes objects can handle multiple loci (using the multidna object), or I can concatenate all sequences. I'll make that an option in fsc2gtypes().

I will make the changes and let you know when I've pushed up a new version.

EricArcher commented 2 years ago

I have pushed a new version that fixes this issue in fsc2gtypes(). I've also added a new concat.dna argument to that function. If TRUE (default), all haploid DNA blocks are concatenated into a single locus. If FALSE, each block is loaded as an independent locus. The behavior is demonstrated below:

rm(list = ls())
library(strataG)

demes <- fscSettingsDemes(
  fscDeme(1000, 100),
  fscDeme(10000, 100),
  fscDeme(100000, 100),
  ploidy = 1
)

events <- fscSettingsEvents(
  fscEvent(10000, 0, 2, 1, 1, 0, 0),
  fscEvent(10000, 1, 2, 1, 1, 0, 0)
)

dna.blocks <- replicate(
  5, 
  fscBlock_dna(
    sequence.length = 150, 
    mut.rate = round(exp(-rnorm(1, mean = 19, sd = 1)), 11)
  ),
  simplify = FALSE
)
genetics <- do.call(fscSettingsGenetics, dna.blocks)

p <- fscWrite(
  demes, genetics, events = events,
  label = "konopinski.test", 
  use.wd = TRUE
)

p <- fscRun(p, num.sims = 1000, num.cores = 6, seed = 545125)

g.concat <- fsc2gtypes(p, concat.dna = TRUE)
g.multi <- fsc2gtypes(p, concat.dna = FALSE)

g.concat
g.multi

Reinstall the package and let me know if it works on your system. Feel free to close the issue if so.

konopinski commented 2 years ago

The case is closed. Although, very slow for a large number of loci it works. Thanks a lot