bensutherland / simple_pop_stats

A short analysis of population statistics given specific inputs

Error at diploid loci in 100% simulations #7

Closed bensutherland closed 4 years ago

bensutherland commented 4 years ago

There is an issue with the 100% simulations that seems to be due to record names such as allele 1 ("ots_epic4_158_1") and allele 2 ("ots_epic4_158_1_1"). This may need to be corrected in the marker names, ensuring that they do not have the _1 at the end. It is not clear how to resolve this in the meantime, other than deleting the offending record.

full_sim(rubias_base.FN = "S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt"
+         , num_sim_ind = 200, sim_reps = 100
+         )
Parsed with column specification:
cols(
  .default = col_double(),
  sample_type = col_character(),
  collection = col_character(),
  repunit = col_character(),
  indiv = col_character(),
  ots_epic4_158_1 = col_logical(),
  ots_epic4_158_1_1 = col_logical()
)
See spec(...) for full column specifications.
Warning: 3 parsing failures.
 row               col           expected actual                                                                                                                  file
1234 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE      2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
1243 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE      2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'
1269 ots_epic4_158_1_1 1/0/T/F/TRUE/FALSE      2 'S:/01_chinook/PBT/2020/00_analysis_only/yukon_SNP_bjgs_2020-03-11/bch_4977_ind_53_pop_390_amp_Rubias_2020-03-11.txt'

[1] "Generating counts per collection and per repunit"
[1] "Performing 100% simulation"
[1] "Running simulation and assessing the simulation assignments"
[1] "This will include a total of ***53*** scenarios"
Error in input.  At diploid loci, either both or neither gene copies must be missing. Offending locus = ots_epic4_158_1
    Note! This might indicate that the gen_start_col is incorrect.
 Error in get_ploidy_from_frame(tmp) : 
  Bailing out due to single gene copies being missing data at non-haploid loci. 

erondeau commented 4 years ago

This results from missing data combined with the default behaviour of read_tsv in the readr package. I stumbled across it independently of the error above, so I can't guarantee it is exactly the same behaviour, but it seems likely.

Note:
?read_tsv: "col_types
One of NULL, a cols() specification, or a string. See vignette("readr") for more details. If NULL, all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself."

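So, readr looks only at the first 1000 rows. If a column has no genotypes (or perhaps very few?) in those rows, it is typed as logical because of the frequency of NAs. In the log above, ots_epic4_158_1 and ots_epic4_158_1_1 were both read as col_logical(), and the genotype 2 at rows 1234, 1243 and 1269 of ots_epic4_158_1_1 then failed to parse. Values that fail to parse become NA, so presumably one gene copy at that locus ends up missing while the other is present, which matches the rubias error. A column that is all NA in the first 1000 rows is likely indicative of poor markers, collection-specific NAs, or panel-version-specific NAs.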
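A quick way to check whether a given file is affected is to read it with the defaults and list the columns that came back as logical. The sketch below is only an illustration: the toy file, its column names (locus_a, locus_a_1) and the 1200-row block of missing genotypes are made up, and whether the columns are actually guessed as logical may depend on the readr version installed.

```r
library(readr)

# Hypothetical toy file: one locus stored as two gene-copy columns, with every
# genotype missing in the first 1200 rows and present only afterwards.
n <- 1500
toy <- data.frame(
  indiv     = paste0("ind_", seq_len(n)),
  locus_a   = c(rep(NA, 1200), rep(1, 300)),
  locus_a_1 = c(rep(NA, 1200), rep(2, 300))
)
toy_path <- tempfile(fileext = ".txt")
write.table(toy, toy_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read with default type guessing, as in the failing call.
dat <- suppressWarnings(read_tsv(toy_path))

# Columns that were guessed as logical are the suspects.
logical_cols <- names(dat)[vapply(dat, is.logical, logical(1))]
print(logical_cols)

# Rows and values that failed to parse, if any.
print(problems(dat))
```

Any genotype column that shows up in logical_cols is worth checking against the parsing failures reported by problems().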

There are two potential ways to fix this: a) explicitly define the column types (e.g. all character on input), or b) look at more rows before guessing. I've chosen option b and addressed it by increasing guess_max to 100000 by default: https://github.com/bensutherland/simple_pop_stats/commit/998fd8ef238af50c2697d06cea7dd3b4c66496ff
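For reference, the two options could look roughly like this when calling read_tsv directly. This is a minimal sketch on a made-up toy file (the column names and row counts are hypothetical), not the code from the linked commit; col_types, cols(), col_character() and guess_max are standard readr arguments and helpers.

```r
library(readr)

# Hypothetical toy file: genotypes missing in the first 1200 rows only.
n <- 1500
toy <- data.frame(
  indiv     = paste0("ind_", seq_len(n)),
  locus_a   = c(rep(NA, 1200), rep(1, 300)),
  locus_a_1 = c(rep(NA, 1200), rep(2, 300))
)
toy_path <- tempfile(fileext = ".txt")
write.table(toy, toy_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Option a: declare every column as character so nothing is guessed.
dat_a <- read_tsv(toy_path, col_types = cols(.default = col_character()))

# Option b: let readr inspect more rows before guessing column types.
dat_b <- read_tsv(toy_path, guess_max = 100000)

# In both cases the late genotypes survive instead of turning into NA.
print(dat_a$locus_a_1[1234])
print(dat_b$locus_a_1[1234])
```

Option a is independent of where the missing data sits, but leaves every column as character, so downstream code may need to convert types. Option b keeps the guessed types, but only helps while guess_max covers enough rows to reach the first non-missing genotype.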