alexkychen / assignPOP

Population Assignment using Genetic, Non-genetic or Integrated Data in a Machine-learning Framework. Methods in Ecology and Evolution. 2018;9:439–446.
http://alexkychen.github.io/assignPOP/
GNU General Public License v3.0

accuracy.MC error #2

Open rmpeery opened 6 years ago

rmpeery commented 6 years ago

I have successfully analyzed a dataset with 248 individuals and ca. 9700 loci in assignPOP using the K-fold method. However, the MC method gives an error that I'm hoping you can help with. I used the code from the vignette as a starting point: assign.MC(referenceAlleles, train.inds=c(10, 15, 19), train.loci=c(0.1, 0.25, 0.5, 1), loci.sample="fst", iterations=30, multiprocess = TRUE, model="lda", dir="MC/") but when I run accuracy.MC(dir="MC2/") I get the error: Error in [<-.data.frame(*tmp*, i, , value = c(0.333333333333333, 0.235294117647059, : replacement has 6 items, need 7.

Do you have any suggestions on how I can fix the input file so that accuracy.MC will run? I'm running R v. 3.4.2 through RStudio v. 1.1.383.

alexkychen commented 6 years ago

Hi, are you trying to run accuracy.MC for your assign.MC results saved in the folder "MC"? If so, you should specify the same folder name when running accuracy.MC. [ i.e., accuracy.MC(dir="MC/") ]. Please let me know if the problem is something else. Thanks :)

rmpeery commented 6 years ago

I do not think the problem is specification of the directory. Here is my exact code and the terminal output. I think that assign.MC is working because check.loci works, but the accuracy.MC command is not. The input file does not seem to be an issue because analysis through the kfold method works as expected.

assign.MC(referenceAlleles, train.inds=c(10, 15, 19), train.loci=c(0.1, 0.25, 0.5, 1), loci.sample="fst", iterations=30, multiprocess = TRUE, model="lda", dir="MC2/")

3 cores/threads of CPU will be used for analysis...
Monte-Carlo cross-validation done!! 360 assignment tests completed!!

accuRes_MC <- accuracy.MC(dir="MC2/")
Error in [<-.data.frame(*tmp*, i, , value = c(0.333333333333333, 0.235294117647059, : replacement has 6 items, need 7

check.loci(dir = "MC2/", top.loci = 100)
3 levels of training individuals are found.
Which levels would you like to check? (separate levels by a whitespace if multiple)
Options: 10, 15, 19, or all

enter here: all

Results were saved in a 'High_Fst_Locus_Freq.txt' file in the directory.

alexkychen commented 6 years ago

Hi, it looks like the problem may be related to the "Out_xx_xx_xx.txt" files in the folder. Could you copy and paste the first few rows of data, including the column names, from any of those files? How many populations do you have? If you have 3 populations, you should see 6 columns in your data. Columns are separated by spaces. Thanks.
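The column check described above can be sketched in base R. This is a minimal sketch, assuming the result files are space-separated with a header row, as described in the comment. The helper name check_out_columns and the toy file are hypothetical, written only for illustration, and are not part of assignPOP.

```r
# Hypothetical helper (not part of assignPOP): verify that a result file
# has 3 ID columns (Ind.ID, origin.pop, pred.pop) plus one probability
# column per population.
check_out_columns <- function(path, n_pops) {
  res <- read.table(path, header = TRUE, sep = " ", stringsAsFactors = FALSE)
  list(
    n_cols      = ncol(res),
    expected    = 3 + n_pops,
    ok          = ncol(res) == 3 + n_pops,
    # populations that actually appear among the test individuals
    pops_tested = sort(unique(res$origin.pop))
  )
}

# Toy file standing in for one of the "Out_..." files (made-up values).
toy <- data.frame(
  Ind.ID     = c("Ind1", "Ind2"),
  origin.pop = c("pop.1", "pop.2"),
  pred.pop   = c("pop.1", "pop.3"),
  pop.1      = c(0.9, 0.1),
  pop.2      = c(0.05, 0.2),
  pop.3      = c(0.05, 0.7)
)
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, quote = FALSE, row.names = FALSE, sep = " ")

chk <- check_out_columns(tmp, n_pops = 3)
print(chk$ok)           # TRUE when the file has 3 + n_pops columns
print(chk$pops_tested)  # populations represented in this test set
```

Comparing pops_tested against the full list of populations also shows whether any population is missing from a given test set.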

rmpeery commented 6 years ago

There are 7 reference/training populations.

Ind.ID origin.pop pred.pop pop.1 pop.2 pop.3 pop.4 pop.5 pop.6 pop.7
Ind1 pop.1 pop.5 4.84706781346196e-54 4.57393762693888e-07 7.29146191461918e-31 3.64814949912892e-33 0.999999542606237 5.22110759170651e-85 1.71017166046101e-29
Ind2 pop.1 pop.4 4.93567729299893e-13 5.65857919194618e-66 9.18114277199036e-43 0.999999999977031 2.24752661106583e-11 5.05244282329466e-109 3.70833483677742e-45
Ind3 pop.1 pop.5 6.88643447363136e-32 1.98753423758658e-07 3.55556571792537e-19 1.62028194813181e-30 0.999999801246576 8.77961552598731e-80 1.81120047917388e-21
Ind4 pop.1 pop.3 3.13597690009681e-31 2.98045586414029e-07 0.99999531868614 8.45345513191409e-42 4.3832682738894e-06 6.45006618152228e-39 3.83912716502444e-17
Ind5 pop.1 pop.2 3.04788250036642e-74 1 9.83211838077908e-54 3.11838856058822e-70 2.71412666369024e-19 2.21277959386241e-116 3.29727224458226e-25
Ind6 pop.1 pop.5 3.19409244680128e-06 1.49086225660614e-29 4.60375458140432e-31 6.5782822333848e-35 0.999996805907426 1.71311828788165e-40 1.27236017210606e-13
Ind7 pop.1 pop.3 3.04101345149839e-44 0.121223029730926 0.878768920984708 2.67448709614046e-59 1.19263049028514e-25 2.52802861980595e-80 8.04928436651628e-06

alexkychen commented 6 years ago

Your populations, sample IDs, and column names seem to be correct. I've manipulated my own data and run accuracy.MC on it, but I still can't reproduce the error message you are getting. Would you mind sending me your zipped MC2 folder? I can run it on my side and take a closer look at where the problem is. Please email me at alexkychen@gmail.com. Meanwhile, if you haven't done so already, could you download the example data (simGenepop.txt) and give it a quick run on your computer to see if it works? Thanks so much.

allanbcostello commented 6 years ago

I have a similar error message:

assign.MC(SDATA, dir="Result-folder/", train.inds=c(0.5,0.7,0.9),
  train.loci=c(0.1,0.25,0.5,1), loci.sample="fst", iterations=30,
  model="svm")
3 cores/threads of CPU will be used for analysis...
Monte-Carlo cross-validation done!! 360 assignment tests completed!!

accuMC <- accuracy.MC( dir = "Result-folder/" )
Error in [<-.data.frame(*tmp*, i, , value = c(0, 0, 0, 0, 0, 0, 0, : replacement has 16 items, need 20

allanbcostello commented 6 years ago

One other thing... my genepop data is in 3-digit fragment size format as opposed to allele number format, if that would be relevant. I ran the test data set you mention above and that works just fine.

alexkychen commented 6 years ago

Hi, how many samples/individuals do you have in each of the populations? If one of your populations has only 4 or fewer individuals, this error can happen when using train.inds=0.9 in your assign.MC analysis. In that case no individuals from the small populations (looks like 4 out of 20 in your case) are assigned to the test sets, because 4 x 0.9 = 3.6 is rounded up to 4. To fix it, you could either use a fixed number of training individuals (e.g., train.inds = 3) in assign.MC or increase the sample size of your small populations. Some people duplicate individuals in small populations, but that could inflate the results for those populations.
If your issue is something else, please let me know and I will investigate further. Thanks!

PS. The number of digits in your Genepop file should have nothing to do with the function accuracy.MC, but thanks for providing that information. Just be sure to set "haploid=TRUE" in read.Genepop if your data are haploid.
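The arithmetic behind this explanation can be sketched as follows. This is an illustration only: it assumes, following the comment above, that the number of training individuals is the proportion times the population size, rounded up. The function name test_inds_left and the population sizes are made up, and assignPOP's internal rounding may differ.

```r
# Hypothetical illustration, not assignPOP code.
# Assumption: training count = ceiling(population size * proportion).
test_inds_left <- function(pop_sizes, train_prop) {
  n_train <- ceiling(pop_sizes * train_prop)
  pop_sizes - n_train  # individuals remaining for the test set
}

# Made-up population sizes; two populations are very small.
pop_sizes <- c(popA = 30, popB = 12, popC = 4, popD = 3)

left <- test_inds_left(pop_sizes, train_prop = 0.9)
print(left)

# Populations that would contribute no test individuals at this proportion
print(names(left)[left == 0])
```

Running a check like this on the real per-population sample sizes before calling assign.MC shows which populations would be absent from the test sets at a given proportion.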

allanbcostello commented 6 years ago

Thanks so much for your response. I do indeed have a few small pops that I will remove and try again. I'll let you know how it goes. Cheers... ABC

(Sent by email in reply to Alex Chen's comment of Tue, Mar 20, 2018 at 2:02 PM.)

alexkychen commented 6 years ago

@rmpeery @allanbcostello I have modified the function accuracy.MC and updated the package on this GitHub repo. It should now be able to handle a population that has no samples in your test sets, meaning that the error you had should disappear. You can update/re-install the package from GitHub for now, or just copy the function from here and run it locally on your machine. The official version 1.1.5 will be released to CRAN later on. Please let me know if it works or not. Thanks!

TepoltC commented 6 years ago

I got this same error yesterday, and in my case I think it was related to using a dash in one of my population names. (When I relabelled, it ran fine.) Thought I'd add this in case anyone else runs into the same issue.

Thanks for a really nice tool, and for providing such clear instructions!
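If a dash in a population name is the trigger, one way to relabel is to replace such characters before passing the names to the package. A minimal base R sketch follows; the helper name clean_pop_names and the example names are hypothetical, and why the dash causes the error is not established in this thread.

```r
# Hypothetical helper: replace anything other than letters, digits,
# underscores and dots with an underscore.
clean_pop_names <- function(pop_names) {
  cleaned <- gsub("[^A-Za-z0-9_.]", "_", pop_names)
  # Guard against two different names collapsing into the same label.
  if (anyDuplicated(cleaned)) {
    stop("Cleaning produced duplicate population names; relabel manually.")
  }
  cleaned
}

pops <- c("North-Bay", "South", "East-1")
print(clean_pop_names(pops))
```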

alexkychen commented 6 years ago

@TepoltC
Thank you for the heads-up. I will take a further look when possible. Also thanks for your nice words!