bioXiaoheng / BalLeRMix

Software package for BalLeRMix and scripts used in the study "Flexible mixture model approaches that accommodate footprint size variability for robust detection of balancing selection" (Cheng & DeGiorgio 2020)
4 stars 1 forks source link

one input file and one spectrum file per chromosome? #5

Closed WeiCSong closed 1 year ago

bioXiaoheng commented 3 years ago

Hi Weichen,

To create a concatenated file, you can use a command line such as: cat <filenames for all chromosomes> | awk '(NR==1 || $1 != "position" )' > <concatenated file>

Computing the spectrum doesn't require the knowledge of positions of sites, hence it is ok to directly concatenate input files for all chromosomes together for BallerMix to run.

When actually running B statistics on genomics data, the genetic positions are important information that (together with the parameter A) decides how closely a site in the window is linked to the test site. Because loci on two different chromosomes are completely unlinked, their input files should not be concatenated. Besides, the discontinuity in the positions when you go from the end of one chromosome to the start of the other will most likely give computational errors.

Hope the above explanation helps.

Best, Xiaoheng

WeiCSong commented 3 years ago

Thanks for your helpful information and apologies for my mistake on the question edit : ) I generated these files as you mentioned (based on 1000G EUR population), and I ran:

python BalLeRMix_v2.3.py -i input/binput.22.txt --spect spect.txt -o result/b2.22.txt

I'm confused by two observations:

Firstly, the log file showed that it is "Skipping" nearly all of the parameter combinations (there are too many lines of "Skipping ...", I failed to count it), but by visual inspection, the spectrum file seems reasonable, as I attached below. I'm not sure whether this indicated an error.

Secondly, the program terminated after just several hundred lines of output, with the following error:

2021-07-14 15:50:55.785500. Start computing likelihood ratios... Computing LR on every 1 site/s, using all the data with alpha >= 1e-8. writing output to result/b2.22.txt Traceback (most recent call last): File "BalLeRMix_v2.3.py", line 772, in main() File "BalLeRMix_v2.3.py", line 764, in main scan(xGrid,AGrid,AdGrid,aGrid,opt.outfile,phys_pos,pos_list,Ks,Ns,numSites,Spec, logSpec, Fx,N, opt.size,opt.r,opt.step,opt.phys,opt.nofreq,opt.MAF,opt.noCenter,opt.Rrate)#, ProbsStar File "BalLeRMix_v2.3.py", line 636, in scan scan_alpha(xGrid,AGrid,AdGrid,aGrid,outfile,phys_pos,pos_list,Ks,Ns,numSites,Spec,logSpec,Fx,N,s,MAF) File "BalLeRMix_v2.3.py", line 601, in scan_alpha scores.write('%s\t%s\t%s\t%s\t%g\t%d\n' % (phys_pos[i], poslist[i], Tmax[0], ','.join(['%g' % () for _ in Tmax[1]]), Tmax[2], optWinSize)) TypeError: 'int' object is not iterable

Do you have any ideas about what's going on? Thanks for your help!

spect.txt binput.22.txt

bioXiaoheng commented 3 years ago

Hi Weichen,

I took a quick look at your spect file and noticed a few things:

  1. You seemed to have trimmed off rare variants (allele count <5) from your input, which could be what some of the "skipping" were about;

  2. The allele counts in your initial input are for minor alleles (i.e., not polarized). Therefore, you should add "--MAF" tag when running the program.

  3. This may be unrelated to running this program, but I do not recommend pooling multiple populations (EUR consists of multiple different European populations) for analysis of balancing selection, because by doing this you're artificially creating population structures that might mislead the inference.

Can you try running it with --MAF and see if the issue persists?

WeiCSong commented 3 years ago

Thanks for you suggestion! I added the --MAF flag and the "skipping" information got much less, but the Type Error still persisted. Seem that the program terminated at the same position.

bioXiaoheng commented 3 years ago

Thanks for you suggestion! I added the --MAF flag and the "skipping" information got much less, but the Type Error still persisted. Seem that the program terminated at the same position.

Hi Weichen,

I sincerely apologize for the much-delayed response---this issue got tabled during the crunch for another deadline, and I wasn't reminded of it until now. Does the problem still persist? If so, at which position did it terminate?

WeiCSong commented 3 years ago

Thanks for your kindness and help! I'm currently using the CEU B2 result and did not further run ballermix on previous data. I haven't made progress on this issue yet.