alexlancaster / pypop

PyPop: Python for Population Genomics
http://pypop.org
GNU General Public License v2.0
22 stars 15 forks source link

HWE output issues: need to accommodate longer colon-delimited allele names #19

Closed sjmack closed 7 years ago

sjmack commented 7 years ago

:pypop sjmack$ ./bin/pypop.py -c WS_BDCtrl_Test_HW.ini BIGDAWG_SynthControl_Data.pop is generating the following output for the Hardy Weinberg common genotypes test:

------------------------------------------------------------------------------
Common genotypes
:01:01+01:01:01:01           7        6.88        0.00        0.9621      
:01:01:01+02:05:01          11       11.76        0.05        0.8241      
:01:01+03:01:01:01          12       12.01        0.00        0.9975      
:01:01:01+03:01:03           9        8.95        0.00        0.9856      
:01:01+24:02:01:01          13       13.00        0.00        0.9989      
:01:01:01+26:01:01           8        7.95        0.00        0.9864      
 01:01:01:01+26:08           9        9.19        0.00        0.9488      
:01:01+29:02:01:02           9        8.95        0.00        0.9856      
:01:01+31:01:02:01           8        7.95        0.00        0.9864      
 01:01:01:01+32:02          16       16.82        0.04        0.8424     

As with Issue #18 the increased size of the colon-delimited allele names is larger than the hard-coded 18 characters allowed for a + delimited allele pair.

With digit delimited allele names, the maximum length for a pair would have been 20 characters [but in retrospect, I think that PyPop is stripping expression variant suffixes like N and L, as those are optional)], so the maximum length would have been 17 characters (01010101+01010101).

As with Issue #18 the size of the Common genotypes allele pair field needs to be increased to at least 25 characters (e.g., 104:01:01:01+06:127:01:01), and may possibly need to be made flexible to accommodate longer allele names.

sjmack commented 7 years ago

This is actually a more pressing issue than Issue #18, as the first allele is left-truncated, so that the allele-family and protein fields are being truncated.

sjmack commented 7 years ago

Also, allele-lengths appear to be properly accommodated by the Chen/Diff test results:

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03:02:12+03:02:12 (18/8.818363) 0.0037** 0.0037**
05:03:01:01+03:02:12 (0/18.105788) 0.0001**** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
06:02:01+05:03:01:01 (0/23.499002) 0.0000***** 0.0000*****
06:02:01+06:02:01 (27/14.854291) 0.0000***** 0.0000*****
06:03:01+04:02:01 (8/0.287425) 0.0000***** 0.0000*****
06:05:01+05:03:01:01 (0/18.105788) 0.0000***** 0.0000*****
06:05:01+06:05:01 (18/8.818363) 0.0019** 0.0019**

3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03:02:12+03:02:12 (18/8.818363) 0.0017** 0.0017**
05:03:01:01+03:02:12 (0/18.105788) 0.0000***** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
06:02:01+05:03:01:01 (0/23.499002) 0.0000***** 0.0000*****
06:02:01+06:02:01 (27/14.854291) 0.0007*** 0.0007***
06:03:01+04:02:01 (8/0.287425) 0.0000***** 0.0000*****
06:05:01+05:03:01:01 (0/18.105788) 0.0000**** 0.0000****
06:05:01+06:05:01 (18/8.818363) 0.0016** 0.0016**

Although these Chen/diff results would be easier to read if they were structured more like the other HW results:

03:02:12+03:02:12       (18/8.818363) 0.0037**    0.0037**
05:03:01:01+03:02:12    (0/18.105788) 0.0001****  0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****

However, this latter would be a "nice to have".

alexlancaster commented 7 years ago

Got it, so we'll make this dynamic too. How often is the plain text output used these days? The TSV/CSV output obviously doesn't have this problem since there is no attempt to format things. Everything is always generated from the complete "output of record" (which is the XML), i.e. both the TSV/CSV files and the plain text are derived from the XML, and so can be regenerated at any time without having to redo the analysis.

In case you want to try this, you just need the xsltproc tool to be installed (part of libxml2) and you can run the following to regenerate the text output:

 xsltproc xslt/text.xsl BIGDAWG_SynthControl_Data-out.xml > BIGDAWG_SynthControl_Data-out.txt
alexlancaster commented 7 years ago

git pull again. I changed this to use the maximum possible genotype length across the whole input XML and use this as the initial column width for the common genotype (rather than be locus-specific). Currently it is globally hardcoded anyway. Try it on a few files.

sjmack commented 7 years ago

Common Genotypes look good, but as I commented in Issue #18, the common genotypes are right aligned, which looks weird.

I can't really say how frequently PyPop users are relying on the .txt output as opposed to the .xml output.

Personally, I always look at the .txt output first, because I want to know if I need to rerun the data because I set something wrong, or need to change a setting. I use popmeta to generate .dat files from the .xml output, but then I have to look through each .dat file to see what's what.

I do know that I get questions from the community asking how to get the d'ij values out of PyPop; I direct the questioners to the .xml and popmeta, so that suggests to me that many others are relying primarily on the .txt outputs.

alexlancaster commented 7 years ago

Great, if this is fixed from your POV, please close. I opened up a new issue on the Chen/diff stat alignment issue in #20.

There is also the --generate-tsv command line option to pypop.py that generates the .dat files and the .txt and .xml files all in one hit. Probably should document this a bit more (feel free to open up an issue on this too).

sjmack commented 7 years ago

Okay. I'll close. I hadn't been using --generate-tsv, but it is good to be reminded of it.

alexlancaster commented 7 years ago

Hi @sjmack , in testing the new code with dynamic columns based on the allele lengths, I noticed that if the allele names were short it could sometimes cutoff the labels (using the USAFEL example). Try running the USAFEL example before you update to see what I mean. (I'm going to move this to the test suite now so we catch this).

In any case, I just committed a fix to master that should address these, try running again. You might want to look at the output to make sure it looks OK. I had to hardcode the dashed row separator as being 90 characters. it could be made dynamic, but it doesn't seem to me to be a big priority.

sjmack commented 7 years ago

Okay. I see.

./bin/pypop.py -c data/samples/minimal.ini data/samples/USAFEL-UchiTelle.pop

Generated results like this before a pull:

               Observed    Expected  Chi-square   DoF   p-value   
------------------------------------------------------------------------------
     Common Too many parameters for chi-square test.

------------------------------------------------------------------------------
d genotypes Value not calculated.

------------------------------------------------------------------------------
on + lumped Value not calculated.

------------------------------------------------------------------------------
homozygotes          17        9.82        5.25     1  0.0219*     
------------------------------------------------------------------------------
terozygotes          30       37.18        1.39     1  0.2389      
------------------------------------------------------------------------------
Common heterozygotes by allele
0202                 14       15.74        0.19        0.6602      
0301                 11        9.71        0.17        0.6796      
03032                14       19.66        1.63        0.2018      
0304                  4        7.32        1.51        0.2199      
05031                 6        5.62        0.03        0.8716      
06011                11       16.31        1.73        0.1887      

------------------------------------------------------------------------------
Common genotypes
0202+03032            4        5.96        0.64        0.4226      
03032+06011           3        6.26        1.69        0.1931      
      Total           7       12.21
------------------------------------------------------------------------------

and ./bin/pypop.py -c tests/data/minimal.ini data/samples/USAFEL-UchiTelle.pop

generated these results with the current master branch:

                          Observed         Expected       Chi-square   DoF   p-value        
------------------------------------------------------------------------------------------
           Common Too many parameters for chi-square test.

------------------------------------------------------------------------------------------
 Lumped genotypes Value not calculated.

------------------------------------------------------------------------------------------
  Common + lumped Value not calculated.

------------------------------------------------------------------------------------------
  All homozygotes               17             9.82             5.25     1  0.0219*          
------------------------------------------------------------------------------------------
All heterozygotes               30            37.18             1.39     1  0.2389           
------------------------------------------------------------------------------------------
Common heterozygotes by allele
0202                            14            15.74             0.19        0.6602           
0301                            11             9.71             0.17        0.6796           
03032                           14            19.66             1.63        0.2018           
0304                             4             7.32             1.51        0.2199           
05031                            6             5.62             0.03        0.8716           
06011                           11            16.31             1.73        0.1887           

------------------------------------------------------------------------------------------
Common genotypes
0202+03032                       4             5.96             0.64        0.4226           
03032+06011                      3             6.26             1.69        0.1931           
            Total                7            12.21
------------------------------------------------------------------------------------------

I think that those fixes generate acceptable presentations. Nice catch!

alexlancaster commented 7 years ago

Great.