Closed sjmack closed 7 years ago
This is actually a more pressing issue than Issue #18, as the first allele is left-truncated, so that the allele-family and protein fields are being truncated.
Also, allele-lengths appear to be properly accommodated by the Chen/Diff test results:
3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error: 0
p-value (overall): 0.0000*****
Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03:02:12+03:02:12 (18/8.818363) 0.0037** 0.0037**
05:03:01:01+03:02:12 (0/18.105788) 0.0001**** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
06:02:01+05:03:01:01 (0/23.499002) 0.0000***** 0.0000*****
06:02:01+06:02:01 (27/14.854291) 0.0000***** 0.0000*****
06:03:01+04:02:01 (8/0.287425) 0.0000***** 0.0000*****
06:05:01+05:03:01:01 (0/18.105788) 0.0000***** 0.0000*****
06:05:01+06:05:01 (18/8.818363) 0.0019** 0.0019**
3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****
Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03:02:12+03:02:12 (18/8.818363) 0.0017** 0.0017**
05:03:01:01+03:02:12 (0/18.105788) 0.0000***** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
06:02:01+05:03:01:01 (0/23.499002) 0.0000***** 0.0000*****
06:02:01+06:02:01 (27/14.854291) 0.0007*** 0.0007***
06:03:01+04:02:01 (8/0.287425) 0.0000***** 0.0000*****
06:05:01+05:03:01:01 (0/18.105788) 0.0000**** 0.0000****
06:05:01+06:05:01 (18/8.818363) 0.0016** 0.0016**
Although these Chen/diff results would be easier to read if they were structured more like the other HW results:
03:02:12+03:02:12 (18/8.818363) 0.0037** 0.0037**
05:03:01:01+03:02:12 (0/18.105788) 0.0001**** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
However, this latter would be a "nice to have".
Got it, so we'll make this dynamic too. How often is the plain text output used these days? The TSV/CSV output obviously doesn't have this problem since there is no attempt to format things. Everything is always generated from the complete "output of record" (which is the XML), i.e. both the TSV/CSV files and the plain text are derived from the XML, and so can be regenerated at any time without having to redo the analysis.
In case you want to try this, you just need the xsltproc
tool to be installed (part of libxml2
) and you can run the following to regenerate the text output:
xsltproc xslt/text.xsl BIGDAWG_SynthControl_Data-out.xml > BIGDAWG_SynthControl_Data-out.txt
git pull
again. I changed this to use the maximum possible genotype length across the whole input XML and use this as the initial column width for the common genotype (rather than be locus-specific). Currently it is globally hardcoded anyway. Try it on a few files.
Common Genotypes look good, but as I commented in Issue #18, the common genotypes are right aligned, which looks weird.
I can't really say how frequently PyPop users are relying on the .txt output as opposed to the .xml output.
Personally, I always look at the .txt output first, because I want to know if I need to rerun the data because I set something wrong, or need to change a setting. I use popmeta to generate .dat files from the .xml output, but then I have to look through each .dat file to see what's what.
I do know that I get questions from the community asking how to get the d'ij values out of PyPop; I direct the questioners to the .xml and popmeta, so that suggests to me that many others are relying primarily on the .txt outputs.
Great, if this is fixed from your POV, please close. I opened up a new issue on the Chen/diff stat alignment issue in #20.
There is also the --generate-tsv
command line option to pypop.py
that generates the .dat
files and the .txt
and .xml
files all in one hit. Probably should document this a bit more (feel free to open up an issue on this too).
Okay. I'll close. I hadn't been using --generate-tsv, but it is good to be reminded of it.
Hi @sjmack , in testing the new code with dynamic columns based on the allele lengths, I noticed that if the allele names were short it could sometimes cutoff the labels (using the USAFEL example). Try running the USAFEL example before you update to see what I mean. (I'm going to move this to the test suite now so we catch this).
In any case, I just committed a fix to master that should address these, try running again. You might want to look at the output to make sure it looks OK. I had to hardcode the dashed row separator as being 90 characters. it could be made dynamic, but it doesn't seem to me to be a big priority.
Okay. I see.
./bin/pypop.py -c data/samples/minimal.ini data/samples/USAFEL-UchiTelle.pop
Generated results like this before a pull:
Observed Expected Chi-square DoF p-value
------------------------------------------------------------------------------
Common Too many parameters for chi-square test.
------------------------------------------------------------------------------
d genotypes Value not calculated.
------------------------------------------------------------------------------
on + lumped Value not calculated.
------------------------------------------------------------------------------
homozygotes 17 9.82 5.25 1 0.0219*
------------------------------------------------------------------------------
terozygotes 30 37.18 1.39 1 0.2389
------------------------------------------------------------------------------
Common heterozygotes by allele
0202 14 15.74 0.19 0.6602
0301 11 9.71 0.17 0.6796
03032 14 19.66 1.63 0.2018
0304 4 7.32 1.51 0.2199
05031 6 5.62 0.03 0.8716
06011 11 16.31 1.73 0.1887
------------------------------------------------------------------------------
Common genotypes
0202+03032 4 5.96 0.64 0.4226
03032+06011 3 6.26 1.69 0.1931
Total 7 12.21
------------------------------------------------------------------------------
and ./bin/pypop.py -c tests/data/minimal.ini data/samples/USAFEL-UchiTelle.pop
generated these results with the current master branch:
Observed Expected Chi-square DoF p-value
------------------------------------------------------------------------------------------
Common Too many parameters for chi-square test.
------------------------------------------------------------------------------------------
Lumped genotypes Value not calculated.
------------------------------------------------------------------------------------------
Common + lumped Value not calculated.
------------------------------------------------------------------------------------------
All homozygotes 17 9.82 5.25 1 0.0219*
------------------------------------------------------------------------------------------
All heterozygotes 30 37.18 1.39 1 0.2389
------------------------------------------------------------------------------------------
Common heterozygotes by allele
0202 14 15.74 0.19 0.6602
0301 11 9.71 0.17 0.6796
03032 14 19.66 1.63 0.2018
0304 4 7.32 1.51 0.2199
05031 6 5.62 0.03 0.8716
06011 11 16.31 1.73 0.1887
------------------------------------------------------------------------------------------
Common genotypes
0202+03032 4 5.96 0.64 0.4226
03032+06011 3 6.26 1.69 0.1931
Total 7 12.21
------------------------------------------------------------------------------------------
I think that those fixes generate acceptable presentations. Nice catch!
Great.
:pypop sjmack$ ./bin/pypop.py -c WS_BDCtrl_Test_HW.ini BIGDAWG_SynthControl_Data.pop
is generating the following output for the Hardy Weinberg common genotypes test:As with Issue #18 the increased size of the colon-delimited allele names is larger than the hard-coded 18 characters allowed for a + delimited allele pair.
With digit delimited allele names, the maximum length for a pair would have been 20 characters [but in retrospect, I think that PyPop is stripping expression variant suffixes like N and L, as those are optional)], so the maximum length would have been 17 characters (01010101+01010101).
As with Issue #18 the size of the Common genotypes allele pair field needs to be increased to at least 25 characters (e.g., 104:01:01:01+06:127:01:01), and may possibly need to be made flexible to accommodate longer allele names.