Closed nabil161289 closed 10 months ago
Hi @nabil161289
Good catch!
I expected for each 2-locus haplotype to see ld.d and ld.dprime, ld-chisq, obs.freq, and exp (expected frequencies), and a p-value for testing. However, in the 2-locus-haplo file ld.d, ld.dprime, ld.chisq, and exp are showing ****.
Yes, this is indeed a bug - it looks like they got dropped when we switched the genotype separator from :
to ~
. I know what's causing it - and I'm working on a fix. I'll post some Python wheels for testing here (or a link to them) when I have some for you to test.
Furthermore, observed frequencies are duplicated, once named as allele.freq, and another as obs.freq. Finally, haplotypes are named "alleles" and their frequencies as "allele.freq".
This is more of an enhancement, than a bug per se, but we should probably keep it in the issue. But I agree it's confusing to call them allele
when it's really a haplotype
. The reasons it was originally done this way are a historical hack - basically so some R programs could re-use the same code for making pairwise allele comparisons and pairwise LD. There's no reason to keep them the same now, though, so we should probably relabel all the columns in *-locus-haplo.tsv
to more accurately reflect their contents.
@rsingle : since changing these columns could theoretically break someone's downstream pipeline (are you still relying on these column names for your R scripts?), I think it's better to get this into 1.0.0, so we don't have to do a major API breaking release so soon after releasing 1.0.0 :) Also we've already breaking with 0.7.0 behaviour by changing the suffix to .tsv
and some other changes.
Hi @nabil161289 try this wheel, which should match your platform (Python 3.11/Windows):
Download it and then unzip and install:
unzip wheel.zip
pip install pypop_genomics-1.0.0rc1.post5+g280920f-cp311-cp311-win_amd64.whl
This currently just fixes the missing data. That's the critical piece. I'll work on renaming the columns soon, needs some more discussion.
Hi @alexlancaster, thanks a lot it worked really well! I followed those steps and I can now see the LD stats. But there are similar problems with the other .tsv files:
Hi @alexlancaster, thanks a lot it worked really well! I followed those steps and I can now see the LD stats. But there are similar problems with the other .tsv files:
- P-values and the Z statistic of the LRT are not calculated in the 2-locus summary file (columns lrt.pval and lrt.z), Showing **** as well.
If you can make a complete list of everything missing that will help me fix everything in one go.
- In the 2-locus haplotype, p-values are not calculated, only the chi-squared statistic (it's on 1 degrees of freedom I guess? I think that could be a suggestion).
There is an option to use the permutation test to get p-values, but it's computationally expensive, and I haven't tested it for a while. @rsingle may have some more thoughts on that too. And yes that's more of an enhancement if there isn't already a column.
Hi @alexlancaster, sorry for inconvenience, will do that and check everything at once henceforth.
- P-values and the Z statistic of the LRT are not calculated in the 2-locus summary file (columns lrt.pval and lrt.z), Showing **** as well.
I double checked that those two specific columns will be populated if you make the allPairwiseLDWithPermu
option non-zero, i.e.
[Emphaplofreq]
allPairwiseLD=1
allPairwiseLDWithPermu=1000
The number represents the number of permutations, it should be at least 1000 or greater. Note that this will make your run-time much longer if you have a highly polymorphic data set with many rows and loci.
- In the 2-locus haplotype, p-values are not calculated, only the chi-squared statistic (it's on 1 degrees of freedom I guess? I think that could be a suggestion).
So hypothetically you could compute a p-value for each individual haplotype, from a chisq, but would need to check what the right dof and whether that's meaningful due to multiple comparisons issue. That could be simple to implement downstream of PyPop now that you have the chisq value. Either way it would be an enhancement.
- P-values and the Z statistic of the LRT are not calculated in the 2-locus summary file (columns lrt.pval and lrt.z), Showing **** as well.
The number represents the number of permutations, it should be at least 1000 or greater. Note that this will make your run-time much longer if you have a highly polymorphic data set with many rows and loci.
hi @nabil161289 here is a new wheel:
that includes a new unit test for the permutation and also adds back p-value and permutation information to the plain text output. You can run the same sample data using the minimal-with-permu.ini
file (essentially adding in the line above): https://github.com/alexlancaster/pypop/blob/162-2-locus-ld-statistics-are-not-calculated-in-the-tsv-output-files/tests/data/minimal-with-permu.ini
Once you've verified that this works, we'll next work on renaming the columns headers with more intuitive names.
- In the 2-locus haplotype, p-values are not calculated, only the chi-squared statistic (it's on 1 degrees of freedom I guess? I think that could be a suggestion).
So hypothetically you could compute a p-value for each individual haplotype, from a chisq, but would need to check what the right dof and whether that's meaningful due to multiple comparisons issue. That could be simple to implement downstream of PyPop now that you have the chisq value. Either way it would be an enhancement.
The chi-square values are not intended to be used to generate p-values that can be interpreted to give you statistical significance. They are only there to give you a rough idea of deviation from expectation. With so many haplotypes being tested, it means that you would need to do massive multiple testing correction in any case. So we emphatically do not recommend computing or using p-values computed from the chi-square values for individual haplotypes as they are not meaningful. You can get statistical significance, however at the locus-level using the permutation approach described above, but not for individual haplotypes.
@nabil161289 merged in changes in #164 that will rename columns to be more intuitive, and also add the ALD measures:
rename 2-locus-haplo.tsv
columns more accurately: allele
-> haplotype
, exp
-> haplotype.no-ld.count
add ALD measures in 2-locus-summary.tsv
: columns are ald.1_2
and ald.2_1
remove obs
and obs.freq
columns which were duplicative of haplotype.count
and haplotype.freq
, respectively
rename columns for multilocus haplotypes for 3 or more loci, allele
-> haplotype
and locus
-> loci
This should be available shortly as 1.0.0rc2, but if you want to test before can get the wheels from the current build system, if you visit: https://github.com/alexlancaster/pypop/actions/runs/6752944169 and download the artifact.zip
file, unzip and install the relevant wheel. I think you have to be logged into github to see the downloadable file.
Hi @alexlancaster and thanks for the new wheel. I applied the new changes on the minimal.ini also and looks very good.
- rename
2-locus-haplo.tsv
columns more accurately:allele
->haplotype
,exp
->haplotype.no-ld.count
- add ALD measures in
2-locus-summary.tsv
: columns areald.1_2
andald.2_1
- remove
obs
andobs.freq
columns which were duplicative ofhaplotype.count
andhaplotype.freq
, respectively- rename columns for multilocus haplotypes for 3 or more loci,
allele
->haplotype
andlocus
->loci
Concerning those changes, I downloaded the artifact.zip but I can't determine which wheel precisely.
- rename
2-locus-haplo.tsv
columns more accurately:allele
->haplotype
,exp
->haplotype.no-ld.count
- add ALD measures in
2-locus-summary.tsv
: columns areald.1_2
andald.2_1
- remove
obs
andobs.freq
columns which were duplicative ofhaplotype.count
andhaplotype.freq
, respectively- rename columns for multilocus haplotypes for 3 or more loci,
allele
->haplotype
andlocus
->loci
Concerning those changes, I downloaded the artifact.zip but I can't determine which wheel precisely.
@nabil161289 I made a new 1.0.0rc2 release and it's up on pypi so you can just:
pip install -U pypop-genomics
In the future if you're installing a wheel manually from artifact.zip
you should look for the one that matches your OS and Python version.
So in your case it's the wheel that has the tags cp311
(CPython 3.11) and win_
(for Windows) in the filename. In this case there's only one file that matches, since there's only one Windows architecture it builds on (amd64
).
@alexlancaster thanks a lot! I installed the new update and now looks amazing! Thanks a lot Alex for bearing with me. Cheers!
and many thanks to @rsingle too!
In the future if you're installing a wheel manually from
artifact.zip
you should look for the one that matches your OS and Python version.So in your case it's the wheel that has the tags
cp311
(CPython 3.11) andwin_
(for Windows) in the filename. In this case there's only one file that matches, since there's only one Windows architecture it builds on (amd64
).
@nabil161289 I discovered an even easier workaround for installing from a list of local wheels contained in the artifact.zip
:
mkdir /tmp/wheels
cd /tmp/wheels
unzip /path/to/artifact.zip
pip install --only-binary :all: --find-links . pypop-genomics
What the last command does is force pip
to look for the package in the current (.
) directory, i.e. --find-links .
and force it install a binary wheel (i.e. not any source packages that might also be present in the directory) --only-binary :all:
. pip
will automatically pick the "right" wheel from the directory and install it. (Note that you still have to supply the package name: pypop-genomics
).
Hi @alexlancaster, that's really good to hear and will really help in the future! So "--only-binary :all: --find-links" forces pip to pick the right wheel given the OS and python version?
Hi @alexlancaster, that's really good to hear and will really help in the future! So "--only-binary :all: --find-links" forces pip to pick the right wheel given the OS and python version?
Yes the argument to --find-links
can be either a directory or a URL. Run pip install --help
for a full description of options.
To reproduce the issue
Relevant log output
No response
Expected behavior
I expected for each 2-locus haplotype to see ld.d and ld.dprime, ld-chisq, obs.freq, and exp (expected frequencies), and a p-value for testing. However, in the 2-locus-haplo file ld.d, ld.dprime, ld.chisq, and exp are showing ****. Furthermore, observed frequencies are duplicated, once named as allele.freq, and another as obs.freq. Finally, haplotypes are named "alleles" and their frequencies as "allele.freq".
Version
pypop version 1.0.0b7 python version 3.11.5 Device: HP EliteBook 840 G2 Notebook OS: Windows 10 pro
How did you download the software?
Install via
pip install pypop-genomics
Additional context
USAFEL-UchiTelle-small.zip
Data upload
.ini
and.pop
files sufficient to reproduce the bug.