alexlancaster / pypop

PyPop: Python for Population Genomics
http://pypop.org
GNU General Public License v2.0
22 stars 15 forks source link

2-locus LD statistics are not calculated in the TSV output files #162

Closed nabil161289 closed 10 months ago

nabil161289 commented 10 months ago

To reproduce the issue

  1. pip install pypop-genomics
  2. pypop --enable-tsv -c minimal.ini USAFEL-UchiTelle-small.pop

Relevant log output

No response

Expected behavior

I expected for each 2-locus haplotype to see ld.d and ld.dprime, ld-chisq, obs.freq, and exp (expected frequencies), and a p-value for testing. However, in the 2-locus-haplo file ld.d, ld.dprime, ld.chisq, and exp are showing ****. Furthermore, observed frequencies are duplicated, once named as allele.freq, and another as obs.freq. Finally, haplotypes are named "alleles" and their frequencies as "allele.freq".

Version

pypop version 1.0.0b7 python version 3.11.5 Device: HP EliteBook 840 G2 Notebook OS: Windows 10 pro

How did you download the software?

Install via pip install pypop-genomics

Additional context

USAFEL-UchiTelle-small.zip

Data upload

alexlancaster commented 10 months ago

Hi @nabil161289

Good catch!

I expected for each 2-locus haplotype to see ld.d and ld.dprime, ld-chisq, obs.freq, and exp (expected frequencies), and a p-value for testing. However, in the 2-locus-haplo file ld.d, ld.dprime, ld.chisq, and exp are showing ****.

Yes, this is indeed a bug - it looks like they got dropped when we switched the genotype separator from : to ~. I know what's causing it - and I'm working on a fix. I'll post some Python wheels for testing here (or a link to them) when I have some for you to test.

Furthermore, observed frequencies are duplicated, once named as allele.freq, and another as obs.freq. Finally, haplotypes are named "alleles" and their frequencies as "allele.freq".

This is more of an enhancement, than a bug per se, but we should probably keep it in the issue. But I agree it's confusing to call them allele when it's really a haplotype. The reasons it was originally done this way are a historical hack - basically so some R programs could re-use the same code for making pairwise allele comparisons and pairwise LD. There's no reason to keep them the same now, though, so we should probably relabel all the columns in *-locus-haplo.tsv to more accurately reflect their contents.

@rsingle : since changing these columns could theoretically break someone's downstream pipeline (are you still relying on these column names for your R scripts?), I think it's better to get this into 1.0.0, so we don't have to do a major API breaking release so soon after releasing 1.0.0 :) Also we've already breaking with 0.7.0 behaviour by changing the suffix to .tsv and some other changes.

alexlancaster commented 10 months ago

Hi @nabil161289 try this wheel, which should match your platform (Python 3.11/Windows):

wheel.zip

Download it and then unzip and install:

unzip wheel.zip
pip install pypop_genomics-1.0.0rc1.post5+g280920f-cp311-cp311-win_amd64.whl

This currently just fixes the missing data. That's the critical piece. I'll work on renaming the columns soon, needs some more discussion.

nabil161289 commented 10 months ago

Hi @alexlancaster, thanks a lot it worked really well! I followed those steps and I can now see the LD stats. But there are similar problems with the other .tsv files:

  1. P-values and the Z statistic of the LRT are not calculated in the 2-locus summary file (columns lrt.pval and lrt.z), Showing **** as well.
  2. In the 2-locus haplotype, p-values are not calculated, only the chi-squared statistic (it's on 1 degrees of freedom I guess? I think that could be a suggestion).
alexlancaster commented 10 months ago

Hi @alexlancaster, thanks a lot it worked really well! I followed those steps and I can now see the LD stats. But there are similar problems with the other .tsv files:

  1. P-values and the Z statistic of the LRT are not calculated in the 2-locus summary file (columns lrt.pval and lrt.z), Showing **** as well.

If you can make a complete list of everything missing that will help me fix everything in one go.

  1. In the 2-locus haplotype, p-values are not calculated, only the chi-squared statistic (it's on 1 degrees of freedom I guess? I think that could be a suggestion).

There is an option to use the permutation test to get p-values, but it's computationally expensive, and I haven't tested it for a while. @rsingle may have some more thoughts on that too. And yes that's more of an enhancement if there isn't already a column.

nabil161289 commented 10 months ago

Hi @alexlancaster, sorry for inconvenience, will do that and check everything at once henceforth.

alexlancaster commented 10 months ago
  1. P-values and the Z statistic of the LRT are not calculated in the 2-locus summary file (columns lrt.pval and lrt.z), Showing **** as well.

I double checked that those two specific columns will be populated if you make the allPairwiseLDWithPermu option non-zero, i.e.

[Emphaplofreq]
allPairwiseLD=1
allPairwiseLDWithPermu=1000

The number represents the number of permutations, it should be at least 1000 or greater. Note that this will make your run-time much longer if you have a highly polymorphic data set with many rows and loci.

  1. In the 2-locus haplotype, p-values are not calculated, only the chi-squared statistic (it's on 1 degrees of freedom I guess? I think that could be a suggestion).

So hypothetically you could compute a p-value for each individual haplotype, from a chisq, but would need to check what the right dof and whether that's meaningful due to multiple comparisons issue. That could be simple to implement downstream of PyPop now that you have the chisq value. Either way it would be an enhancement.

alexlancaster commented 10 months ago
  1. P-values and the Z statistic of the LRT are not calculated in the 2-locus summary file (columns lrt.pval and lrt.z), Showing **** as well.

The number represents the number of permutations, it should be at least 1000 or greater. Note that this will make your run-time much longer if you have a highly polymorphic data set with many rows and loci.

hi @nabil161289 here is a new wheel:

wheel.zip

that includes a new unit test for the permutation and also adds back p-value and permutation information to the plain text output. You can run the same sample data using the minimal-with-permu.ini file (essentially adding in the line above): https://github.com/alexlancaster/pypop/blob/162-2-locus-ld-statistics-are-not-calculated-in-the-tsv-output-files/tests/data/minimal-with-permu.ini

Once you've verified that this works, we'll next work on renaming the columns headers with more intuitive names.

  1. In the 2-locus haplotype, p-values are not calculated, only the chi-squared statistic (it's on 1 degrees of freedom I guess? I think that could be a suggestion).

So hypothetically you could compute a p-value for each individual haplotype, from a chisq, but would need to check what the right dof and whether that's meaningful due to multiple comparisons issue. That could be simple to implement downstream of PyPop now that you have the chisq value. Either way it would be an enhancement.

The chi-square values are not intended to be used to generate p-values that can be interpreted to give you statistical significance. They are only there to give you a rough idea of deviation from expectation. With so many haplotypes being tested, it means that you would need to do massive multiple testing correction in any case. So we emphatically do not recommend computing or using p-values computed from the chi-square values for individual haplotypes as they are not meaningful. You can get statistical significance, however at the locus-level using the permutation approach described above, but not for individual haplotypes.

alexlancaster commented 10 months ago

@nabil161289 merged in changes in #164 that will rename columns to be more intuitive, and also add the ALD measures:

This should be available shortly as 1.0.0rc2, but if you want to test before can get the wheels from the current build system, if you visit: https://github.com/alexlancaster/pypop/actions/runs/6752944169 and download the artifact.zip file, unzip and install the relevant wheel. I think you have to be logged into github to see the downloadable file.

nabil161289 commented 10 months ago

Hi @alexlancaster and thanks for the new wheel. I applied the new changes on the minimal.ini also and looks very good.

  • rename 2-locus-haplo.tsv columns more accurately: allele -> haplotype, exp -> haplotype.no-ld.count
  • add ALD measures in 2-locus-summary.tsv: columns are ald.1_2 and ald.2_1
  • remove obs and obs.freq columns which were duplicative of haplotype.count and haplotype.freq, respectively
  • rename columns for multilocus haplotypes for 3 or more loci, allele -> haplotype and locus -> loci

Concerning those changes, I downloaded the artifact.zip but I can't determine which wheel precisely.

alexlancaster commented 10 months ago
  • rename 2-locus-haplo.tsv columns more accurately: allele -> haplotype, exp -> haplotype.no-ld.count
  • add ALD measures in 2-locus-summary.tsv: columns are ald.1_2 and ald.2_1
  • remove obs and obs.freq columns which were duplicative of haplotype.count and haplotype.freq, respectively
  • rename columns for multilocus haplotypes for 3 or more loci, allele -> haplotype and locus -> loci

Concerning those changes, I downloaded the artifact.zip but I can't determine which wheel precisely.

@nabil161289 I made a new 1.0.0rc2 release and it's up on pypi so you can just:

pip install -U pypop-genomics

In the future if you're installing a wheel manually from artifact.zip you should look for the one that matches your OS and Python version.

So in your case it's the wheel that has the tags cp311 (CPython 3.11) and win_ (for Windows) in the filename. In this case there's only one file that matches, since there's only one Windows architecture it builds on (amd64).

nabil161289 commented 10 months ago

@alexlancaster thanks a lot! I installed the new update and now looks amazing! Thanks a lot Alex for bearing with me. Cheers!

nabil161289 commented 10 months ago

and many thanks to @rsingle too!

alexlancaster commented 10 months ago

In the future if you're installing a wheel manually from artifact.zip you should look for the one that matches your OS and Python version.

So in your case it's the wheel that has the tags cp311 (CPython 3.11) and win_ (for Windows) in the filename. In this case there's only one file that matches, since there's only one Windows architecture it builds on (amd64).

@nabil161289 I discovered an even easier workaround for installing from a list of local wheels contained in the artifact.zip:

mkdir /tmp/wheels
cd /tmp/wheels
unzip /path/to/artifact.zip
pip install --only-binary :all: --find-links . pypop-genomics

What the last command does is force pip to look for the package in the current (.) directory, i.e. --find-links . and force it install a binary wheel (i.e. not any source packages that might also be present in the directory) --only-binary :all:. pip will automatically pick the "right" wheel from the directory and install it. (Note that you still have to supply the package name: pypop-genomics).

nabil161289 commented 10 months ago

Hi @alexlancaster, that's really good to hear and will really help in the future! So "--only-binary :all: --find-links" forces pip to pick the right wheel given the OS and python version?

alexlancaster commented 10 months ago

Hi @alexlancaster, that's really good to hear and will really help in the future! So "--only-binary :all: --find-links" forces pip to pick the right wheel given the OS and python version?

Yes the argument to --find-links can be either a directory or a URL. Run pip install --help for a full description of options.