fastlmm / FaST-LMM

Python version of Factored Spectrally Transformed Linear Mixed Models
https://fastlmm.github.io/
Apache License 2.0

Why did Bed.pos switch from int to float? #10

Closed remomomo closed 3 years ago

remomomo commented 3 years ago

Hi Carl,

I have two different environments set up, one with pysnptools 0.4.11 and one with 0.4.26.

When I open the same PLINK bed (/fam/bim) file with the earlier release (0.4.11), the coordinates are integers:

>>> bed.pos[0]
array([    1,     0, 69081])

With the later release (0.4.26), the coordinates are floats, and many rows have nan in the second column (the genetic distance):

>>> bed.pos[0]
array([1.0000e+00,        nan, 6.9081e+04])

My question: is this a bug or a new feature? The documentation says the array is float (in the first case, it is not). Also, why did the genetic distance change from 0 to nan?

I use the coordinates to make intersections based on positions and came across this behaviour because all of a sudden my intersections did not work anymore...

best, Remo

CarlKCarlK commented 3 years ago

Remo,

Thank you for using PySnpTools! Sorry for the confusion and for not announcing this change on the mailing list.

What happened: Recently, when I created the standalone bed-reader (https://pypi.org/project/bed-reader/), I realized that the different SnpReaders (https://fastlmm.github.io/PySnpTools/#module-pysnptools.snpreader), such as Bed and SnpNpz (https://fastlmm.github.io/PySnpTools/#snpreader-snpnpz), were inconsistent in the dtype of pos (https://fastlmm.github.io/PySnpTools/#pysnptools.snpreader.SnpReader.pos) and in how they represented missing values. This made reading from one SnpReader and writing to another unpredictable. Now, pos is always an array of floats (because genetic distance must be a float) and a missing value is always NaN.
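To illustrate why position-based intersections can stop working after this change, here is a minimal sketch in plain NumPy. The two small pos arrays are made up for illustration (only the first row comes from the question above), and the helper name `chrom_bp_keys` is hypothetical. It is not how PySnpTools itself does intersections.

```python
import numpy as np

# Made-up pos rows (chromosome, genetic distance, bp position), float dtype,
# with NaN standing in for a missing genetic distance.
pos_a = np.array([[1.0, np.nan, 69081.0], [1.0, np.nan, 69134.0]])
pos_b = np.array([[1.0, np.nan, 69081.0], [1.0, np.nan, 70000.0]])

# NaN never compares equal to itself, so whole-row equality fails
# even though the two rows look identical.
rows_equal = bool((pos_a[0] == pos_b[0]).all())

def chrom_bp_keys(pos):
    """Hypothetical helper: build integer (chromosome, bp) keys, ignoring genetic distance."""
    return {(int(chrom), int(bp)) for chrom, bp in pos[:, [0, 2]]}

# Intersecting on chromosome and bp position only sidesteps the NaN column.
shared = chrom_bp_keys(pos_a) & chrom_bp_keys(pos_b)
print(rows_equal, shared)
```

This assumes the chromosome and bp columns themselves contain no NaN. If they can, those rows need to be filtered or mapped to a sentinel first.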

Workarounds to get integer BP positions with 0 as missing:

    import numpy as np
    from pysnptools.snpreader import Bed
    from pysnptools.util import example_file  # Download and return local file name

    bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.",".bed")

    snp_on_disk = Bed(bedfile, count_A1=False)
    bp_as_int = np.array(snp_on_disk.pos[:,2], dtype='int')  # convert to int
    bp_as_int[snp_on_disk.pos[:,2] != snp_on_disk.pos[:,2]] = 0  # convert missing to 0 (NaN != NaN)
    bp_as_int

Outputs: array([ 0, 1, 4, ..., 50, 51, 52])

Alternatively, with the standalone bed-reader:

    import numpy as np
    from bed_reader import open_bed

    with open_bed(bedfile) as bed:
        bp_as_int = bed.bp_position
    bp_as_int

Outputs: array([ 0, 1, 4, ..., 50, 51, 52])
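If you want the same conversion without reading a file, or without casting NaN to an integer first, the following is a small self-contained sketch. The function name `bp_to_int` and the sample values are assumptions for illustration; it uses only standard NumPy calls.

```python
import numpy as np

def bp_to_int(bp_float, missing=0):
    """Hypothetical helper: float bp positions (NaN = missing) -> int64 with a sentinel."""
    bp_float = np.asarray(bp_float, dtype=float)
    # Replace NaN before the cast so no undefined float-to-int conversion happens.
    return np.where(np.isnan(bp_float), missing, bp_float).astype(np.int64)

# Made-up sample standing in for snp_on_disk.pos[:, 2]
bp = bp_to_int([np.nan, 1.0, 4.0])
print(bp)
```

Replacing NaN before the cast avoids relying on whatever value a NaN-to-int conversion happens to produce on a given platform.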

Finally,

If you are intersecting sets of integers, let me put in a plug for a sub-library of PySnpTools called IntRangeSet (https://fastlmm.github.io/PySnpTools/#util-intrangeset). It works efficiently with sets of large integers, representing them internally as sorted sets of integer ranges.

This can make things easier to read and may be more efficient. Here is an example comparing IntRangeSet to the default set:

    from pysnptools.util import IntRangeSet
    print(IntRangeSet(bp_as_int))
    print(set(bp_as_int))

Outputs:

    IntRangeSet('0:64')
    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
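To give a feel for the idea of storing integers as sorted ranges, here is a rough pure-Python sketch. It is not IntRangeSet's actual implementation or API; the function names `to_ranges` and `intersect_ranges` are hypothetical, and ranges are half-open (start, stop) pairs, matching the '0:64' notation in the output above.

```python
def to_ranges(values):
    """Collapse integers into sorted, half-open (start, stop) ranges."""
    ranges = []
    for v in sorted(set(values)):
        if ranges and ranges[-1][1] == v:
            # v directly extends the last range
            ranges[-1] = (ranges[-1][0], v + 1)
        else:
            ranges.append((v, v + 1))
    return ranges

def intersect_ranges(a, b):
    """Intersect two sorted range lists with a two-pointer sweep."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        stop = min(a[i][1], b[j][1])
        if start < stop:
            out.append((start, stop))
        # Advance whichever range ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

print(to_ranges(range(64)))
print(intersect_ranges(to_ranges(range(64)), to_ranges(range(60, 70))))
```

The sweep touches each range once, so its cost depends on the number of ranges rather than the number of integers, which is why contiguous runs of positions stay cheap.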

Thanks again for using our tools. Yours,

Carl

Carl Kadie, Ph.D.
FaST-LMM & PySnpTools Team (https://fastlmm.github.io/) (Microsoft Research, retired)
https://www.linkedin.com/in/carlk/

Join the FaST-LMM user discussion and announcement list via email (mailto:fastlmm-user-join@python.org?subject=Subscribe) or use the web sign-up (https://mail.python.org/mailman3/lists/fastlmm-user.python.org).
