AlphaGenes / AlphaFamImpute

MIT License
2 stars 2 forks source link

Algorithm breaks when a map is supplied using 1kg_extendend dataset #3

Open Npaffen opened 1 year ago

Npaffen commented 1 year ago

AlphaFamImpute, as all Alphatools who do phasing and imputation I guess, need a map file to recognize multiple chromosomes. I used the 1000 genomes extended dataset featuring around 600 trios added the 23andme data from my family to generate some phasing data to compare this in a plot. In theory this should work with this tool.

Command that was running when the error appeared : AlphaFamImpute -o /media/oem/ssd_01/alphaimp_data/pa_offspring_map -genotype /media/oem/ssd_01/alphaimp_data/genotype_ll.txt -pedigree /media/oem/ssd_01/alphaimp_data/pedigree.txt -iothreads 12 -map /media/oem/ssd_01/Genomics_prac_guide/map/genetic_map_hg38_withX.txt

Error that appeared: /home/oem/.local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice. return _methods._mean(a, axis=axis, dtype=dtype, /home/oem/.local/lib/python3.10/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) Segmentation fault (core dumped)

The last output from Alphafamimpute was (on chr2 I guess): Imputing family of NA18523 and MotherOfNA18521, using 1 high-density offspring.

When I try to run Alphafamimpute without the map on the full chromosome set it works but should thereby provide false results since the algorithm anticipates only one chromosome.

gregorgorjanc commented 1 year ago

@Npaffen thanks for reporting this!

I have never ran the Alpha tools with more than one chromosome and I think this is the mode we assume, but should definitely allow for! Is this where the error is coming from? Please test and let us know.

Npaffen commented 1 year ago

I don't know if the mode is the error since it looks like that the first chromosome seems to run without problems. The second then also looks fine until a certain sample, the one I mentioned in the end of my report. Could you look into this? My python knowledge is only mediocre but it looks like either something with the data or with the algorithm produces an empty vector/matrix? I will try to run AlphaFamImpute again using only the second chromosome of the data. If this works I will try to run AlphaFamImpute chromosome wise without a map. If this fixes it then there is some problem with either the map-imput data , which should be correct since the 1000 genomes extendend dataset is by default in hg38?

Npaffen commented 1 year ago

A run just with chr2 seems to work! Will now run AlphaFamImpute chromosome-wise and report back.

gregorgorjanc commented 1 year ago

I would appreciate a reproducible example.

--

With regards!


University of Edinburgh Gregor Gorjanc, PhD

The Roslin Institute Group leader & Royal Society Fellow

Easter Bush @@.***https://twitter.com/GregorGorjanc

EH25 9RG @@.**@.>

Scotland, UK www.ed.ac.uk/roslin/highlanderlabhttp://www.ed.ac.uk/roslin/highlanderlab


On Tue, Jan 24 2023 at 11:31 am, Npaffen @.**@.>> wrote: This email was sent to you by someone outside the University. You should only click on links or attachments if you are certain that the email is genuine and the content is safe.

Run with chr2 and this seems to work! Will now run AlphaFamImpute chromosome-wise and report back.

— Reply to this email directly, view it on GitHubhttps://github.com/AlphaGenes/AlphaFamImpute/issues/3#issuecomment-1401780815, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAFRGSXO3VSR4BZ42RCNZATWT64O5ANCNFSM6AAAAAAUEXHLKI. You are receiving this because you commented.Message ID: @.***>

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.

Npaffen commented 1 year ago

I anticipate you mean a reproducible example of the error. Since I don't want to share my 23andme data I will have to try this just with the 1kg_extendend dataset and see if this happens without my data. I will report back if this error is reproducible just with the 1kg_extendend dataset.

gregorgorjanc commented 1 year ago

Maybe just a tiny fraction of the dataset - say first few loci?

Npaffen commented 1 year ago

Sure this should be possible. Let me prep something and see if I can reproduce the error with the subset.

Npaffen commented 1 year ago

I finally was able to create a reproducible example based only on the data of the 1000 genomes extended dataset including ~600 trios. I wrote a R-script and added all necessary files to the folder. The map-file is a subset of the genetic_map_hg38_withX_alpha.txt. This subsetted map only contains the autosome-positions and the header is removed so that the format of the documentation is satisfied. Be aware that you need plink 1.9 and convertf from the EIGENSTRAT software package and datamash sudo apt install datamash to run my code. While plink is a standard tool and I guess you are familiar with it, convertf was used to recreate your genotype.txt format. This step was necessary since the option to use plink-format as input failed on my linux system #4

Link to the files : https://e1.pcloud.link/publink/show?code=XZsLGmZsN2waHGnatf49E7M4IauwfV5kVpy