AlphaGenes / AlphaImpute2

AlphaImpute2: pedigree- and population-based genotype imputation
MIT License
11 stars 6 forks source link

Test example error #16

Open XingerTang opened 1 year ago

XingerTang commented 1 year ago

While running the test example, it raises the following error

Traceback (most recent call last):

File "/Users/evie/mambaforge/bin/AlphaImpute2", line 8, in <module>

sys.exit(main())

File "/Users/evie/mambaforge/lib/python3.10/site-packages/alphaimpute2/tinyhouse/Utils.py", line 8, in timer

values = func(*args, **kwargs)

File "/Users/evie/mambaforge/lib/python3.10/site-packages/alphaimpute2/alphaimpute2.py", line 304, in main

write_out_data(pedigree, args)

File "/Users/evie/mambaforge/lib/python3.10/site-packages/alphaimpute2/tinyhouse/Utils.py", line 8, in timer

values = func(*args, **kwargs)

File "/Users/evie/mambaforge/lib/python3.10/site-packages/alphaimpute2/alphaimpute2.py", line 255, in write_out_data

pedigree.writePhase(args.out + ".haplotypes")

File "/Users/evie/mambaforge/lib/python3.10/site-packages/alphaimpute2/tinyhouse/Pedigree.py", line 943, in writePhase

if ind.haplotypes.ndim == 2:  # diploid

AttributeError: 'tuple' object has no attribute 'ndim'
gregorgorjanc commented 1 year ago

@XingerTang do you have any sense of what is driving this error? can you also share here what exact command you use to get this error?

XingerTang commented 1 year ago

@gregorgorjanc The command I used to get this error is just the command to run the example code

cd docs
bash run_script.sh

The error is probably caused by the datatype of haplotypes in Pedigree.py of the submodule tinyhouse.

In line 155 of Pedigree.py, the haplotypes attribute of an object of the Individual class is initialized as a tuple.

self.haplotypes = (np.full(nLoci, 9, dtype = np.int8), np.full(nLoci, 9, dtype = np.int8))

However, in the function writePhase(), it tries to take the ndim of the haplotypes, which is not possible as it is a tuple instead of a numpy array.

gregorgorjanc commented 1 year ago

@XingerTang this is odd - tinyhouse and its haplotype attribute of the Individual class must be used elsewhere too, say in AlphaPeel. How is it handled there? Or do we never call for ind.haplotypes.ndim elsewhere?

What do you propose we do to address this issue?

XingerTang commented 1 year ago

@gregorgorjanc We only call ind.haplotype.ndim in writePhase() and writePhasePed() which are not frequently used.

I think it can be addressed by replacing the ind.haplotype.ndim with some other code and I believe it won't be a big problem though I need a bit more time.

XingerTang commented 1 year ago

@gregorgorjanc The problem is a little bit more complicated than I thought. I checked all the places where ind.haplotypes has been used and in some of them it is used as an ndarray, and for others it is used as a tuple of ndarrays.

For the ndarray case, ind.haplotypes is initialized in Pedigree.decode_alleles() and being used in Pedigree.encode_alleles() apart from writePhase() and writePhasePed(). But for the Pedigree.decode_alleles(), I find that the function will never be called as the condition to call it will never be satisfied (the condition is an argument ped which is not in the argument dictionary to be not None). For the Pedigree.encode_alleles(), it is only used in writePhase() and writePhasePed().

For the a tuple of ndarrays case, it is the one that is actually used when the program runs, no matter whether the input pedigree file exists or not.

I guess what happened is that, during the development of the program, the developers originally want to accept multiploid input, which the ndarray allows. Yet, later they decide to limit the case to only take diploid input, which the tuple is enough. So some of the functions written previously left and causes the mismatch of the datatypes.

If what I guess is correct, what I need to do is remove some of the functions that are not being used and then update writePhase() and writePhasePed().

gregorgorjanc commented 1 year ago

@XingerTang thanks for looking into this!

Which packages have you looked at? I think we should make sure these tools work (these use tinyhouse) - will they work with your proposed change in tiny house?

https://github.com/AlphaGenes/AlphaPeel

https://github.com/AlphaGenes/AlphaImpute2

https://github.com/AlphaGenes/AlphaFamImpute

https://github.com/AlphaGenes/AlphaPlantImpute2

https://github.com/AlphaGenes/AlphaAssign

XingerTang commented 1 year ago

@gregorgorjanc The datatype of ind.haplotypes in AlphaPeel and AlphaImpute2 are consistent, so I think the problems are only in the tinyhouse, it is where all the above inconsistencies occur. (And I just found that it is exactly the issue AlphaGenes/tinyhouse#1.)

If the issue is fixed, then the error won't be raised when writePhase() and writePhasePed() are called by the above programmes.

XingerTang commented 1 year ago

@gregorgorjanc I checked all 5 packages and all of them except AlphaPlantImpute2 are using the tuple datatype for ind.haplotypes. However, AlphaPlantImpute2 is using the ndarray datatype.

Another difference between the AlphaPlantImpute2 and the other 4 that the AlphaPlantImpute2 allows the input options -ped [PED [PED ...]] and -library LIBRARY which can take in PLINK files. However for the rest 4, they can only take in plink files with the input option -bfile [BFILE [BFILE ...]].

As AlphaPlantImpute2 is using the ndarray datatype, a simple modification of the datatype of the ind.haplotypes in tinyhouse would lead to inconsistencies of AlphaPlantImpute2. Since AlphaPlantImpute2 is quite different from the others in many ways, it is difficult to modify it to make it consistent with the others. So, the simplest way to solve everything is writing two separate systems in parallel.

gregorgorjanc commented 1 year ago

@XingerTang I think the best way forward would indeed be to first work with tuple datatype for ind.haplotypes since that is used more commonly across all the packages we are working with. If we work with these packages in the devel versions and then bump the version, we will get a working solution for these packages, and then we will later work with the AlphaPlantImpute2. Ideally, we would have the same framework for all the packages, but let's cross that bridge once we fix the other packages.

Having said all this I now wonder why we are in this situation! The other packages are somewhat older that the AlphaPlantImpute2, sort of. So, I wonder if recent changes in tinyhouse have been towards the use in AlphaPlantImpute2 (that is, using ndarray and not a tuple of ndarrays), which is causing the issue in other packages? Can you please check this?

XingerTang commented 1 year ago

@gregorgorjanc Sure! I will do that later!

XingerTang commented 1 year ago

@gregorgorjanc Yes, you are correct, tinyhouse is shifting from using tuple to ndarrays. We can first publish a stable version with tuple and then figure out how to move toward the ndarray on a separate branch.

gregorgorjanc commented 1 year ago

@XingerTang does this mean we should:

1) Be upgrading everything to ndarrays since this is where tinyhouse is going to?

2) Reverting some changes in tinyhouse and tweak AlphaPlantImpute2?

XingerTang commented 1 year ago

@gregorgorjanc I think we will first do 2 then do 1... But let's just wait for the response.

gregorgorjanc commented 1 year ago

@XingerTang which version/commit of tinyhouse uses only tuple of ndararys? Maybe we just stick to that version for now and port recent changes (help) to that, make it the main branch and push the current state with ndarray to devel branch?

Waiting for Steve, but noticed he moved to Ox. Have contacted him on Twitter;)

gregorgorjanc commented 1 year ago

@XingerTang looks like that Steve agrees with your proposed path (https://github.com/AlphaGenes/AlphaImpute2/issues/16#issuecomment-1623834978) forward!

Let's revert tinyhouse to tuples of ndarrays and when can down the line think what will need to be done for AlphaPlantImpute2.

XingerTang commented 1 year ago

@gregorgorjanc I tried to summarize what was happening with the tinyhouse, and found two possible points we can revert to.

Stable commits

writePhase() is the function directly called by the example of AlphaImpute2 and caused the error. It is also used in AlphaFamImpute.

The hash of the commit that modified the datatype of ind.haplotypes in tinyhouse.Pedigree.writePhase() is 67b18bc (its parent is d5d4f9d).

So, if our aim is only to fix the bug, it is enough to revert to d5d4f9d.

Still, there are functions other than tinyhouse.Pedigree.writePhase() that are using ndarray datatype and were created prior to the modification of the writePhase(), but they are not used by the packages except AlphaPlantImpute2, so they won't raise any error when you are using other 4 packages.

# the functions using ndarray
tinyhouse.Pedigree.readInPed()
tinyhouse.Pedigree.decode_alleles()
tinyhouse.Pedigree.encode_alleles()
tinyhouse.Pedigree.update_allele_coding()
tinyhouse.Pedigree.writeGenotypesPed()
tinyhouse.Pedigree.writePhasePed()

These functions are mostly created in the commit 0b02d79 (its parent is 0eb5c24) and experienced some small modifications later.

So, to get rid of all functions in tinyhouse with ndarray datatype, we need to revert to 0eb5c24.

Possible influences of reverting

To d5d4f9d

Not many things are changed since that commit, so not many things we would lose after reverting. At that point, tinyhouse was basically compatible with all 5 packages. But, tinyhouse.Pedigree.writePhase() would raise an error when AlphaImpute2 is calling it. Also, even though the functions in tinyhouse won't raise other errors, they are kind of messy.

To 0eb5c24

By reviewing the code, you can see that the updates toward the use of ndarray are closely related to updates toward the use of PLink and -ped (but this might not be the only updated feature that is correlated with the use of ndarray).

These features would disappear if we revert the commits, but luckily they are all solely used by AlphaPlantImpute2 so it won't affect a lot.

But it also means that we might need to rewrite AlphaPlantImpute2.

gregorgorjanc commented 1 year ago

Huh, messy. Excellent detective job!

How much work would it be in AlphaPeel and AlphaImpute2 (key tools just now, others could be adapted later) to move to ndarray structure? It feels that this should not be too hard of a job (the fix should be easy) - I can help - and then we just keep with the current tinyhouse

XingerTang commented 1 year ago

@gregorgorjanc haplotypes is an attribute of the object of class individual, which is defined in Pedigree.py in the package tinyhouse.

The way AlphaPeel and AlphaImpute2 use it is by importing the Pedigree.py and instantiating the individual objects. So they are not accessing the attributes haplotypesdirectly.

Thus, for AlphaPeel and AlphaImpute2 to move to ndarray structure, we need to modify everything in tinyhouse related to the tuple datatype of the haplotypes.

I'm not sure which method is easier, but this way is still kind of hard as almost everything in tinyhouse except those functions I listed above assumes the tuple datatype.

gregorgorjanc commented 1 year ago

Ok, looks like we really should work with tuples then.

A thought for intermediate period - can we modify tinyhouse.Pedigree.writePhase() to work with both cases? Nah, that will not work since it's a class atributte.

To save the work in tinyhouse since the bug was introduced, I propose the following:

1) Tag current tinyhouse with mix_of_tuple_and_ndarray just so we know what's the status 2) We keep the tiny house submodule at the current state in AlphaPlantImpute2 so it will work as expected 3) Create a devel branch of tinyhouse and fix tinyhouse.Pedigree.writePhase() by moving to tuple of ndarrays; the same for AlphaPeel and AlphaImpute2 - devel branch and we can experiment etc 4) If that brakes more code, let's fix it towards tuples (will help) 5) We deal with AlphaPlantImpute2 (moving to triples) down the line (it will still work until we upgrade the sub module)

how does this sound?

XingerTang commented 1 year ago

@gregorgorjanc I agree with most of it! But for step 3, I think we actually don't need to do anything with AlphaPeel and AlphaImpute2, as everything messy is inside tinyhouse. Also, I am a little confused with steps 3 and 4. It seems like both are moving towards tuples of ndarrays.

gregorgorjanc commented 1 year ago

@XingerTang by 3. and 4. I meant that let's do whatever needs doing in tinyhouse (on a devel branch) and if need be also on AlphaPeel and AlphaImpute2 (on a devel branch; but this fixing might not be needed at all here, but if it does, then we should just do it so we make some progress forward).

OK?

XingerTang commented 1 year ago

@gregorgorjanc OK! Let's do that.

sthorn commented 1 year ago

It seems to me that tinyhouse's Individual class was written to use tuples of haplotypes. I can see that I changed writePhase() to use an ndarray rather than the a tuple.

I think this is due to the way I wrote AlphaPlantImpute2’s write_library(), which forces the Individual object to use an ndarray instead of the tuple.

My conclusion is that the change I made to write_phase() was a mistake. So, I think that reverting that change and altering write_library() is the likely best way to fix this. However, I'm not sure what impact this would have on the other software, or, if it would break AlphaPlantImpute2 in other ways.

gregorgorjanc commented 1 year ago

@sthorn thanks for chipping in.

Yes, we will revert back to tuples of haplotypes and momentarily break AlphaPlantImpute2, but since we use submodules we are fine for the intermediate period.

Good, we are now going in the right direction;)

XingerTang commented 1 year ago

@gregorgorjanc Hi! I tried to follow the workflow you proposed. For 1, I created a tag on my forked repository. However, I found that GitHub does not allow to include the tags in the pull request. Since the main branch now preserve the information of the current state, I think this might not be a big problem. For 3, after the AlphaGenes/tinyhouse#4 is approved, I would update the submodule reference of the devel branch of AlphaImpute2 to see if the problem is solved.

gregorgorjanc commented 1 year ago

@XingerTang

1) I have now added the tag at https://github.com/AlphaGenes/tinyhouse/releases/tag/mix_of_tuple_and_ndarray

3) I merged your PR in

Let's see where this takes us;)

gregorgorjanc commented 1 year ago

@XingerTang I am still having the same problem as noted above - this is what I have done - have I missed anything?

git clone --recurse-submodules https://github.com/AlphaGenes/AlphaImpute2.git
mv AlphaImpute2 AlphaImpute2_upstream_test
cd AlphaImpute2_upstream_test
git pull
git checkout devel
git pull
python3 -m venv AlphaImpute2_env
source AlphaImpute2_env/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade build
python3 -m build
python3 -m pip install dist/AlphaImpute2-0.0.3-py3-none-any.whl
cd example
chmod a+x run_script.sh
./run_script.sh
deactivate AlphaImpute2_env

This is the tinyhouse

cd src/alphaimpute2/tinyhouse
git log

--> commit d62c78a66a4dc48c10e2d0a8353d944855d6da83 (HEAD, origin/main, origin/HEAD, main)
Merge: ef71923 bea5f86
Author: Gregor Gorjanc <gregor.gorjanc@gmail.com>
Date:   Fri Jun 23 18:43:34 2023 +0100

    Merge pull request #2 from XingerTang/main

    Modify InputOutput.py so it accepts -help

Aha, we still have the old submodule, but hmm, you have updated yesterday https://github.com/AlphaGenes/AlphaImpute2/pull/19/commits/be681d8214fe34f8077b5eadb1ba3bc69c721da4 - I don't think you have moved the reference to the latest submodule version have you?

XingerTang commented 1 year ago

@gregorgorjanc I have moved the reference to the latest submodule.

Screen Shot 2023-07-14 at 1 23 55 PM

I think the problem is git does not track other branches of the submodule, even if I put the reference to the HEAD commit of that branch, git just ignores it and continues to use the reference to the main branch. So it should be fixed if you checkout the devel branch of the submodule before generating the wheel.

git clone --recurse-submodules https://github.com/AlphaGenes/AlphaImpute2.git
mv AlphaImpute2 AlphaImpute2_upstream_test
cd AlphaImpute2_upstream_test
git pull
git checkout devel
git pull
cd src/alphaimpute2/tinyhouse
git checkout devel
cd ../../..
python3 -m venv AlphaImpute2_env
source AlphaImpute2_env/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade build
python3 -m build
python3 -m pip install dist/AlphaImpute2-0.0.3-py3-none-any.whl
cd example
chmod a+x run_script.sh
./run_script.sh
deactivate AlphaImpute2_env

Also, there is a way to track other branches of the submodule by modifying .gitmodules, however, it basically tells git only uses the reference of that branch and I'm not sure if we want the devel branch of the AlphaImpute2 only uses the reference of the devel branch of the tinyhouse, so I didn't change it.

gregorgorjanc commented 1 year ago

@XingerTang I think this must be related to how the submodule is configured initially? Anyway, I made it work on my end, so we achieved what we aimed to achieve! Yay!

This means that the fix in tinyhouse worked for AlphaImpute2 and likely some other tools, but we know this will cause issue for AlphaPlantImpute2 down the line (once we update tinyhouse reference in there). Since we have tagged tinyhouse commit that works for AlphaPlantImpute2 and we have an issue open for that in https://github.com/AlphaGenes/AlphaPlantImpute2/issues/3, we should now conclude that we can merge tinyhouse:devel branch into tinyhouse:main and update reference to it in AlphaImpute2 and AlphaPeel. This means that AlphaImpute2 and AlphaPeel should work out of the box - there will be no submodule issues.

Can you please do this - let's leave this issue open till that happens.

We are nearly there.

Thanks!

XingerTang commented 1 year ago

@gregorgorjanc Great!

What I got from Googling is that Git initially doesn't support tracking the non-default branch at all and adds this feature later at some point, and I guess that is the reason why we have to configure it by ourselves.

I will do

I will tick the checkbox once I'm done.

gregorgorjanc commented 1 year ago

Great planning and use of tickboxes;) Will try to use them more in the future too.

XingerTang commented 1 year ago

@gregorgorjanc Can I merge the devel branch to the main branch of both AlphaPeel and AlphaImpute2 before I create any pull requests?

gregorgorjanc commented 1 year ago

@XingerTang best if we first do it on the devel branch and once we are entirely happy with the status of the devel, we merge into main. The idea of the main is that it should always be deployable. I am aware that you are asking this because the way submodule tinyhouse reference seems to work only with the main branch, but for the AlphaPeel and AlphaImpute2 we don't have that limitation. OK?

XingerTang commented 1 year ago

@gregorgorjanc Sure I will do that!

XingerTang commented 1 year ago

@gregorgorjanc As I went through the code of AlphaPlantImpute2, I found that tinyhouse.CombinedHMM and tinyhouse.HaplotypeLibrary also uses the ndarray datatype intensively in addition to tinyhouse.Pedigree. While AlphaPeel, AlphaImpute2 and AlphaAssign haven't used tinyhouse.CombinedHMM and tinyhouse.HaplotypeLibrary, we have AlphaFamImpute uses tinyhouse.CombinedHMM in AlphaFamImpute.AlphaFamImpute and tinyhouse.HaplotypeLibrary in AlphaFamImpute.FamilyImputation, so AlphaFamImpute will also be influenced if we move toward tuples.

I didn't notice the datatype issue of AlphaFamImpute last time since it doesn't use individual.haplotypes() explicitly and everything in ndarray is in tinyhouse.

So to move forward, I may need to make modifications to tinyhouse.CombinedHMM, tinyhouse.HaplotypeLibrary and AlphaFamImpute as well.

Another thing I want to mention is that these modules using ndarray datatypes sometimes expect the haplotypes only have one haplotype instead of a pair of haplotypes which is expected in other softwares or modules that are using tuple datatype. And I think that might cause problems while moving toward the tuple datatype.

gregorgorjanc commented 1 year ago

@XingerTang great detective work!

This is likely the root cause of the unfortunate code split in tinyhouse - by "this" I mean the need to have one or more haplo functionality. We can revisit this question down the line.

At this stage for me, the first priority is on AlphaPeel due to current projects in the lab - two projects will be using AlphaPeel. So my strategy is to ensure AlphaPeel is easy to install, well documented, and reasonably tested so we ensure accuracy.

Second priority is for AlphaImpute2, then other tools.

With the above in mind and your detective work, I propose that you open an issue in all the affected Alpha tools and link to this issue here as a background explanation. Having those issues open there will remind us why the problem is, where the problem is coming from, and what we might want to do about it in the future. To avoid duplication, I propose we move this tuple vs ndarray discussion to tinyhouse as that is where the root cause is, while other Alpha tools only use the tinyhouse - but opening issues in the other Alpha tools and linking to this issue here and the tinyhouse issue will give us all the information to move onwards.

OK?

sthorn commented 1 year ago

I still think moving to tuples makes sense. Some refactoring of AlphaPlantImpute2 will be needed. Haploid samples can be held in a tuple of length one.

XingerTang commented 1 year ago

@sthorn Thanks for your comment, I will try to do it.

gregorgorjanc commented 1 year ago

Good point @sthorn! That could be a nice fix.