RMBarnard / raresim

Scalable rare-variant simulations
MIT License

Lengths of legend and hap files do not match #13

Open harish3689 opened 5 months ago

harish3689 commented 5 months ago

Hello. I am trying to run RAREsim but I keep getting an error message saying "Lengths of legend 28687 and hap 28686 files do not match". Moreover, there is run-to-run variability with the same input files and the hap file lengths in other runs are 28684, 28685, etc. I checked the lengths of the hap files generated by Hapgen and they seem to be of the expected size (28687). I believe this might have something to do with the "sparse" function in "rareSim.pyx" - either the generation of the sparse matrix file and/or loading it? I am running Raresim v 2.0, with Hapgen v 2.1.2, on an Ubuntu 20.04 machine. Any help would be much appreciated. Thanks in advance.
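
A quick way to confirm the counts quoted in the error message, independently of raresim, is to count the rows in both files directly. The sketch below is not part of raresim. It assumes the legend file has a single header line and the hap file has none (the usual Hapgen layout), and the function names are made up for illustration.

```python
import gzip


def count_data_lines(path, has_header=False):
    """Count non-empty lines in a plain-text or gzip-compressed file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n = sum(1 for line in handle if line.strip())
    # Assumption: at most one header line, only subtracted when requested.
    return n - 1 if has_header and n > 0 else n


def check_lengths(legend_path, hap_path):
    """Return (n_legend, n_hap, match) for a legend/hap file pair."""
    n_legend = count_data_lines(legend_path, has_header=True)
    n_hap = count_data_lines(hap_path)
    return n_legend, n_hap, n_legend == n_hap
```

If the two counts agree on the raw Hapgen output but raresim still reports a mismatch, the rows are most likely being lost during the conversion to (or loading of) the sparse matrix rather than in the input files.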

JessMurphy commented 3 months ago

Hi Harish, are you receiving an error when running the setup.py script? If so, the C code may not be compiling fully, which could cause issues with the sparse matrix.

harish3689 commented 3 months ago

I see a bunch of warnings, but no error, when I run setup.py. These are the last few lines of the compilation:

At top level:
rareSim.c:6072:18: warning: ‘__pyx_f_7rareSim_from_bytes’ defined but not used [-Wunused-function]
 6072 | static PyObject *__pyx_f_7rareSim_from_bytes(PyObject *__pyx_v_s) {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~
zip_safe flag not set; analyzing archive contents...
__pycache__.rareSim.cpython-310: module references __file__

harish3689 commented 3 months ago

Hi Jessica,

Thank you for replying to my post on Github. If it’s easier, we can try to resolve this over email and then post the solution on Github. I would really like to get this resolved soon. I am up for chatting over a video call if that’s easier for you. Thank you so much. I really appreciate your help.

Best, Harish

JessMurphy commented 3 months ago

I could do email or Github - whichever is easiest for you. Are you using the latest version of raresim cloned from the Github repo? An update to the C code for the sparse matrix was made a couple of months ago.

harish3689 commented 3 months ago

I have tested the 4 versions under Releases: v1.0, v2.0, v2.1.0 and v2.1.1.

JessMurphy commented 3 months ago

The C update was made after the latest pre-release (v2.1.1), so I would clone the code directly from the repo (git clone https://github.com/RMBarnard/raresim.git) and see if that works.

RMBarnard commented 3 months ago

Hello,

Sorry to take so long to get to this, life has been quite busy lately. I remember that back in the fall I created a local branch to experiment with re-architecting raresim, to simplify the code and eliminate the need for the C lib entirely so that debugging would be friendlier. I'll go find that branch and see what kind of progress I made, since this C lib has been causing issues for quite a while now. If I can get something working end to end, I'll push it so that it can be tested.

RMBarnard commented 3 months ago

My old branch got lost, but I went ahead and created a new one today and essentially re-did the entire C-based lib in Python. In initial testing, converting a haps file to a haps.sm file appears to work, and I was able to get the same (if not slightly better) space efficiency with the compression. I want to do a little more testing with it and I still need to implement reading from a gzipped file, but it looks promising.

harish3689 commented 3 months ago

Thank you for your comments. I tried the latest version of the code on the Github repo (not any of the releases and pre-releases) and I no longer get the "Lengths of legend and hap files do not match" error.

However, I am noticing that the number of positions in the pruned haplotypes file is significantly (about an order of magnitude) lower compared to the original haplotypes file. Is this expected? Why/how is this happening? How are we supposed to deal with the missing data?

RMBarnard commented 3 months ago

Can you expand on your question? During the pruning process, any rows of all zeros will be removed from the haplotype (and legend) files. At one point the -z parameter was added to try to tell the program not to remove such rows, but I think success with that flag has been somewhat hit or miss.

Would that explain what you are seeing or are you seeing a different issue?

JessMurphy commented 3 months ago

Hi Harish, as Ryan said, pruned variants are removed from the files, but we're working on updating the -z flag so that those rows don't have to be removed. Once the -z flag is updated, we'll make a new release that also includes the C update. For now, we've been using the following workaround in R to add the rows of zeros back into the haplotypes.

haps.pruned.all = data.frame(matrix(0, nrow=nrow(leg.all), ncol=ncol(haps.pruned)))
haps.pruned.all[which(leg.all$id %in% leg.pruned$id),] = haps.pruned
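
For anyone not working in R, the same idea can be sketched in plain Python. This is only an illustration and not part of raresim: the function name is hypothetical, and it assumes the pruned haplotypes are held as a list of rows in the same order as the pruned legend IDs. Unlike the R lines, which rely on the pruned rows appearing in the same order as in the full legend, this version matches rows by ID.

```python
def restore_pruned_rows(all_ids, pruned_ids, haps_pruned):
    """Rebuild a full-size haplotype matrix, filling removed variants with zeros.

    all_ids     -- variant IDs from the original (unpruned) legend
    pruned_ids  -- variant IDs from the pruned legend
    haps_pruned -- rows of the pruned haplotype file, one per pruned ID
    """
    if len(pruned_ids) != len(haps_pruned):
        raise ValueError("pruned legend and pruned haplotypes differ in length")
    n_cols = len(haps_pruned[0]) if haps_pruned else 0
    row_by_id = dict(zip(pruned_ids, haps_pruned))
    # Variants missing from the pruned legend come back as all-zero rows.
    return [list(row_by_id.get(vid, [0] * n_cols)) for vid in all_ids]
```

The result has one row per variant in the original legend, so it lines up with the unpruned legend file again.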

harish3689 commented 3 months ago

Thank you, Ryan and Jess. That makes sense. But I am seeing some examples where rows are removed even though they do not contain all zeros. Is this expected, and if so, is there an explanation for this?

RMBarnard commented 3 months ago

If I had to guess without debugging, I would say the rows you are seeing removed are most likely being emptied by the pruning algorithm. If a row has some ones in it but all of them are pruned out, the row ends up as all zeros and is then removed.
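
To make that concrete, here is a toy sketch of how a row that starts with some ones can still disappear. It is not raresim's pruning algorithm: the function names are hypothetical, and the random choice of which ones to flip is only an assumption made for illustration.

```python
import random


def prune_row(row, n_remove, rng):
    """Set up to n_remove randomly chosen 1s in the row to 0 (toy pruning)."""
    ones = [i for i, allele in enumerate(row) if allele == 1]
    pruned = list(row)
    for i in rng.sample(ones, min(n_remove, len(ones))):
        pruned[i] = 0
    return pruned


def drop_all_zero_rows(ids, haps):
    """Remove variants whose haplotype row contains no 1s."""
    kept = [(vid, row) for vid, row in zip(ids, haps) if any(row)]
    return [vid for vid, _ in kept], [row for _, row in kept]


rng = random.Random(0)
ids = ["v1", "v2"]
haps = [
    prune_row([0, 1, 0, 1], 2, rng),  # both 1s pruned, so the row becomes all zeros
    prune_row([1, 1, 0, 0], 1, rng),  # one 1 survives, so the row is kept
]
kept_ids, kept_haps = drop_all_zero_rows(ids, haps)
```

Here v1 had two ones in the input but is absent from the output, which matches the behaviour described above: the row is removed because every one in it was pruned, not because it was all zeros to begin with.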