PalamaraLab / HapNe

Haplotype-based inference of recent effective population size in modern and ancient DNA samples
GNU General Public License v3.0

vcf2fastsmc_in_parallel() takes 2 positional arguments but 5 were given #4

Open plubbe opened 3 months ago

plubbe commented 3 months ago

When trying to run HapNe-IBD as recommended by the guidelines on a standard set of human WGS (n=25), I used the provided code in a file called split_vcf_chr_arm.py:

split_vcf(vcf_file='./vcf_phased', save_in='./split_by_chr_arm/', keep=None, genome_build='grch38')

which is called by a bash script (for purposes of remote cluster computing):

python split_vcf_chr_arm.py

I receive the following error:

Traceback (most recent call last):
  File "split_vcf_chr_arm.py", line 2, in <module>
    split_vcf(vcf_file='vcf_phased', \
  File "~/HapNe/lib/python3.12/site-packages/hapne/convert/tools.py", line 213, in split_vcf
    vcf2fastsmc_in_parallel(ii, vcf_file, save_in, keep, genome_build)
TypeError: vcf2fastsmc_in_parallel() takes 2 positional arguments but 5 were given

I managed to correct this by slightly modifying the original vcf2fastsmc_in_parallel in tools.py (to accept more than 2 arguments), but thought perhaps the devs would like to rectify the code?
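For anyone hitting the same TypeError, the mismatch can be reproduced without HapNe installed. The sketch below uses hypothetical stand-in functions (convert_two_args and convert_five_args are made up for illustration; the real signature of vcf2fastsmc_in_parallel in tools.py is not shown in this thread), and it only demonstrates why a call with five positional arguments fails against a two-parameter function and succeeds once the signature is widened.

```python
def convert_two_args(index, shared_args):
    """Hypothetical stand-in that accepts exactly two positional arguments."""
    return index, shared_args


def convert_five_args(index, vcf_file, save_in, keep=None, genome_build="grch38"):
    """Hypothetical widened signature matching the five-argument call site."""
    return {
        "index": index,
        "vcf_file": vcf_file,
        "save_in": save_in,
        "keep": keep,
        "genome_build": genome_build,
    }


def call_like_split_vcf(func):
    """Call func with five positional arguments, as in the traceback above."""
    try:
        return func(1, "./vcf_phased", "./split_by_chr_arm/", None, "grch38")
    except TypeError as err:
        # Return the message so the failure can be inspected instead of crashing.
        return str(err)


error_message = call_like_split_vcf(convert_two_args)
fixed_result = call_like_split_vcf(convert_five_args)

print(error_message)
print(fixed_result)
```

Whether the actual fix should widen the callee or pack the arguments at the call site is a design decision for the maintainers; this sketch only shows the first option.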

plubbe commented 3 months ago

A few more comments for the devs:

Firstly, when running HapNe-IBD, I found a consistent issue when trying to point to the IBD output files. The issue occurred whether the config file pointed to the directory containing the output of hap-ibd or to specific .ibd.gz files.

The issue stems from the following line [line 35] of src/hapne/ibd.py:

command = f"for IBDFILE in `ls {ibd_folder}/{name}.*.ibd.gz`" \

which automatically adds an extra full stop after the filename and before the .ibd.gz suffix. I could find no way around this problem from my own script short of editing the source code and removing the extraneous full stop, so that the line reads:

command = f"for IBDFILE in `ls {ibd_folder}/{name}*.ibd.gz`" \
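To illustrate the difference between the two patterns, here is a minimal sketch using Python's fnmatch module as an approximation of the shell glob. The value of name and the file names are hypothetical examples, not actual hap-ibd output from this issue.

```python
from fnmatch import fnmatchcase

# Hypothetical values chosen for illustration only.
name = "chr1"
files = ["chr1.ibd.gz", "chr1.p.ibd.gz", "chr10.ibd.gz"]

# Pattern as quoted from ibd.py: requires "<name>." followed by something, then ".ibd.gz".
original_pattern = f"{name}.*.ibd.gz"
# Pattern after removing the extra full stop.
relaxed_pattern = f"{name}*.ibd.gz"

original_matches = [f for f in files if fnmatchcase(f, original_pattern)]
relaxed_matches = [f for f in files if fnmatchcase(f, relaxed_pattern)]

print(original_matches)
print(relaxed_matches)
```

In this sketch the original pattern skips a file named exactly name plus .ibd.gz, while the relaxed pattern picks it up. The relaxed pattern is also broader: with name set to chr1 it matches chr10.ibd.gz as well, so it may be worth checking that name is specific enough for the files in the folder.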

Secondly, the provided Python script should include the import: from hapne.ibd import build_hist

Thirdly, the outputs of build_hist(config) are stored in output_folder/IBD, but hapne_ibd(config) looks for them in output_folder/HIST. I had to modify ibd.py [line 46] accordingly to get it to find the files.

Fourthly, the config file requires another variable, nb_samples; I presume this is the number of samples in the input.

I urge the devs to make a few changes to the hapne-ibd portion of their manual, and perhaps the source code, so that other folks don't have to spend days mucking about in the source code to get things to run.

plubbe commented 3 months ago

One more thing is not clear to me: the manual says to run Hap-IBD on each chromosome arm separately. How is the data meant to be combined afterwards into a single graph? The output is theoretically a hapne_results.png for every arm. Is there code provided for combining these outputs?

vicbp1 commented 2 months ago

Hi, any updates on these questions? These are also relevant to me

rmnfournier commented 2 months ago

Sorry about the wait; yes, I am working on it; a fix should be available by the end of next week!

rmnfournier commented 1 month ago

The latest commit should resolve the issue. Thanks for bringing this to our attention!