Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0
118 stars 22 forks source link

Index out of bounds error in check_ref() #88

Open evakoe opened 2 months ago

evakoe commented 2 months ago

Dear all, I am running gwaslab v3.4.43 to convert metal summary stats. For some files, my code finished successfully, yet for some I get an error with the check_ref() function. Can you advise me if this is a user error or a bug? I post the output below. Thank you!

sumstats.check_ref(ref_seq=ref) 2024/04/18 13:06:03 Start to check if NEA is aligned with reference sequence...v3.4.43 2024/04/18 13:06:03 -Current Dataframe shape : 74049597 x 11 ; Memory usage: 5265.82 MB 2024/04/18 13:06:03 -Reference genome FASTA file: /home/shared/bioinf/NGSpipeline/data/GRCh38/GRCh38_primary_contigs.fasta 2024/04/18 13:06:03 -Loading fasta records:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
2024/04/18 13:06:23 -Checking records 2024/04/18 13:06:31 -Building numpy fasta records from dict 2024/04/18 13:07:09 -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 ) Traceback (most recent call last): File "", line 1, in File "/home/ekoenig/miniconda3/envs/gwaslab_test/lib/python3.9/site-packages/gwaslab/g_Sumstats.py", line 431, in check_ref self.data = checkref(self.data,ref_seq,log=self.log,*kwargs) File "/home/ekoenig/miniconda3/envs/gwaslab_test/lib/python3.9/site-packages/gwaslab/hm_harmonize_sumstats.py", line 650, in checkref sumstats.loc[to_check_ref,status] = check_status(sumstats_to_check, all_records_dict, log=log, verbose=verbose) File "/home/ekoenig/miniconda3/envs/gwaslab_test/lib/python3.9/site-packages/gwaslab/hm_harmonize_sumstats.py", line 597, in check_status sumstats.loc[condition, status] = _fast_check_status(sumstats_cond, record=record, starting_positions=starting_pos_cond) File "/home/ekoenig/miniconda3/envs/gwaslab_test/lib/python3.9/site-packages/gwaslab/hm_harmonize_sumstats.py", line 506, in _fast_check_status output_nea = np.take(record, indices) File "/home/ekoenig/miniconda3/envs/gwaslab_test/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 192, in take return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode) File "/home/ekoenig/miniconda3/envs/gwaslab_test/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 59, in _wrapfunc return bound(args, **kwds) IndexError: index 3071194638 is out of bounds for axis 0 with size 3031042421

Cloufield commented 2 months ago

Hi, Thanks for the feedback. I think this might be a bug since we vectorized the implementation since v3.4.42. We will check where the error is. For now, you can use sumstats.check_ref(ref_seq=ref, ref_seq_mode="s") to switch back to the old implemetation.

Cloufield commented 2 months ago

I just did some tests and I think probably you have variants with POS exceeding the maximum length of the reference sequence of that chromosome. Would you please check the max POS for each CHR in your sumstats? It would be great if you could let me know if there are such outliers. Thanks!

Cloufield commented 2 months ago

@sup3rgiu Hi, just let you know that when POS of a variant > the length of the reference sequence of that chromosome, there will be errors or wrong results for those variants in check_ref(). I am not sure if this will also affect your large-scale analysis.

I added additional outlier checks for this in this [commit].(https://github.com/Cloufield/gwaslab/commit/2a902cbc705695618b9b3149a9f4f289472162f4) Basically, create a records_len_dict, use that dict to check and create masks for outliers first, and then set the values to PADDING_VALUE using that mask for output_ea/nea. I would appreciate it if you could let me know if I missed something here. Thanks!

evakoe commented 2 months ago

Hi Yunye, here are the max positions for all my chromosomes (GRCh38). Indeed already for chr1, the position exceeds the length of the chromosome, I need to investigate this. Thank you.

Chr Pos_b38 1 249144386 2 242183241 3 198229721 4 190181953 5 181477945 6 170744431 7 159335860 8 145999616 9 140444627 10 134464014 11 135076606 12 133265030 13 114353906 14 138557333 15 101980928 16 90226380 17 83246833 18 80262670 19 101685764 20 64334132 21 46699801 22 50806802 23 196193117

sup3rgiu commented 2 months ago

Hi @Cloufield, Thanks for letting me know. Even if it's more a problem with the input sumstats and/or reference sequence, your changes seem like a good workaround!

I also saw that you changed the ref_seq_mode back to the old one. That's fine by me, but you should probably add a log message to let the user know about the possibility of speeding up execution with the v mode. Thanks!

Cloufield commented 2 months ago

Hi @sup3rgiu Thanks for checking the commit. I changed default mode it because I was not very confident about the change I made. After you checking, I will change it back to v mode in next version and revise the documentation. Thanks again for your effort.

sup3rgiu commented 2 months ago

Got it. I will do some extra tests as soon as I can and then I will report back to you. Thanks