FinucaneLab / pops

ValueError when running pops.py #10

Open twoneu opened 2 years ago

twoneu commented 2 years ago

Hi all,

Thank you for putting this package together. I am attempting to run PoPs on publicly available GWAS summary statistics but am running into the following error at the pops.py step:

Traceback (most recent call last):
  File "/usr/path/pops/pops.py", line 912, in <module>
    main(config_dict)
  File "/usr/path/pops/pops.py", line 879, in main
    preds_df = pops_predict(mat, rows, cols, coefs_df)
  File "/usr/path/pops/pops.py", line 685, in pops_predict
    pred = mat.dot(coefs_df.loc[cols].beta.values)
ValueError: shapes (18383,743) and (1867,) not aligned: 743 (dim 1) != 1867 (dim 0)

I think this issue has to do with my munged feature files. I followed Step 0 to munge the "mouse_brain2" feature files from https://github.com/FinucaneLab/gene_features and tried to rerun your example analysis, but the same error occurs. Could you advise me on the best way to generate the munged features from that GitHub repo?

Thank you!

vinodhsri commented 2 years ago

Hi, I am experiencing a similar issue. It looks like the dimensions of the munged feature matrix do not match the coefficient vector when the matrix dot product is computed.

  File "./src/pops-master/pops.py", line 685, in pops_predict
    pred = mat.dot(coefs_df.loc[cols].beta.values)
ValueError: shapes (18383,105) and (651,) not aligned: 105 (dim 1) != 651 (dim 0)

Any thoughts/pointers?

thanks

vinodhsri commented 2 years ago

I think I figured out the issue. The gene_features datasets deposited at 'https://github.com/FinucaneLab/gene_features/tree/master/features', stratified by tissue type, have a few column headers whose names repeat across multiple datasets.

For example, in the human_airway datasets available at 'https://github.com/FinucaneLab/gene_features/tree/master/features/human_airway', a few column names repeat across multiple datasets ('Cluster6', for instance, is a column name in several files). The fix is to make the column names unique across datasets, so that the munge process treats each column as unique.

With this fix, each column name selects exactly one column in 'pops.py', so the dot product dimensions match.
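To list every header that collides, not only 'Cluster6', something like the following sketch could be used. It assumes the feature files are tab-separated with the gene ID in the first column; `read_header` and `find_shared_headers` are hypothetical helper names and are not part of pops.

```python
import gzip
from collections import defaultdict


def read_header(path):
    """Return the column names from the first line of a tab-separated file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Drop the first field, assumed to be the gene ID column (e.g. ENSG).
        return handle.readline().rstrip("\n").split("\t")[1:]


def find_shared_headers(headers_by_file):
    """Map each column name to the files that use it, keeping only repeats."""
    files_by_column = defaultdict(list)
    for filename, columns in headers_by_file.items():
        for column in columns:
            files_by_column[column].append(filename)
    return {c: fs for c, fs in files_by_column.items() if len(fs) > 1}


# In-memory demo with made-up headers; real use would build the dict with
# read_header() over the files in one feature directory.
demo_headers = {
    "average_expression.txt": ["Cluster5", "Cluster6", "Allcells"],
    "diffexprs_tstat_clusters.txt": ["Cluster6", "Cluster18"],
}
for column, files in find_shared_headers(demo_headers).items():
    print(column, "->", ", ".join(files))
```

Any name this reports would need to be made unique before munging.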

`grep Cluster6 *

average_expression.txt:ENSG Cluster0 Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6 Cluster7 Cluster8 Cluster9 Cluster10 Cluster11 Cluster12 Cluster13 Cluster14 Cluster15 Cluster16 Cluster17 Cluster18 Cluster19 Cluster20 Cluster21 Cluster22 Cluster23 Cluster24 Cluster25 Allcells

diffexprs_down_genes_clusters.txt:ENSG Cluster1 Cluster16 Cluster6 Cluster14 Cluster5 Cluster9 Cluster2 Cluster26 Cluster23 Cluster22 Cluster15 Cluster7 Cluster24 Cluster12 Cluster3 Cluster21 Cluster11 Cluster8 Cluster10 Cluster18 Cluster25 Cluster17 Cluster19 Cluster4 Cluster13 Cluster20

diffexprs_genes_clusters.txt:ENSG Cluster8 Cluster19 Cluster2 Cluster17 Cluster3 Cluster23 Cluster14 Cluster11 Cluster18 Cluster21 Cluster5 Cluster22 Cluster9 Cluster25 Cluster13 Cluster10 Cluster16 Cluster24 Cluster12 Cluster7 Cluster4 Cluster20 Cluster6 Cluster1 Cluster15 Cluster26

diffexprs_tstat_clusters.txt:ENSG Cluster1 Cluster20 Cluster16 Cluster17 Cluster10 Cluster6 Cluster18 Cluster8 Cluster25 Cluster21 Cluster19 Cluster24 Cluster13 Cluster15 Cluster11 Cluster14 Cluster5 Cluster9 Cluster12 Cluster2 Cluster26 Cluster4 Cluster7 Cluster3 Cluster23 Cluster22

projected_pcaloadings_clusters.txt:ENSG Cluster0_PC_1 Cluster0_PC_2 Cluster0_PC_3 Cluster0_PC_4 Cluster0_PC_5 Cluster0_PC_6 Cluster0_PC_7 Cluster0_PC_8 Cluster0_PC_9 Cluster0_PC_10 Cluster1_PC_1 Cluster1_PC_2 Cluster1_PC_3 Cluster1_PC_4 Cluster1_PC_5 Cluster1_PC_6 Cluster1_PC_7 Cluster1_PC_8 Cluster1_PC_9 Cluster1_PC_10 Cluster2_PC_1 Cluster2_PC_2 Cluster2_PC_3 Cluster2_PC_4 Cluster2_PC_5 Cluster2_PC_6 Cluster2_PC_7 Cluster2_PC_8 Cluster2_PC_9 Cluster2_PC_10 Cluster3_PC_1 Cluster3_PC_2 Cluster3_PC_3 Cluster3_PC_4 Cluster3_PC_5 Cluster3_PC_6 Cluster3_PC_7 Cluster3_PC_8 Cluster3_PC_9 Cluster3_PC_10 Cluster4_PC_1 Cluster4_PC_2 Cluster4_PC_3 Cluster4_PC_4 Cluster4_PC_5 Cluster4_PC_6 Cluster4_PC_7 Cluster4_PC_8 Cluster4_PC_9 Cluster4_PC_10 Cluster5_PC_1 Cluster5_PC_2 Cluster5_PC_3 Cluster5_PC_4 Cluster5_PC_5 Cluster5_PC_6 Cluster5_PC_7 Cluster5_PC_8 Cluster5_PC_9 Cluster5_PC_10 Cluster6_PC_1 Cluster6_PC_2 Cluster6_PC_3 Cluster6_PC_4 Cluster6_PC_5 Cluster6_PC_6 Cluster6_PC_7 Cluster6_PC_8 Cluster6_PC_9 Cluster6_PC_10 Cluster7_PC_1 Cluster7_PC_2 Cluster7_PC_3 Cluster7_PC_4 Cluster7_PC_5 Cluster7_PC_6 Cluster7_PC_7 Cluster7_PC_8 Cluster7_PC_9 Cluster7_PC_10 Cluster8_PC_1 Cluster8_PC_2 Cluster8_PC_3 Cluster8_PC_4 Cluster8_PC_5 Cluster8_PC_6 Cluster8_PC_7 Cluster8_PC_8 Cluster8_PC_9 Cluster8_PC_10 Cluster9_PC_1 Cluster9_PC_2 Cluster9_PC_3 Cluster9_PC_4 Cluster9_PC_5 Cluster9_PC_6 Cluster9_PC_7 Cluster9_PC_8 Cluster9_PC_9 Cluster9_PC_10 Cluster10_PC_1 Cluster10_PC_2 Cluster10_PC_3 Cluster10_PC_4 Cluster10_PC_5 Cluster10_PC_6 Cluster10_PC_7 Cluster10_PC_8 Cluster10_PC_9 Cluster10_PC_10 Cluster11_PC_1 Cluster11_PC_2 Cluster11_PC_3 Cluster11_PC_4 Cluster11_PC_5 Cluster11_PC_6 Cluster11_PC_7 Cluster11_PC_8 Cluster11_PC_9 Cluster11_PC_10 Cluster12_PC_1 Cluster12_PC_2 Cluster12_PC_3 Cluster12_PC_4 Cluster12_PC_5 Cluster12_PC_6 Cluster12_PC_7 Cluster12_PC_8 Cluster12_PC_9 Cluster12_PC_10 Cluster13_PC_1 Cluster13_PC_2 Cluster13_PC_3 Cluster13_PC_4 Cluster13_PC_5 Cluster13_PC_6 Cluster13_PC_7 Cluster13_PC_8 Cluster13_PC_9 Cluster13_PC_10 Cluster14_PC_1 Cluster14_PC_2 Cluster14_PC_3 Cluster14_PC_4 Cluster14_PC_5 Cluster14_PC_6 Cluster14_PC_7 Cluster14_PC_8 Cluster14_PC_9 Cluster14_PC_10 Cluster15_PC_1 Cluster15_PC_2 Cluster15_PC_3 Cluster15_PC_4 Cluster15_PC_5 Cluster15_PC_6 Cluster15_PC_7 Cluster15_PC_8 Cluster15_PC_9 Cluster15_PC_10 Cluster16_PC_1 Cluster16_PC_2 Cluster16_PC_3 Cluster16_PC_4 Cluster16_PC_5 Cluster16_PC_6 Cluster16_PC_7 Cluster16_PC_8 Cluster16_PC_9 Cluster16_PC_10 Cluster17_PC_1 Cluster17_PC_2 Cluster17_PC_3 Cluster17_PC_4 Cluster17_PC_5 Cluster17_PC_6 Cluster17_PC_7 Cluster17_PC_8 Cluster17_PC_9 Cluster17_PC_10 Cluster18_PC_1 Cluster18_PC_2 Cluster18_PC_3 Cluster18_PC_4 Cluster18_PC_5 Cluster18_PC_6 Cluster18_PC_7 Cluster18_PC_8 Cluster18_PC_9 Cluster18_PC_10 Cluster19_PC_1 Cluster19_PC_2 Cluster19_PC_3 Cluster19_PC_4 Cluster19_PC_5 Cluster19_PC_6 Cluster19_PC_7 Cluster19_PC_8 Cluster19_PC_9 Cluster19_PC_10 Cluster20_PC_1 Cluster20_PC_2 Cluster20_PC_3 Cluster20_PC_4 Cluster20_PC_5 Cluster20_PC_6 Cluster20_PC_7 Cluster20_PC_8 Cluster20_PC_9 Cluster20_PC_10 Cluster21_PC_1 Cluster21_PC_2 Cluster21_PC_3 Cluster21_PC_4 Cluster21_PC_5 Cluster21_PC_6 Cluster21_PC_7 Cluster21_PC_8 Cluster21_PC_9 Cluster21_PC_10 Cluster22_PC_1 Cluster22_PC_2 Cluster22_PC_3 Cluster22_PC_4 Cluster22_PC_5 Cluster22_PC_6 Cluster22_PC_7 Cluster22_PC_8 Cluster22_PC_9 Cluster22_PC_10 Cluster23_PC_1 Cluster23_PC_2 Cluster23_PC_3 Cluster23_PC_4 Cluster23_PC_5 Cluster23_PC_6 Cluster23_PC_7 Cluster23_PC_8 Cluster23_PC_9 Cluster23_PC_10`
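
One plausible way repeated headers lead to the shape mismatch in the traceback: when an index contains duplicate labels, a pandas `.loc` lookup returns every matching row for each requested label, so the coefficient vector comes back longer than the number of feature columns. The toy example below only mimics the `mat.dot(coefs_df.loc[cols].beta.values)` line from the traceback; the data and sizes are made up, and the real pops.py may differ in detail.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: three munged columns, two of which share the label
# "Cluster6" because they came from different (hypothetical) feature files.
cols = ["Cluster1", "Cluster6", "Cluster6"]
mat = np.ones((4, len(cols)))  # 4 genes x 3 features
coefs_df = pd.DataFrame({"beta": [0.1, 0.2, 0.3]}, index=cols)

# Each duplicated label returns every row that matches it, so the lookup
# yields 1 + 2 + 2 = 5 coefficients rather than 3.
betas = coefs_df.loc[cols].beta.values
print(mat.shape, betas.shape)

try:
    mat.dot(betas)
except ValueError as err:
    # Same kind of "shapes ... not aligned" error as in the traceback.
    print("ValueError:", err)
```

With unique column names the lookup returns one coefficient per column and the product is defined.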

vinodhsri commented 2 years ago

Adding the highlighted lines in the snippet below to the <> should take care of the dot product dimension problem <> .

In brief, the column headers in each tissue-level dataset are modified to include the dataset filename as well, which should ensure unique column headers in the output of the munge process.

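    # Fragment of the feature-munging loop. all_feature_files, gene_annot_df and
    # nan_policy come from the surrounding script, which must import os,
    # numpy as np and pandas as pd.
    for f in all_feature_files:
        f_df = pd.read_csv(f, sep="\t", index_col=0).astype(np.float64)
        f_df = gene_annot_df.merge(f_df, how="left", left_index=True, right_index=True)
        # --- added lines: put the source file name into each column header ---
        base_filename = os.path.basename(f)
        # e.g. "average_expression.txt.gz" -> "average_expression_"
        fname = base_filename.replace(".txt.gz", "_")
        # Literal replacement; only headers containing "Cluster" are renamed.
        f_df.columns = f_df.columns.str.replace("Cluster", fname, regex=False)
        # --- end of added lines ---
        if nan_policy == "raise":
            assert not f_df.isnull().values.any(), "Missing genes in feature matrix."
        elif nan_policy == "ignore":
            pass
        elif nan_policy == "mean":
            f_df = f_df.fillna(f_df.mean())
        elif nan_policy == "zero":
            f_df = f_df.fillna(0)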
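
The patch above only renames headers that contain "Cluster" and only strips a ".txt.gz" suffix. A more general variant, sketched here under the assumption that prefixing every feature column with the file stem is acceptable, might look like this. `prefix_feature_columns` and its `skip` parameter are hypothetical and not part of pops, and the gene ID in the demo is a placeholder.

```python
import os

import pandas as pd


def prefix_feature_columns(f_df, path, skip=()):
    """Return a copy of f_df whose columns are prefixed with the file stem.

    Columns listed in `skip` (for example, annotation columns brought in by
    an earlier merge) keep their original names.
    """
    stem = os.path.basename(path)
    # Strip a trailing ".gz" and then a trailing ".txt", if present.
    for suffix in (".gz", ".txt"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
    renamed = {c: f"{stem}_{c}" for c in f_df.columns if c not in skip}
    return f_df.rename(columns=renamed)


# Made-up one-gene example; the path mirrors the repository layout.
demo = pd.DataFrame(
    {"Cluster6": [1.0], "Allcells": [2.0]},
    index=["ENSG00000000001"],  # placeholder gene ID
)
out = prefix_feature_columns(demo, "features/human_airway/average_expression.txt.gz")
print(list(out.columns))
```

Unlike the patch, this keeps the original header text and adds the file stem in front of it, so a header such as 'Allcells' would also become unique across files.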