cmap / cmapPy

Assorted tools for interacting with .gct, .gctx files and other Connectivity Map (Broad Institute) data/tools
https://clue.io/cmapPy/index.html
BSD 3-Clause "New" or "Revised" License

Following tutorial yields error #32

Closed tstoeger closed 6 years ago

tstoeger commented 6 years ago

Following the tutorial cmapPy_pandasGEXpress_tutorial.ipynb currently (2018-March-03) yields an error.

Since it uses an external data set, GEO GSE70138 (rather than a test data set contained within cmapPy), it isn't clear whether this error reflects an update or problem in cmapPy, the tutorial, or GSE70138. (Besides preventing users from following the tutorial, this error makes it difficult for new users to become familiar with gctx files and cmapPy.)

works: upper part of tutorial

import pandas as pd
sig_info = pd.read_csv("GSE70138_Broad_LINCS_sig_info.txt", sep="\t") # updated file name

vorinostat_ids = sig_info["sig_id"][sig_info["pert_iname"] == "vorinostat"]
# Let us additionally report on the data
print("number of samples treated with vorinostat:", len(vorinostat_ids))
print('\n---- show first ones for debugging ----')
for x in vorinostat_ids.values[:5]:
    print(x)

number of samples treated with vorinostat: 210

---- show first ones for debugging ----
LJP007_A375_24H:A03
LJP007_A549_24H:A03
LJP007_ASC.C_24H:A03
LJP007_ASC_24H:A03
LJP007_CD34_24H:A03

creates error: loading of records

from cmapPy.pandasGEXpress import parse
vorinostat_only_gctoo = parse(
    "GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328.gctx",   # updated file name
    cid=vorinostat_ids)
/Users/tstoeger/apps/anaconda/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids:  {'LJP009_HT29_24H:A03', 'LJP007_SKL.C_24H:A03', 'LJP008_PC3_24H:G08', 'LJP008_HCC515_24H:G07', 'LJP008_HA1E_24H:G07', 'LJP008_ASC_24H:G10', 'LJP008_NPC.CAS9_24H:A03', 'LJP008_SKL_24H:G08', 'LPROT003_PC3_6H:O11', 'LJP008_A549_24H:G09', 'LJP008_HCC515_24H:G10', 'LJP008_MCF7_24H:G07', 'LJP008_PC3_24H:G11', 'LJP008_HUVEC_24H:G11', 'LJP008_PC3_24H:A03', 'LJP009_A375_24H:A03', 'LJP008_ASC_24H:G11', 'LJP008_A549_24H:G11', 'LJP008_HEPG2_24H:G11', 'LJP008_HT29_24H:G07', 'LPROT003_A549_6H:O12', 'LJP008_HUVEC_24H:G07', 'LJP008_HUVEC_24H:G10', 'LJP008_HME1_24H:G12', 'LJP007_A375_24H:A03', 'LJP008_SKL.C_24H:G09', 'LJP008_NPC.CAS9_24H:G09', 'LJP007_HEPG2_24H:A03', 'LJP007_CD34_24H:A03', 'LPROT003_NPC_6H:P11', 'LJP008_HT29_24H:G12', 'LPROT001_A375_6H:P11', 'LJP008_HUVEC_24H:G12', 'LJP008_PC3_24H:G07', 'LJP008_ASC.C_24H:G11', 'LJP008_NEU_24H:G11', 'LJP008_SKL_24H:G10', 'LPROT003_A375_6H:P08', 'LPROT002_MCF7_6H:P12', 'LJP008_NEU_24H:G08', 'LJP008_HCC515_24H:G09', 'LJP008_ASC_24H:G08', 'LJP008_HME1_24H:A03', 'LJP008_NEU_24H:G09', 'LPROT001_PC3_6H:P10', 'LJP008_HEPG2_24H:G08', 'LJP008_HCC515_24H:A03', 'LJP009_SKL.C_24H:A03', 'LPROT003_A549_6H:O08', 'LJP009_HCC515_24H:A03', 'LJP008_ASC.C_24H:G10', 'LJP008_SKL.C_24H:G08', 'LJP008_CD34_24H:G12', 'LJP007_MCF7_24H:A03', 'LJP008_NPC_24H:G08', 'LJP008_SKL.C_24H:A03', 'LJP008_HEPG2_24H:G09', 'LJP008_HT29_24H:A03', 'LJP008_HA1E_24H:A03', 'LJP008_NPC_24H:G12', 'LJP008_A375_24H:G11', 'LJP009_CD34_24H:A03', 'LJP007_HME1_24H:A03', 'LJP009_MCF7_24H:A03', 'LJP008_A549_24H:G07', 'LJP008_NEU_24H:G12', 'LJP007_HT29_24H:A03', 'LJP008_HUVEC_24H:G08', 'LJP008_HUVEC_24H:A03', 'LJP008_A375_24H:G08', 'LJP008_HT29_24H:G10', 'LJP008_NPC.CAS9_24H:G11', 'LJP008_A375_24H:G09', 'LJP008_NEU_24H:G07', 'LJP008_SKL.C_24H:G10', 'LJP008_NEU_24H:A03', 'LJP009_NPC.CAS9_24H:A03', 'LPROT002_A549_6H:O09', 'LJP008_CD34_24H:G11', 'LJP008_NPC.CAS9_24H:G12', 
'LJP009_ASC_24H:A03', 'LJP008_ASC_24H:G09', 'LJP008_HA1E_24H:G08', 'LJP008_SKL_24H:G07', 'LPROT001_MCF7_6H:O11', 'LJP008_A375_24H:A03', 'LJP008_CD34_24H:G07', 'LJP008_NPC.TAK_24H:G08', 'LPROT001_MCF7_6H:O07', 'LJP008_ASC_24H:A03', 'LJP008_PC3_24H:G10', 'LPROT001_A375_6H:P07', 'LPROT003_A375_6H:P10', 'LJP009_ASC.C_24H:A03', 'LPROT002_NPC.TAK_6H:O10', 'LJP009_SKL_24H:A03', 'LJP008_HT29_24H:G08', 'LJP008_PC3_24H:G09', 'LJP008_HCC515_24H:G08', 'LJP008_HME1_24H:G07', 'LJP008_SKL.C_24H:G07', 'LJP008_ASC.C_24H:G07', 'LJP008_ASC.C_24H:G09', 'LJP008_A375_24H:G12', 'LPROT003_NPC_6H:P09', 'LJP008_HT29_24H:G09', 'LPROT001_MCF7_6H:O09', 'LJP009_HA1E_24H:A03', 'LPROT003_PC3_6H:O07', 'LJP008_CD34_24H:A03', 'LJP007_A549_24H:A03', 'LJP008_HA1E_24H:G11', 'LJP007_HUES3_24H:A03', 'LPROT002_A375_6H:P07', 'LJP008_CD34_24H:G08', 'LJP008_MCF7_24H:G11', 'LJP008_A549_24H:G08', 'LJP009_HEPG2_24H:A03', 'LPROT001_PC3_6H:P08', 'LPROT003_NPC_6H:P07', 'LJP008_HME1_24H:G10', 'LJP007_SKL_24H:A03', 'LJP008_HA1E_24H:G10', 'LJP008_PC3_24H:G12', 'LJP008_SKL_24H:G09', 'LPROT001_PC3_6H:P12', 'LJP008_ASC_24H:G07', 'LPROT002_A375_6H:P11', 'LPROT003_A375_6H:P12', 'LJP008_NPC.TAK_24H:G11', 'LJP009_HUVEC_24H:A03', 'LJP009_HME1_24H:A03', 'LJP008_HCC515_24H:G12', 'LJP007_MNEU.E_24H:A03', 'LJP008_SKL_24H:G12', 'LJP008_A375_24H:G10', 'LJP009_NPC_24H:A03', 'LJP008_CD34_24H:G09', 'LJP008_HME1_24H:G09', 'LJP008_NEU_24H:G10', 'LJP008_MCF7_24H:G10', 'LJP008_A549_24H:A03', 'LJP008_HEPG2_24H:A03', 'LJP008_HME1_24H:G08', 'LJP008_NPC_24H:G07', 'LJP008_NPC.CAS9_24H:G08', 'LPROT002_MCF7_6H:P08', 'LJP008_NPC_24H:G09', 'LPROT001_A375_6H:P09', 'LJP008_ASC.C_24H:G08', 'LJP009_PC3_24H:A03', 'LJP008_HT29_24H:G11', 'LJP008_MCF7_24H:A03', 'LJP007_ASC_24H:A03', 'LJP008_NPC.CAS9_24H:G07', 'LPROT002_A549_6H:O07', 'LJP009_NPC.TAK_24H:A03', 'LJP007_NPC.TAK_24H:A03', 'LJP008_HEPG2_24H:G12', 'LJP008_NPC.CAS9_24H:G10', 'LPROT002_NPC.TAK_6H:O12', 'LJP008_NPC.TAK_24H:G10', 'LJP008_SKL_24H:A03', 'LJP008_SKL.C_24H:G11', 
'LPROT001_NPC.TAK_6H:O10', 'LJP008_HCC515_24H:G11', 'LJP008_SKL.C_24H:G12', 'LJP008_ASC.C_24H:G12', 'LJP008_NPC_24H:A03', 'LJP007_NPC_24H:A03', 'LJP008_NPC.TAK_24H:G12', 'LPROT002_A549_6H:O11', 'LJP008_NPC.TAK_24H:A03', 'LJP008_HME1_24H:G11', 'LJP007_ASC.C_24H:A03', 'LJP008_MCF7_24H:G08', 'LJP007_HA1E_24H:A03', 'LJP008_MCF7_24H:G09', 'LJP008_ASC.C_24H:A03', 'LJP008_SKL_24H:G11', 'LJP008_A549_24H:G12', 'LPROT003_PC3_6H:O09', 'LJP007_HUVEC_24H:A03', 'LJP008_NPC_24H:G11', 'LPROT003_A549_6H:O10', 'LJP008_NPC.TAK_24H:G09', 'LJP008_HUVEC_24H:G09', 'LPROT001_NPC.TAK_6H:O08', 'LJP007_NEU_24H:A03', 'LJP008_NPC_24H:G10', 'LJP008_HA1E_24H:G09', 'LJP008_HEPG2_24H:G07', 'LJP008_A375_24H:G07', 'LJP008_MCF7_24H:G12', 'LJP008_NPC.TAK_24H:G07', 'LJP008_HEPG2_24H:G10', 'LPROT001_NPC.TAK_6H:O12', 'LJP007_JURKAT_24H:A03', 'LJP009_A549_24H:A03', 'LJP007_PC3_24H:A03', 'LPROT002_A375_6H:P09', 'LPROT002_NPC.TAK_6H:O08', 'LJP007_NPC.CAS9_24H:A03', 'LPROT002_MCF7_6H:P10', 'LJP008_HA1E_24H:G12', 'LJP009_NEU_24H:A03', 'LJP008_CD34_24H:G10', 'LJP007_HCC515_24H:A03', 'LJP008_ASC_24H:G12', 'LJP008_A549_24H:G10'}
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-f03c31e62771> in <module>()
      2 vorinostat_only_gctoo = parse(
      3     "GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328.gctx",   # updated file name
----> 4     cid=vorinostat_ids)

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse.py in parse(file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only, make_multiindex)
     60     elif file_path.endswith(".gctx"):
     61         curr = parse_gctx.parse(file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only,
---> 62                                 make_multiindex)
     63     else:
     64         err_msg = "File to parse must be .gct or .gctx!"

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in parse(gctx_file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only, make_multiindex)
    101 
    102         # validate optional input ids & get indexes to subset by
--> 103         (sorted_ridx, sorted_cidx) = check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta, col_meta)
    104 
    105         data_dset = gctx_file[data_node]

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta_df, col_meta_df)
    140     ordered_ridx = get_ordered_idx(row_type, row_ids, row_meta_df)
    141 
--> 142     col_ids = check_and_convert_ids(col_type, col_ids, col_meta_df)
    143     ordered_cidx = get_ordered_idx(col_type, col_ids, col_meta_df)
    144     return (ordered_ridx, ordered_cidx)

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_and_convert_ids(id_type, id_list, meta_df)
    173         if id_type == "id":
    174             id_list = convert_ids_to_meta_type(id_list, meta_df)
--> 175             check_id_validity(id_list, meta_df)
    176         else:
    177             check_idx_validity(id_list, meta_df)

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_id_validity(id_list, meta_df)
    189             mismatch_ids)
    190         logger.error(msg)
--> 191         raise Exception("parse_gctx check_id_validity " + msg)
    192 
    193 

Exception: parse_gctx check_id_validity some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids:  {'LJP009_HT29_24H:A03', 'LJP007_SKL.C_24H:A03', 'LJP008_PC3_24H:G08', 'LJP008_HCC515_24H:G07', 'LJP008_HA1E_24H:G07', 'LJP008_ASC_24H:G10', 'LJP008_NPC.CAS9_24H:A03', 'LJP008_SKL_24H:G08', 'LPROT003_PC3_6H:O11', 'LJP008_A549_24H:G09', 'LJP008_HCC515_24H:G10', 'LJP008_MCF7_24H:G07', 'LJP008_PC3_24H:G11', 'LJP008_HUVEC_24H:G11', 'LJP008_PC3_24H:A03', 'LJP009_A375_24H:A03', 'LJP008_ASC_24H:G11', 'LJP008_A549_24H:G11', 'LJP008_HEPG2_24H:G11', 'LJP008_HT29_24H:G07', 'LPROT003_A549_6H:O12', 'LJP008_HUVEC_24H:G07', 'LJP008_HUVEC_24H:G10', 'LJP008_HME1_24H:G12', 'LJP007_A375_24H:A03', 'LJP008_SKL.C_24H:G09', 'LJP008_NPC.CAS9_24H:G09', 'LJP007_HEPG2_24H:A03', 'LJP007_CD34_24H:A03', 'LPROT003_NPC_6H:P11', 'LJP008_HT29_24H:G12', 'LPROT001_A375_6H:P11', 'LJP008_HUVEC_24H:G12', 'LJP008_PC3_24H:G07', 'LJP008_ASC.C_24H:G11', 'LJP008_NEU_24H:G11', 'LJP008_SKL_24H:G10', 'LPROT003_A375_6H:P08', 'LPROT002_MCF7_6H:P12', 'LJP008_NEU_24H:G08', 'LJP008_HCC515_24H:G09', 'LJP008_ASC_24H:G08', 'LJP008_HME1_24H:A03', 'LJP008_NEU_24H:G09', 'LPROT001_PC3_6H:P10', 'LJP008_HEPG2_24H:G08', 'LJP008_HCC515_24H:A03', 'LJP009_SKL.C_24H:A03', 'LPROT003_A549_6H:O08', 'LJP009_HCC515_24H:A03', 'LJP008_ASC.C_24H:G10', 'LJP008_SKL.C_24H:G08', 'LJP008_CD34_24H:G12', 'LJP007_MCF7_24H:A03', 'LJP008_NPC_24H:G08', 'LJP008_SKL.C_24H:A03', 'LJP008_HEPG2_24H:G09', 'LJP008_HT29_24H:A03', 'LJP008_HA1E_24H:A03', 'LJP008_NPC_24H:G12', 'LJP008_A375_24H:G11', 'LJP009_CD34_24H:A03', 'LJP007_HME1_24H:A03', 'LJP009_MCF7_24H:A03', 'LJP008_A549_24H:G07', 'LJP008_NEU_24H:G12', 'LJP007_HT29_24H:A03', 'LJP008_HUVEC_24H:G08', 'LJP008_HUVEC_24H:A03', 'LJP008_A375_24H:G08', 'LJP008_HT29_24H:G10', 'LJP008_NPC.CAS9_24H:G11', 'LJP008_A375_24H:G09', 'LJP008_NEU_24H:G07', 'LJP008_SKL.C_24H:G10', 'LJP008_NEU_24H:A03', 'LJP009_NPC.CAS9_24H:A03', 'LPROT002_A549_6H:O09', 
'LJP008_CD34_24H:G11', 'LJP008_NPC.CAS9_24H:G12', 'LJP009_ASC_24H:A03', 'LJP008_ASC_24H:G09', 'LJP008_HA1E_24H:G08', 'LJP008_SKL_24H:G07', 'LPROT001_MCF7_6H:O11', 'LJP008_A375_24H:A03', 'LJP008_CD34_24H:G07', 'LJP008_NPC.TAK_24H:G08', 'LPROT001_MCF7_6H:O07', 'LJP008_ASC_24H:A03', 'LJP008_PC3_24H:G10', 'LPROT001_A375_6H:P07', 'LPROT003_A375_6H:P10', 'LJP009_ASC.C_24H:A03', 'LPROT002_NPC.TAK_6H:O10', 'LJP009_SKL_24H:A03', 'LJP008_HT29_24H:G08', 'LJP008_PC3_24H:G09', 'LJP008_HCC515_24H:G08', 'LJP008_HME1_24H:G07', 'LJP008_SKL.C_24H:G07', 'LJP008_ASC.C_24H:G07', 'LJP008_ASC.C_24H:G09', 'LJP008_A375_24H:G12', 'LPROT003_NPC_6H:P09', 'LJP008_HT29_24H:G09', 'LPROT001_MCF7_6H:O09', 'LJP009_HA1E_24H:A03', 'LPROT003_PC3_6H:O07', 'LJP008_CD34_24H:A03', 'LJP007_A549_24H:A03', 'LJP008_HA1E_24H:G11', 'LJP007_HUES3_24H:A03', 'LPROT002_A375_6H:P07', 'LJP008_CD34_24H:G08', 'LJP008_MCF7_24H:G11', 'LJP008_A549_24H:G08', 'LJP009_HEPG2_24H:A03', 'LPROT001_PC3_6H:P08', 'LPROT003_NPC_6H:P07', 'LJP008_HME1_24H:G10', 'LJP007_SKL_24H:A03', 'LJP008_HA1E_24H:G10', 'LJP008_PC3_24H:G12', 'LJP008_SKL_24H:G09', 'LPROT001_PC3_6H:P12', 'LJP008_ASC_24H:G07', 'LPROT002_A375_6H:P11', 'LPROT003_A375_6H:P12', 'LJP008_NPC.TAK_24H:G11', 'LJP009_HUVEC_24H:A03', 'LJP009_HME1_24H:A03', 'LJP008_HCC515_24H:G12', 'LJP007_MNEU.E_24H:A03', 'LJP008_SKL_24H:G12', 'LJP008_A375_24H:G10', 'LJP009_NPC_24H:A03', 'LJP008_CD34_24H:G09', 'LJP008_HME1_24H:G09', 'LJP008_NEU_24H:G10', 'LJP008_MCF7_24H:G10', 'LJP008_A549_24H:A03', 'LJP008_HEPG2_24H:A03', 'LJP008_HME1_24H:G08', 'LJP008_NPC_24H:G07', 'LJP008_NPC.CAS9_24H:G08', 'LPROT002_MCF7_6H:P08', 'LJP008_NPC_24H:G09', 'LPROT001_A375_6H:P09', 'LJP008_ASC.C_24H:G08', 'LJP009_PC3_24H:A03', 'LJP008_HT29_24H:G11', 'LJP008_MCF7_24H:A03', 'LJP007_ASC_24H:A03', 'LJP008_NPC.CAS9_24H:G07', 'LPROT002_A549_6H:O07', 'LJP009_NPC.TAK_24H:A03', 'LJP007_NPC.TAK_24H:A03', 'LJP008_HEPG2_24H:G12', 'LJP008_NPC.CAS9_24H:G10', 'LPROT002_NPC.TAK_6H:O12', 'LJP008_NPC.TAK_24H:G10', 
'LJP008_SKL_24H:A03', 'LJP008_SKL.C_24H:G11', 'LPROT001_NPC.TAK_6H:O10', 'LJP008_HCC515_24H:G11', 'LJP008_SKL.C_24H:G12', 'LJP008_ASC.C_24H:G12', 'LJP008_NPC_24H:A03', 'LJP007_NPC_24H:A03', 'LJP008_NPC.TAK_24H:G12', 'LPROT002_A549_6H:O11', 'LJP008_NPC.TAK_24H:A03', 'LJP008_HME1_24H:G11', 'LJP007_ASC.C_24H:A03', 'LJP008_MCF7_24H:G08', 'LJP007_HA1E_24H:A03', 'LJP008_MCF7_24H:G09', 'LJP008_ASC.C_24H:A03', 'LJP008_SKL_24H:G11', 'LJP008_A549_24H:G12', 'LPROT003_PC3_6H:O09', 'LJP007_HUVEC_24H:A03', 'LJP008_NPC_24H:G11', 'LPROT003_A549_6H:O10', 'LJP008_NPC.TAK_24H:G09', 'LJP008_HUVEC_24H:G09', 'LPROT001_NPC.TAK_6H:O08', 'LJP007_NEU_24H:A03', 'LJP008_NPC_24H:G10', 'LJP008_HA1E_24H:G09', 'LJP008_HEPG2_24H:G07', 'LJP008_A375_24H:G07', 'LJP008_MCF7_24H:G12', 'LJP008_NPC.TAK_24H:G07', 'LJP008_HEPG2_24H:G10', 'LPROT001_NPC.TAK_6H:O12', 'LJP007_JURKAT_24H:A03', 'LJP009_A549_24H:A03', 'LJP007_PC3_24H:A03', 'LPROT002_A375_6H:P09', 'LPROT002_NPC.TAK_6H:O08', 'LJP007_NPC.CAS9_24H:A03', 'LPROT002_MCF7_6H:P10', 'LJP008_HA1E_24H:G12', 'LJP009_NEU_24H:A03', 'LJP008_CD34_24H:G10', 'LJP007_HCC515_24H:A03', 'LJP008_ASC_24H:G12', 'LJP008_A549_24H:G10'}
tstoeger commented 6 years ago

I had overlooked the need for a very specific Python 2.7 environment (outlined in https://clue.io/cmapPy/build.html#install). That requirement goes beyond the information provided in the README, and it is inconsistent with the tutorial: it leads to the setup of a cmapPy version that requires parse.parse() instead of parse().

To add to the confusion, the file names had changed between the tutorial and the public version of GSE70138 (which could have opened the possibility of a change in the file format).

oena commented 6 years ago

Hi @tstoeger, sorry you had difficulties in using the tutorial. If you have suggestions as to how to make installation instructions more clear, feel free to let us know; the README currently links out to ReadTheDocs in order to help us keep documentation in a centralized place and (hopefully) up to date.

Regarding the tutorial, I'll update the inconsistencies regarding use of parse methods. With regard to scope, we definitely hope to add more tutorials in the future, but for the time being only have one with GEO data because we guessed that would be the most common use case for the package. Just for the record--should you want to investigate error messages/bugs without dealing with external datasets in the future--we do already have a variety of files used for testing to disambiguate code vs. file issues; these are located in cmapPy/cmapPy/pandasGEXpress/tests/functional_tests.

tstoeger commented 6 years ago

Hi @oena, let me first thank you, both for your inquiry and for the existing cmapPy documentation, which has already been very useful. The tutorial is indeed a very nice extra.

My troubles arose from running into several slightly different problems and noticing that at least three distinct aspects seemed to have changed: the version of the dataset used, something related to external Python code, and something related to cmapPy. Since I take the tutorial as the reference, this hinted that I was overlooking something, but I could not tell for sure which aspect I should trust or follow.

Possibly, the tutorial could:

oena commented 6 years ago

Those points all seem very reasonable to me, thanks! I'll see what we can do to address them better than we do currently.

benanbardak commented 5 years ago

Hi, although I'm using Python 2.7, I get the same error that you received above: "Exception: parse_gctx check_id_validity ... not present in the metadata for the file being parsed - mismatch_ids: ...". The file I'm trying to run is from GSE92742. I would appreciate it if you could tell me how you solved the problem above.

tstoeger commented 5 years ago

I made a Python 3 compatible version of cmapPy; credit for identifying the critical section goes to @heltena.

In my usage scenario, a single-line addition was sufficient.

curr_dset.read_direct(temp_array)
temp_array = np.core.defchararray.decode(temp_array, 'utf8')  # <- introduced for Python3 compatibility
header_values[str(k)] = temp_array

My usage scenario was restricted to gctx files, which simplifies the problem of Python 3 compatibility. I didn't check the definition of gctx regarding the future compatibility of this encoding. I have only constructed tests with GSE92742 Level 5, and I additionally bypassed GCToo instances as output, since I have only ever used the data frame contained within them (hence, I did not check their creation for compatibility with Python 3). The above covers my usage of cmapPy.
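The decode line above suggests that, under Python 3, ids read from the file can arrive as byte strings while the ids used for subsetting are ordinary text, so none of them compare equal. The following is a minimal sketch of that effect and of a check that normalises both sides before comparing. It uses only the standard library; the helper names `to_text` and `partition_ids` are hypothetical, the byte-string ids are an assumed stand-in for what the file returns, and `NOT_IN_FILE:X00` is a made-up id.

```python
def to_text(value, encoding="utf8"):
    """Return value as str, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def partition_ids(requested, available, encoding="utf8"):
    """Split requested ids into (found, missing) after normalising both sides to str."""
    available_text = {to_text(v, encoding) for v in available}
    found, missing = [], []
    for raw in requested:
        text = to_text(raw, encoding)
        (found if text in available_text else missing).append(text)
    return found, missing


# Assumed stand-in for ids read from a .gctx file under Python 3 (bytes).
ids_in_file = [b"LJP007_A375_24H:A03", b"LJP007_A549_24H:A03"]
# Ids as they come out of a text table (str); the last one is made up.
ids_wanted = ["LJP007_A375_24H:A03", "LJP007_A549_24H:A03", "NOT_IN_FILE:X00"]

# A naive comparison finds nothing, because bytes and str never compare equal in Python 3.
naive_found = [i for i in ids_wanted if i in ids_in_file]

found, missing = partition_ids(ids_wanted, ids_in_file)
print("naive matches:", len(naive_found))
print("matches after decoding:", len(found), "missing:", missing)
```

If every requested id shows up as missing in the naive comparison but is found after decoding, a bytes/text mismatch is a more likely cause than a changed data set.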

saksham219 commented 5 years ago

Hi @benanbardak, it would be helpful if you could mention which file from GEO you are using to read in the metadata. There are five files given here.

benanbardak commented 5 years ago

Firstly, thank you for your response. I am using "GSE92742_Broad_LINCS_Level3_INF_mlr12k_n1319138x12328.gctx.gz". But I get the error "Exception: parse_gctx check_id_validity some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids:.."

saksham219 commented 5 years ago

That is a 48 GB file, so it will take me some time to download it. I tried another file from the same series, "GSE92742_Broad_LINCS_Level2_GEX_delta_n49216x978.gctx.gz", and metadata parsing works in Python 2. If you try this file and it fails, then the issue might be with your version of cmapPy. If it does not fail with this smaller file, it might be that the 48 GB file has something different going on that the package is not able to handle.
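The file names quoted here end in .gctx.gz, while the traceback earlier in this thread shows parse() dispatching on the .gct or .gctx suffix. If a download has not been decompressed yet, a sketch like the following could do it with the standard library. The helper name `gunzip` and the demo file name are hypothetical, and the demo writes placeholder bytes, not a real gctx file.

```python
import gzip
import os
import shutil
import tempfile


def gunzip(src_path, dest_path=None):
    """Decompress src_path (a .gz file) and return the path of the result."""
    if dest_path is None:
        if not src_path.endswith(".gz"):
            raise ValueError("expected a path ending in .gz: " + src_path)
        dest_path = src_path[:-len(".gz")]
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # Copy in chunks so a multi-gigabyte file is never held in memory at once.
        shutil.copyfileobj(src, dest)
    return dest_path


# Self-contained demonstration with a tiny placeholder file.
work_dir = tempfile.mkdtemp()
demo_gz = os.path.join(work_dir, "demo_n2x3.gctx.gz")
with gzip.open(demo_gz, "wb") as handle:
    handle.write(b"placeholder bytes, not a real gctx file")

demo_out = gunzip(demo_gz)
print(demo_out.endswith(".gctx"))
```

For the real data, the returned path (ending in .gctx) would be the one to hand to parse().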

benanbardak commented 5 years ago

And please check your email, @tstoeger.

benanbardak commented 5 years ago

@saksham219 I tried to run the tutorial with this data, "GSE92742_Broad_LINCS_Level2_GEX_delta_n49216x978.gctx.gz", but I get the same error again. How can I solve this problem? What does "the issue might be with your version of cmapPy" mean? How can I fix my version of cmapPy? Thank you so much.

saksham219 commented 5 years ago

@benanbardak What I mean is that you might not be using the latest version on the master branch of this repo. You can try running this from the terminal:

$ git clone https://github.com/cmap/cmapPy
$ pip install cmapPy/

and then try reading the file again in a new Python environment.

If the problem still persists, it would be helpful if you could list the versions of the packages in your Python environment with $ pip freeze
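As a narrower alternative to the full pip freeze listing, the versions of just the packages that appear in this thread can be collected from within Python. This is a sketch, assuming Python 3.8 or later (where `importlib.metadata` is part of the standard library); the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages that appear in this thread's code and traceback.
report = installed_versions(["cmapPy", "h5py", "numpy", "pandas"])
for name, version in report.items():
    print(name, version if version is not None else "not installed")
```

Pasting these few lines of output into a report makes it easier to compare environments than a complete package listing.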