cmap / cmapPy

Assorted tools for interacting with .gct, .gctx files and other Connectivity Map (Broad Institute) data/tools
https://clue.io/cmapPy/index.html
BSD 3-Clause "New" or "Revised" License

mismatch between metadata and gctx #62

Open ElyasMo opened 4 years ago

ElyasMo commented 4 years ago

I am trying to parse the:

1-GSE70138_Broad_LINCS_Level3_INF_mlr12k_n345976x12328_2017-03-06.gctx.gz
2-GSE70138_Broad_LINCS_Level3_INF_mlr12k_n78980x22268_2015-06-30.gct.gz
3-GSE70138_Broad_LINCS_Level4_ZSPCINF_mlr12k_n113012x22268_2015-12-31.gct.gz

files with:

1-GSE70138_Broad_LINCS_sig_info_2017-03-06.txt.gz or
2-GSE70138_Broad_LINCS_inst_info_2017-03-06.txt.gz

metadata files. I want to subset the data so that it is small enough to process easily.

import pandas as pd
from cmapPy.pandasGEXpress.parse import parse

sig_info = pd.read_csv("GSE70138_Broad_LINCS_sig_info_2017-03-06.txt", sep="\t")

# Keep MCF7 signatures treated at 10.0 um for 24 h
mask = (
    (sig_info["cell_id"] == "MCF7")
    & (sig_info["pert_idose"] == "10.0 um")
    & (sig_info["pert_itime"] == "24 h")
)
mcf7_cell = sig_info["pert_id"][mask]

MCF7_details = parse("GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx", cid=mcf7_cell)

Each time I do this with GSE70138_Broad_LINCS_Level3_INF_mlr12k_n345976x12328_2017-03-06.gctx.gz, I get this error:

some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids: {'neratinib'}
Traceback (most recent call last):
  File "", line 1, in
  File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse.py", line 65, in parse
    out = parse_gctx.parse(file_path, convert_neg_666=convert_neg_666,
  File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 107, in parse
    (sorted_ridx, sorted_cidx) = check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta, col_meta)
  File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 146, in check_and_order_id_inputs
    col_ids = check_and_convert_ids(col_type, col_ids, col_meta_df)
  File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 179, in check_and_convert_ids
    check_id_validity(id_list, meta_df)
  File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 195, in check_id_validity
    raise Exception("parse_gctx check_id_validity " + msg)
Exception: parse_gctx check_id_validity some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids: {'neratinib'}

How can I fix this problem?
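
The error message says that the ids passed as cid are not among the column ids of the file being parsed. A minimal sketch of one way to check for this before parsing is shown here. It assumes the matrix columns are keyed by an id column such as sig_id rather than by perturbagen ids or names. The tiny table, the ids in it, and the helper names select_ids and split_by_presence are hypothetical stand-ins, and the sketch uses only the standard library so that it runs without the data files.

```python
import csv
import io

# Hypothetical miniature stand-in for the tab-separated sig_info table.
SIG_INFO_TSV = (
    "sig_id\tpert_id\tpert_iname\tcell_id\tpert_idose\tpert_itime\n"
    "SIG_A\tPERT_1\tneratinib\tMCF7\t10.0 um\t24 h\n"
    "SIG_B\tPERT_1\tneratinib\tMCF7\t1.0 um\t24 h\n"
    "SIG_C\tPERT_2\tcompound_x\tA375\t10.0 um\t24 h\n"
    "SIG_D\tPERT_2\tcompound_x\tMCF7\t10.0 um\t24 h\n"
)


def select_ids(tsv_text, id_column, **conditions):
    """Return id_column for every row whose fields equal all the conditions."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row[id_column]
        for row in reader
        if all(row[key] == value for key, value in conditions.items())
    ]


def split_by_presence(requested_ids, file_column_ids):
    """Split requested ids into those present in the file and those missing."""
    available = set(file_column_ids)
    present = [i for i in requested_ids if i in available]
    missing = [i for i in requested_ids if i not in available]
    return present, missing


# Hypothetical column ids of the matrix file being subset.
file_cols = ["SIG_A", "SIG_B", "SIG_C", "SIG_D"]

# Select by the id column that keys the matrix columns (assumed: sig_id).
wanted = select_ids(
    SIG_INFO_TSV, "sig_id",
    cell_id="MCF7", pert_idose="10.0 um", pert_itime="24 h",
)
present, missing = split_by_presence(wanted, file_cols)
print("present:", present)
print("missing:", missing)
```

With the real data, only the ids in present would be passed as cid. As far as I know, cmapPy's parse accepts col_meta_only=True to read just the column metadata, which could supply the real counterpart of file_cols. It may also be worth checking which metadata table matches the file: the Level 3 and Level 4 matrices are probably keyed by the ids in the inst_info file rather than the sig_info file.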