Explanation of 'L1000 Characteristic Direction Up and Down Gene Sets (Level 5)'

songsong0425 commented 1 year ago

Dear MaayanLab team,

Greetings, I hope this question finds you well. This is neither an issue report nor a feature suggestion; I just have a simple question about the data. I see that LINCS L1000 Chemical Perturbations (2021).gmt in the L1000 Characteristic Direction Up and Down Gene Sets (Level 5) section contains gene sets for compounds, but I'm not sure I understand the first column correctly. For example, do the roughly 250 genes in the first row mean that 10uM of afatinib up-regulates the genes from RRAGC to HMOX1? If so, why are there only 250 genes in the gene set per compound? Also, what does the second column mean?

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| ABY001_A375_XH_A13_afatinib_10uM up | NaN | RRAGC | VPS8 | KCNJ2 | DUSP5 | BEX1 | LIMS1 | RAB13 | COLEC12 | ... | GEM | BEX3 | VEGFA | NUPR1 | TSPAN6 | CA12 | TMEM158 | CHI3L1 | CDC20 | HMOX1 |
| ABY001_A375_XH_A13_afatinib_10uM down | NaN | PCNA | S100A7 | DNMT1 | S100P | S100A9 | PUF60 | TMEM45A | GOLT1B | ... | AKAP12 | SCEL | CTSC | EAF2 | MORF4L1 | KCNK3 | MYB | MAF | LTF | MFNG |
| ABY001_A375_XH_A14_erlotinib_10uM up | NaN | CDKN2C | TIMP3 | COLGALT2 | AMPD3 | TGFB1 | SERPINB3 | MMP7 | PIGR | ... | RHOBTB3 | TMSB15A | RPS4Y1 | JCHAIN | PLAT | FAM129A | ASNS | PCNA | TNFSF10 | CHAC1 |
| ABY001_A375_XH_A14_erlotinib_10uM down | NaN | PCP4 | SPINK1 | STEAP1 | HOXC6 | ITGB1BP1 | MRPS16 | XIST | UCHL1 | ... | PEG3 | WBP1L | SCG5 | ATP5F1E | CCL19 | EGLN1 | MAST4 | ATP6V1H | GPX2 | EBP |

Thank you for reading!

sxie04 commented 1 year ago

Hello @songsong0425, thank you for your question. Your understanding is correct, in that the first row represents the 250 genes that were most up-regulated in that particular signature (A375 cell line treated with 10uM of afatinib in the ABY001 batch of experiments), and the second row represents the 250 genes that were most down-regulated in that signature. The rest of the gene sets should all follow this pattern.

The method we used to compute the signatures was the characteristic direction method, which does not have a standard measure of significance for gene-specific differential expression. Therefore, we chose to use a cutoff of 250 for all of the gene sets, based on the characteristic direction coefficients.

The second column of a GMT file is optionally used for descriptions of gene sets, which we chose not to include here to reduce the file size.
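
For readers who want to load the file themselves, here is a minimal sketch of reading the GMT layout described above: tab-separated rows with the gene set name first, an optional description second, and the genes from the third field onward. The function names `parse_gmt_line` and `parse_gmt` are hypothetical, the example text is a shortened stand-in for the real file, and only the Python standard library is used.

```python
from typing import Dict, List, Tuple


def parse_gmt_line(line: str) -> Tuple[str, str, List[str]]:
    """Split one GMT row into (set name, description, genes)."""
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    # The description field may be empty, as in this dataset.
    description = fields[1] if len(fields) > 1 else ""
    # Drop empty strings left by trailing tabs.
    genes = [gene for gene in fields[2:] if gene]
    return name, description, genes


def parse_gmt(text: str) -> Dict[str, List[str]]:
    """Map each gene set name to its gene list, keeping file order."""
    gene_sets: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _description, genes = parse_gmt_line(line)
        gene_sets[name] = genes
    return gene_sets


# Shortened stand-in rows (three genes each instead of 250).
example = (
    "ABY001_A375_XH_A13_afatinib_10uM up\t\tRRAGC\tVPS8\tKCNJ2\n"
    "ABY001_A375_XH_A13_afatinib_10uM down\t\tPCNA\tS100A7\tDNMT1\n"
)
gene_sets = parse_gmt(example)
print(gene_sets["ABY001_A375_XH_A13_afatinib_10uM up"])
```

The empty description explains why a table view of the file shows NaN in the second column.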

songsong0425 commented 1 year ago

Thank you for your kind reply! I now fully understand the table. Have a nice day!

songsong0425 commented 1 year ago

Sorry for the additional question, but I have one more question about the dataset. Are the genes from the third column onward sorted in descending order? Is it okay to take the first 100 genes per row, assuming they are the highest-ranked ones?

sxie04 commented 1 year ago

Hi @songsong0425, the GMT format does not really store magnitude/value information. It is intended to contain "binary" information, in the sense that a gene is either included or not included within a given gene set. So we can only guarantee that each gene set contains the top 250 relevant genes, not the order of those genes.

However, I do believe the gene sets were generated in characteristic direction (CD) coefficient order, depending on the direction of the gene set: the up gene sets were the top 250 when sorted in descending order, and the down gene sets were the top 250 when sorted in ascending order. You can try just taking the first 100 genes from each row under this assumption, but again, the gene set format does not guarantee the ordering.

The only way to be certain would be to download the full signature matrix (see the L1000 Characteristic Direction Coefficient Tables (Level 5) section on the Downloads page) and re-compute the top 100 genes yourself from the signatures you are interested in.
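
As a rough sketch of that re-computation, the snippet below ranks one signature by its coefficients and takes the top genes in each direction. It assumes the signature has already been extracted from the coefficient table into a plain gene-to-coefficient dictionary; loading the .gctx file is not shown. The function name `top_genes`, the gene names, and the values are hypothetical placeholders, not real data.

```python
from typing import Dict, List, Tuple


def top_genes(
    coefficients: Dict[str, float], n: int = 100
) -> Tuple[List[str], List[str]]:
    """Return (up, down): the n genes with the largest coefficients
    and the n genes with the smallest coefficients."""
    # Descending sort: most positive coefficient first.
    ranked = sorted(coefficients, key=coefficients.get, reverse=True)
    up = ranked[:n]
    # Reverse the ranking so the most negative coefficient comes first.
    down = ranked[::-1][:n]
    return up, down


# Made-up coefficients for illustration only.
toy_signature = {
    "GENE_A": 0.12,
    "GENE_B": -0.30,
    "GENE_C": 0.05,
    "GENE_D": -0.01,
    "GENE_E": 0.0,
}
up, down = top_genes(toy_signature, n=2)
print(up)    # ['GENE_A', 'GENE_C']
print(down)  # ['GENE_B', 'GENE_D']
```

Because the ranking is computed from the coefficients directly, the result does not depend on the order in which genes happen to be stored in the GMT file.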

songsong0425 commented 1 year ago

Thank you for your answer! I'm having trouble using cmapPy to open the .gctx file, so I have asked the cmapPy team about it. By the way, to calculate the DEGs for compounds, do I need the normal samples? If so, where can I find them? I'm sorry for bothering you. :(

sxie04 commented 1 year ago

@songsong0425 The Level 5 L1000 Characteristic Direction Coefficient Tables are already computed DEG signatures. The values are the characteristic direction coefficients: positive values indicate a gene is up-regulated in the treatments compared to the controls, and negative values indicate down-regulation. Hope this helps!
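
To illustrate the sign convention, here is a small sketch that splits one signature into up- and down-regulated genes by the sign of each coefficient. The function name `split_by_sign`, the gene names, and the values are hypothetical and only serve as an example.

```python
from typing import Dict, Tuple


def split_by_sign(
    coefficients: Dict[str, float],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Separate genes with positive coefficients (up in treatment vs.
    control) from genes with negative coefficients (down)."""
    up = {gene: value for gene, value in coefficients.items() if value > 0}
    down = {gene: value for gene, value in coefficients.items() if value < 0}
    return up, down


# Made-up coefficients for illustration only.
toy_signature = {"GENE_A": 0.12, "GENE_B": -0.30, "GENE_C": 0.0}
up_genes, down_genes = split_by_sign(toy_signature)
print(sorted(up_genes))    # ['GENE_A']
print(sorted(down_genes))  # ['GENE_B']
```

A gene with a coefficient of exactly zero falls into neither group in this sketch.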

songsong0425 commented 1 year ago

Thank you for your kind reply! It was a precious chance for me to have a discussion about the LINCS L1000 dataset. Have a nice day! :)

MaayanLab / sigcom-lincs, issue #76