Closed logstar closed 3 years ago
@jharenza I have a few questions about the specific implementations of updating the annotator for v7.

Should I only require the columns that need to be joined by the input table and annotation table? Gene_symbol, Gene_Ensembl_ID, and Disease are required in the input table, even if they are not used for joining annotation tables, in order to ensure that all tables have the required columns. However, it seems that SNV, CNV, and fusion tables will not need the GTEx columns, as the tables are created only using tumor samples.

- gtex_subgroup -> uberon_code
- gtex_subgroup -> efo_code
- gtex_subgroup -> uberon_description_gtex_subgroup
- gtex_group -> uberon_code

- GTEx_tissue_subgroup for gtex_subgroup
- UBERON for uberon_code, in order to be consistent with previous EFO and MONDO columns in the current annotator interface
- EFO for efo_code
- UBERON_description_GTEx_tissue_subgroup for uberon_description_gtex_subgroup
- GTEx_tissue_group for gtex_group
Hi @logstar

> Should I only require the columns that need to be joined by the input table and annotation table?

Yes, this would be good.

> Gene_symbol, Gene_Ensembl_ID, and Disease

We can have these as column names in each of the modules to make this easier for annotator use.

> UBERON_description_GTEx_tissue_subgroup for uberon_description_gtex_subgroup

This is not necessary. I left it in there for tracking/historical purposes from @sangeetashukla's searching, so please do not add it.

The rest looks good to me!
Thank you for the reply. I will update accordingly.
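The agreed approach, requiring only the input columns that an annotation is actually joined on, could be sketched roughly as below. This is a minimal illustration, not the annotator's actual code: the lookup `join_column_for` and the functions `required_input_columns` and `check_input_columns` are hypothetical names, and the join column assumed for each annotation is a guess based on this thread.

```r
# Hypothetical lookup: annotation column -> input column needed for the join.
# The pairs below are assumptions for illustration only.
join_column_for <- c(
  EFO = "Disease",
  MONDO = "Disease",
  GTEx_tissue_subgroup_UBERON = "GTEx_tissue_subgroup"
)

# Return the input columns required for the requested annotations.
required_input_columns <- function(columns_to_add) {
  unknown <- setdiff(columns_to_add, names(join_column_for))
  if (length(unknown) > 0) {
    stop("Unknown annotation column(s): ", paste(unknown, collapse = ", "))
  }
  unique(unname(join_column_for[columns_to_add]))
}

# Stop early if the input table lacks any column needed for the join.
check_input_columns <- function(input_table, columns_to_add) {
  missing_cols <- setdiff(
    required_input_columns(columns_to_add),
    colnames(input_table)
  )
  if (length(missing_cols) > 0) {
    stop(
      "Input table is missing column(s): ",
      paste(missing_cols, collapse = ", ")
    )
  }
  invisible(TRUE)
}
```

With a check like this, a tumor-only SNV table without GTEx columns would pass when only `EFO` or `MONDO` is requested, and would fail only if a GTEx annotation is requested.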
@jharenza I think I would need to have two EFO annotation columns, Disease_EFO and GTEx_tissue_subgroup_EFO, in order to handle table rows that have both Disease and GTEx_tissue_subgroup, like the DESeq module.
However, this would break the backward compatibility of annotation column names, so I would like to bring this up here before working on it. I will add a changelog section in the README.md to record this change.
> table rows that have both Disease and GTEx_tissue_subgroup like the DESeq module.

Should we ask about this in the #ot-portal-content channel in Slack? I.e., how do they want them in the JSONL files?
I agree. I wonder if the DESeq table header is available outside the box, as I do not have box access yet. Sorry if I missed the link anywhere before. I will check the header before asking in the #ot-portal-content channel in Slack.
I am cc-ing @sangeetashukla and @afarrel here to see whether they can provide it. I also asked @sangeetashukla to update her module with a sample output file.
@jharenza Thank you for checking.
Assuming there is no Excel header file, I referred to the latest DESeq PR, and the EFO column is only for cancer_group, as shown in the code below. I also assume that the UBERON code will be added using the annotator API, so it is not available in the current Final_Data_Table.
Final_Data_Table <- data.frame(
datasourceId <- paste(strsplit(histology_filtered[I],split="_")[[1]][1],"vs_GTex",sep="_"),
datatypeId <- "rna_expression",
cohort <- paste(unique(hist$cohort[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]),collapse=";",sep=";"),
gene_symbol <- rownames(Result),
gene_id <- ENSG_Hugo$ensg_id[match(rownames(Result),ENSG_Hugo$gene_symbol)],
RMTL <- ENSG_Hugo$rmtl[match(rownames(Result),ENSG_Hugo$gene_symbol)],
EFO <- ifelse(length(which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))) >= 1, EFO_MONDO$efo_code[which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))], "" ),
MONDO <- ifelse(length(which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))) >= 1,EFO_MONDO$mondo_code[which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))],""),
comparisonId <- gsub(" |/|;|:|\\(|)","_",paste(histology_filtered[I],GTEX_filtered[J],sep="_v_")),
cancer_group <- paste(unlist(strsplit(histology_filtered[I],split="_"))[-1],collapse=" "),
cancer_group_Count <- Cancer.Hist_Hits,
GTEx <- GTEX_filtered[J],
GTEx_Count <- GTEX_Hits,
cancer_group_MeanTpm <- Histology_MEAN_TPMs,
GTEx_MeanTpm <- GTEX_MEAN_TPMs,
Result, stringsAsFactors = FALSE
)#Final_Data_Table = data.frame(
(Link to the code)
Therefore, I think we could ask in the #ot-portal-content channel about whether an EFO code is needed for both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup.
@afarrel and @sangeetashukla, I was wondering if you have any suggestions on how the annotator API should be implemented for you to use in your code. Currently, the issue is that both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup have corresponding EFO codes, and I was wondering if I need to create both Disease_EFO and GTEx_tissue_subgroup_EFO annotation columns. This issue is also related to the expected attribute names in the DESeq JSONL file, so we might need to discuss it in the #ot-portal-content channel as well.
@jharenza I will work on adding v7 GTEx annotations without breaking backward compatibility, as it would take too much effort for all module developers to update, PR, and review. This ticket is also blocking all GTEx JSONL modules.
I will keep using EFO for the cancer_group EFO code and add GTEx_tissue_subgroup_EFO as a new annotation column. If the input table has both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup, adding EFO will only add the cancer_group EFO code; GTEx_tissue_subgroup_EFO must also be specified to add the GTEx one.
If the discussion favors other solutions, I will update the upcoming PR accordingly.
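The backward-compatible behavior proposed in this comment might look roughly like the sketch below. Everything here is illustrative: `add_efo_columns`, both mapping tables, the group names, and the `*_PLACEHOLDER_*` codes are made up for the example and are not real ontology IDs or the annotator's actual interface.

```r
# Hypothetical mapping tables; the codes are placeholders, not real EFO IDs.
efo_mondo_map <- data.frame(
  cancer_group = c("Neuroblastoma", "Ependymoma"),
  efo_code = c("EFO_PLACEHOLDER_1", "EFO_PLACEHOLDER_2"),
  stringsAsFactors = FALSE
)
gtex_subgroup_efo_map <- data.frame(
  gtex_subgroup = c("Example cell line subgroup"),
  efo_code = c("EFO_PLACEHOLDER_3"),
  stringsAsFactors = FALSE
)

# "EFO" keeps its existing meaning (cancer_group EFO code), so existing
# callers are unaffected. The GTEx EFO code is added only on request.
add_efo_columns <- function(input_table, columns_to_add) {
  if ("EFO" %in% columns_to_add) {
    idx <- match(input_table$Disease, efo_mondo_map$cancer_group)
    input_table$EFO <- efo_mondo_map$efo_code[idx]
  }
  if ("GTEx_tissue_subgroup_EFO" %in% columns_to_add) {
    idx <- match(
      input_table$GTEx_tissue_subgroup,
      gtex_subgroup_efo_map$gtex_subgroup
    )
    input_table$GTEx_tissue_subgroup_EFO <- gtex_subgroup_efo_map$efo_code[idx]
  }
  input_table
}
```

Keeping the old column name for the old meaning is what avoids a coordinated update across all modules; the cost is that the two EFO columns are named asymmetrically.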
Ok sounds good. I didn't realize until we made those gtex lists that the cells are in EFO and not UBERON, which is unfortunately complicating things. For now, we can just use Uberon for GTEX and ask FNL on Wednesday how to handle the cells for gtex, so we don't do too much work for those two tissues.
Thank you for the note! Good to know the reason for having those two EFO codes.
I will add the following mappings in the part 3 PR:
- GTEx_tissue_group -> GTEx_tissue_group_UBERON
- GTEx_tissue_subgroup -> GTEx_tissue_subgroup_UBERON
Then, the GTEx EFO handling will be added in the part 4 PR, if necessary, based on the discussions.
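As a rough illustration of the two UBERON mappings listed above, a generic lookup helper could add each column in turn. This is a sketch under assumptions: `annotate_by_lookup` is a hypothetical helper, the small data frames stand in for data/uberon-map-gtex-group.tsv and data/uberon-map-gtex-subgroup.tsv, their column names (`gtex_group`, `gtex_subgroup`, `uberon_code`) are inferred from this thread, and the codes are placeholders.

```r
# Hypothetical stand-ins for the two UBERON mapping files.
uberon_group_map <- data.frame(
  gtex_group = c("Brain", "Liver"),
  uberon_code = c("UBERON_PLACEHOLDER_1", "UBERON_PLACEHOLDER_2"),
  stringsAsFactors = FALSE
)
uberon_subgroup_map <- data.frame(
  gtex_subgroup = c("Brain - Cortex", "Liver"),
  uberon_code = c("UBERON_PLACEHOLDER_3", "UBERON_PLACEHOLDER_2"),
  stringsAsFactors = FALSE
)

# Add new_col to input_table by looking up input_col in map_table[[key_col]].
# Row order is preserved; values with no match become NA.
annotate_by_lookup <- function(input_table, input_col, map_table,
                               key_col, value_col, new_col) {
  stopifnot(
    input_col %in% colnames(input_table),
    !anyDuplicated(map_table[[key_col]])
  )
  idx <- match(input_table[[input_col]], map_table[[key_col]])
  input_table[[new_col]] <- map_table[[value_col]][idx]
  input_table
}

deseq_rows <- data.frame(
  GTEx_tissue_group = c("Brain", "Liver", "Unmapped"),
  GTEx_tissue_subgroup = c("Brain - Cortex", "Liver", "Unmapped"),
  stringsAsFactors = FALSE
)
annotated <- annotate_by_lookup(
  deseq_rows, "GTEx_tissue_group",
  uberon_group_map, "gtex_group", "uberon_code",
  "GTEx_tissue_group_UBERON"
)
annotated <- annotate_by_lookup(
  annotated, "GTEx_tissue_subgroup",
  uberon_subgroup_map, "gtex_subgroup", "uberon_code",
  "GTEx_tissue_subgroup_UBERON"
)
```

Using `match()` rather than `merge()` keeps the input row order unchanged, which matters for long-format tables that downstream code may expect to be in a fixed order.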
There is a table with the result of a test run on the full set on the HPC cluster (Respublica):
/mnt/isilon/opentargets/DESeq2/DESEQ2_TABLE_V6.txt (approx. 65 GB).
I'll subset it and create a smaller example table when I get back to my computer.
@afarrel thanks. If you can just create a smaller table, gzip the file, and add it to @sangeetashukla's PR, that would be great.
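One possible way to make such an example file without loading the full table into memory is sketched below. It assumes the table is a text file whose first line is a header, and it simply keeps the first rows rather than sampling; `subset_and_gzip` and the output file name in the usage comment are hypothetical.

```r
# Copy the header line plus the first n_rows data lines of a large text
# table into a gzip-compressed file. Only those lines are read.
subset_and_gzip <- function(input_path, output_path, n_rows = 1000) {
  con_in <- file(input_path, open = "r")
  on.exit(close(con_in), add = TRUE)
  # n_rows + 1 accounts for the header line.
  lines <- readLines(con_in, n = n_rows + 1)

  con_out <- gzfile(output_path, open = "w")
  on.exit(close(con_out), add = TRUE)
  writeLines(lines, con_out)

  # Number of lines written, including the header.
  invisible(length(lines))
}

# Example call (output name is made up for illustration):
# subset_and_gzip(
#   "/mnt/isilon/opentargets/DESeq2/DESEQ2_TABLE_V6.txt",
#   "deseq2_table_example.tsv.gz",
#   n_rows = 1000
# )
```

Taking the first rows is the cheapest option for a 65 GB file, but the sample will only cover whichever comparisons happen to appear first in the table.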
Hi @jharenza @afarrel @sangeetashukla. Just to note here that GTEx EFO codes will not be included in the annotator API, and they will also not be provided to the FNL team, because the cell lines may have no biological context for being searched on the PedOT website, according to the 8am meeting this morning.
Closed with https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/71 merged.
What analysis module should be updated and why?
The long-format-table-utils/annotator submodule needs to be updated for v7 data release.
- ensg-hugo-rmtl-v1-mapping.tsv to ensg-hugo-rmtl-mapping.tsv.
- long-format-table-utils/annotator submodule.
What changes need to be made? Please provide enough detail for another participant to make the update.
- long-format-table-utils/annotator/annotator-api.R. Add tests for new code and interface in long-format-table-utils/annotator/tests/test_annotate_long_format_table.R.
- long-format-table-utils/annotator/annotator-cli.R. Add tests for new code and interface in long-format-table-utils/annotator/tests/test_annotator_cli.R.
- long-format-table-utils/README.md.
What input data should be used? Which data were used in the version being updated?
data/ensg-hugo-rmtl-mapping.tsv
data/uberon-map-gtex-group.tsv
data/uberon-map-gtex-subgroup.tsv
analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
data/efo-mondo-map.tsv
analyses/fusion_filtering/references/genelistreference.txt
When do you expect the revised analysis will be completed?
2-4 days after https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/61 is merged.
Who will complete the updated analysis?
@logstar