Updated analysis: update `annotator` submodule for v7 data release

logstar commented 3 years ago

What analysis module should be updated and why?

The long-format-table-utils/annotator submodule needs to be updated for v7 data release.

Add Uberon annotations for GTEx tissue groups and subgroups.
Change ensg-hugo-rmtl-v1-mapping.tsv to ensg-hugo-rmtl-mapping.tsv.
Confirm that the v7 annotation data files conform to the requirements of the long-format-table-utils/annotator submodule.

What changes need to be made? Please provide enough detail for another participant to make the update.

Update long-format-table-utils/annotator/annotator-api.R. Add tests for new code and interface in long-format-table-utils/annotator/tests/test_annotate_long_format_table.R.
Update long-format-table-utils/annotator/annotator-cli.R. Add tests for new code and interface in long-format-table-utils/annotator/tests/test_annotator_cli.R.
Update long-format-table-utils/README.md.

What input data should be used? Which data were used in the version being updated?

(Updated) data/ensg-hugo-rmtl-mapping.tsv
(New) data/uberon-map-gtex-group.tsv
(New) data/uberon-map-gtex-subgroup.tsv
analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
(Updated) data/efo-mondo-map.tsv
analyses/fusion_filtering/references/genelistreference.txt

When do you expect the revised analysis will be completed?

2-4 days after https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/61 is merged.

Who will complete the updated analysis?

@logstar

logstar commented 3 years ago

@jharenza I have a few questions about the specific implementations of updating annotator for v7.

1. Should I only require the columns that need to be joined by the input table and annotation table? This would handle GTEx->Uberon mapping and Disease/cancer_group->EFO/MONDO mapping without providing both GTEx and Disease columns, so it would be convenient for the developers of analysis modules that only have Disease column or GTEx column. Currently, all Gene_symbol, Gene_Ensembl_ID, and Disease are required in the input table, even if they are not used for joining annotation tables, in order to ensure that all tables have the required columns. However, it seems that SNV, CNV, and fusion tables will not need the GTEx columns, as the tables are created only using tumor samples.
2. Should I add all of the following mappings, according to uberon-map-gtex-group.tsv and uberon-map-gtex-subgroup.tsv?
   - gtex_subgroup -> uberon_code
   - gtex_subgroup -> efo_code
   - gtex_subgroup -> uberon_description_gtex_subgroup
   - gtex_group -> uberon_code
3. Do the following column names that would be used in the input/output tables look good?
   - GTEx_tissue_subgroup for gtex_subgroup
   - UBERON for uberon_code, in order to be consistent with previous EFO and MONDO columns in the current annotator interface
   - EFO for efo_code
   - UBERON_description_GTEx_tissue_subgroup for uberon_description_gtex_subgroup
   - GTEx_tissue_group for gtex_group
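
The first question (requiring only the columns that are actually joined) could look roughly like the base R sketch below. This is not the annotator's actual implementation: the function name `check_join_columns` and the lookup `join_column_by_annotation` are hypothetical, and the annotation-to-join-column pairs are illustrative assumptions based on the column names proposed in this comment.

```r
# Hypothetical lookup: annotation column -> input column needed to join it.
# The pairs are assumptions for illustration, not the annotator's interface.
join_column_by_annotation <- list(
  EFO = "Disease",
  MONDO = "Disease",
  UBERON = "GTEx_tissue_subgroup"
)

# Require only the join columns of the annotations that were requested.
check_join_columns <- function(input_table, annotations) {
  unknown <- setdiff(annotations, names(join_column_by_annotation))
  if (length(unknown) > 0) {
    stop("Unknown annotation(s): ", paste(unknown, collapse = ", "))
  }
  required <- unique(unlist(join_column_by_annotation[annotations]))
  missing_cols <- setdiff(required, colnames(input_table))
  if (length(missing_cols) > 0) {
    stop("Input table is missing join column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  invisible(required)
}

# A tumor-only table without GTEx columns passes when only
# disease annotations are requested.
snv_table <- data.frame(
  Gene_symbol = "TP53",
  Disease = "Neuroblastoma",
  stringsAsFactors = FALSE
)
required_cols <- check_join_columns(snv_table, c("EFO", "MONDO"))
```

With this approach, requesting `UBERON` for the same table would stop with an error naming `GTEx_tissue_subgroup`, while SNV, CNV, and fusion tables would not need to carry GTEx columns at all.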

jharenza commented 3 years ago

Hi @logstar

> Should I only require the columns that need to be joined by the input table and annotation table?

Yes, this would be good

> Gene_symbol, Gene_Ensembl_ID, and Disease

We can have these as column names in each of the modules to make this easier for annotator use

> UBERON_description_GTEx_tissue_subgroup for uberon_description_gtex_subgroup

This is not necessary - I left it in there for tracking/historical purposes from @sangeetashukla's searching, so please do not add.

The rest looks good to me!

logstar commented 3 years ago

Thank you for the reply. I will update accordingly.

logstar commented 3 years ago

@jharenza I think I would need to have two EFO annotation columns, Disease_EFO and GTEx_tissue_subgroup_EFO, in order to handle table rows that have both Disease and GTEx_tissue_subgroup like the DESeq module.

However, this would break the backward compatibility of annotation column names, so I would like to bring this up here before working on it. I will add a changelog section in the README.md to record this change.

jharenza commented 3 years ago

> table rows that have both Disease and GTEx_tissue_subgroup like the DESeq module.

Should we ask about this in the #ot-portal-content channel in Slack, i.e., how do they want them in the JSONL files?

logstar commented 3 years ago

I agree. I wonder if the DESeq table header is available outside of Box, as I do not have Box access yet. Sorry if I missed the link somewhere earlier. I will check the header before asking in the #ot-portal-content channel in Slack.

jharenza commented 3 years ago

I am cc-ing @sangeetashukla and @afarrel here to see whether they can provide it. I also asked @sangeetashukla to update her module with a sample output file.

logstar commented 3 years ago

@jharenza Thank you for checking.

Assuming there is no Excel header file, I referred to the latest DESeq PR, and the EFO column is only for cancer_group, as shown in the code below. I also assume that the UBERON codes will be added using the annotator API, so they are not available in the current Final_Data_Table.

# Columns are named with `=`; using `<-` inside data.frame() would assign
# variables in the calling environment and produce mangled column names.
Final_Data_Table <- data.frame(
  datasourceId = paste(strsplit(histology_filtered[I],split="_")[[1]][1],"vs_GTex",sep="_"),
  datatypeId = "rna_expression",
  cohort = paste(unique(hist$cohort[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]),collapse=";",sep=";"),
  gene_symbol = rownames(Result),
  gene_id = ENSG_Hugo$ensg_id[match(rownames(Result),ENSG_Hugo$gene_symbol)],
  RMTL = ENSG_Hugo$rmtl[match(rownames(Result),ENSG_Hugo$gene_symbol)],
  EFO = ifelse(length(which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))) >= 1, EFO_MONDO$efo_code[which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))], "" ),
  MONDO = ifelse(length(which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))) >= 1,EFO_MONDO$mondo_code[which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))],""),
  comparisonId = gsub(" |/|;|:|\\(|)","_",paste(histology_filtered[I],GTEX_filtered[J],sep="_v_")),
  cancer_group = paste(unlist(strsplit(histology_filtered[I],split="_"))[-1],collapse=" "),
  cancer_group_Count = Cancer.Hist_Hits,
  GTEx = GTEX_filtered[J],
  GTEx_Count = GTEX_Hits,
  cancer_group_MeanTpm = Histology_MEAN_TPMs,
  GTEx_MeanTpm = GTEX_MEAN_TPMs,
  Result, stringsAsFactors = FALSE
) # end of Final_Data_Table

(Link to the code)

Therefore, I think we could ask in the #ot-portal-content channel about whether EFO code is needed for both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup.

@afarrel and @sangeetashukla, I was wondering if you have any suggestions on how the annotator API should be implemented for you to use in your code. Currently, the issue is that both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup have corresponding EFO code, and I was wondering if I need to create both Disease_EFO and GTEx_tissue_subgroup_EFO annotation columns. This issue is also related to the expected attribute names in the DESeq JSONL file, so we might need to discuss in the #ot-portal-content channel as well.

logstar commented 3 years ago

@jharenza I will work on adding v7 GTEx annotations without breaking backward compatibility, as breaking it would require too much effort from all module developers to update their code, open PRs, and review them. This ticket is also blocking all GTEx JSONL modules.

I will keep using EFO for the cancer_group EFO code and add GTEx_tissue_subgroup_EFO as a new annotation column. If the input table has both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup, requesting EFO will only add the cancer_group EFO code; GTEx_tissue_subgroup_EFO must also be requested explicitly to add the GTEx one.

If the discussion favors other solutions, I will update the upcoming PR accordingly.
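
As a rough illustration of the backward-compatible behavior proposed in this comment, the base R sketch below keeps `EFO` tied to `Disease` and adds `GTEx_tissue_subgroup_EFO` only when it is requested. The function `add_efo_columns`, the two toy mapping tables, and the placeholder codes are all hypothetical and are not taken from the annotator or from the v7 data files.

```r
# Toy mapping tables; the codes are placeholders, not real ontology IDs.
disease_efo_map <- data.frame(
  cancer_group = "Neuroblastoma",
  efo_code = "EFO_disease_placeholder",
  stringsAsFactors = FALSE
)
gtex_subgroup_efo_map <- data.frame(
  gtex_subgroup = "Cells - Cultured fibroblasts",
  efo_code = "EFO_gtex_placeholder",
  stringsAsFactors = FALSE
)

# Add only the EFO columns that the caller asks for.
add_efo_columns <- function(input_table, columns_to_add) {
  if ("EFO" %in% columns_to_add) {
    # EFO keeps its existing meaning: the code of the Disease/cancer_group.
    idx <- match(input_table$Disease, disease_efo_map$cancer_group)
    input_table$EFO <- disease_efo_map$efo_code[idx]
  }
  if ("GTEx_tissue_subgroup_EFO" %in% columns_to_add) {
    # The GTEx code goes into its own, explicitly requested column.
    idx <- match(input_table$GTEx_tissue_subgroup,
                 gtex_subgroup_efo_map$gtex_subgroup)
    input_table$GTEx_tissue_subgroup_EFO <- gtex_subgroup_efo_map$efo_code[idx]
  }
  input_table
}

# A DESeq-like row that has both a Disease and a GTEx subgroup.
deseq_like <- data.frame(
  Disease = "Neuroblastoma",
  GTEx_tissue_subgroup = "Cells - Cultured fibroblasts",
  stringsAsFactors = FALSE
)
only_disease <- add_efo_columns(deseq_like, "EFO")
both <- add_efo_columns(deseq_like, c("EFO", "GTEx_tissue_subgroup_EFO"))
```

Existing callers that request only `EFO` get the same column as before, which is the point of keeping the old name unchanged.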

jharenza commented 3 years ago

Ok sounds good. I didn't realize until we made those gtex lists that the cells are in EFO and not UBERON, which is unfortunately complicating things. For now, we can just use Uberon for GTEX and ask FNL on Wednesday how to handle the cells for gtex, so we don't do too much work for those two tissues.

logstar commented 3 years ago

Thank you for the note! Good to know the reason for having those two EFO codes.

I will add the following mappings in the part 3 PR:

GTEx_tissue_group -> GTEx_tissue_group_UBERON
GTEx_tissue_subgroup -> GTEx_tissue_subgroup_UBERON

Then, the GTEx EFO handling will be added in the part 4 PR, if necessary, based on the discussions.
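
The two mappings listed above could be added with a lookup that preserves the input rows and their order, as in the base R sketch below. It is only an assumption-laden illustration: the function `annotate_gtex_uberon` is hypothetical, the mapping tables are toy stand-ins that reuse the `gtex_group`, `gtex_subgroup`, and `uberon_code` column names mentioned earlier in this thread, and the codes are placeholders rather than real UBERON IDs.

```r
# Toy stand-ins for uberon-map-gtex-group.tsv and uberon-map-gtex-subgroup.tsv.
uberon_group_map <- data.frame(
  gtex_group = c("Brain", "Liver"),
  uberon_code = c("UBERON_placeholder_brain", "UBERON_placeholder_liver"),
  stringsAsFactors = FALSE
)
uberon_subgroup_map <- data.frame(
  gtex_subgroup = c("Brain - Cortex", "Liver"),
  uberon_code = c("UBERON_placeholder_cortex", "UBERON_placeholder_liver"),
  stringsAsFactors = FALSE
)

# Look up each row's code with match(), which keeps row count and order
# and yields NA where a group or subgroup has no mapping.
annotate_gtex_uberon <- function(input_table) {
  group_idx <- match(input_table$GTEx_tissue_group,
                     uberon_group_map$gtex_group)
  subgroup_idx <- match(input_table$GTEx_tissue_subgroup,
                        uberon_subgroup_map$gtex_subgroup)
  input_table$GTEx_tissue_group_UBERON <-
    uberon_group_map$uberon_code[group_idx]
  input_table$GTEx_tissue_subgroup_UBERON <-
    uberon_subgroup_map$uberon_code[subgroup_idx]
  input_table
}

gtex_table <- data.frame(
  GTEx_tissue_group = c("Brain", "Liver", "Unmapped group"),
  GTEx_tissue_subgroup = c("Brain - Cortex", "Liver", "Unmapped subgroup"),
  stringsAsFactors = FALSE
)
annotated <- annotate_gtex_uberon(gtex_table)
```

Using `match()` rather than `merge()` is a design choice for the sketch: it cannot duplicate or reorder rows, and unmapped tissues show up as `NA` instead of being silently dropped.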

afarrel commented 3 years ago

There is a table with the result of a test run on the full set on the HPC cluster (Respublica):

/mnt/isilon/opentargets/DESeq2/DESEQ2_TABLE_V6.txt (approx. 65 GB)

I'll subset it and create a smaller example table when I get back to my computer.

jharenza commented 3 years ago

@afarrel thanks. If you can just create a smaller table, gzip the file, and add it to @sangeetashukla's PR, that would be great.

logstar commented 3 years ago

Hi @jharenza @afarrel @sangeetashukla. Just to note here that, according to the 8am meeting this morning, GTEx EFO codes will not be included in the annotator API and will not be provided to the FNL team, because the cell lines may have no biological context for searching on the PedOT website.

logstar commented 3 years ago

Closed with https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/71 merged.

d3b-center / ticket-tracker-OPC