d3b-center / ticket-tracker-OPC

A repo to generate and track tickets for ped OT

Updated analysis: update `annotator` submodule for v7 data release #132

Closed by logstar 3 years ago

logstar commented 3 years ago

What analysis module should be updated and why?

The long-format-table-utils/annotator submodule needs to be updated for v7 data release.

What changes need to be made? Please provide enough detail for another participant to make the update.

What input data should be used? Which data were used in the version being updated?

When do you expect the revised analysis will be completed?

2-4 days after https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/61 is merged.

Who will complete the updated analysis?

@logstar

logstar commented 3 years ago

@jharenza I have a few questions about the specific implementations of updating annotator for v7.

jharenza commented 3 years ago

Hi @logstar

Should I only require the columns that need to be joined by the input table and annotation table?

Yes, this would be good

Gene_symbol, Gene_Ensembl_ID, and Disease

We can have these as column names in each of the modules to make this easier for annotator use

UBERON_description_GTEx_tissue_subgroup for uberon_description_gtex_subgroup

This is not necessary - I left it in there for tracking/historical purposes from @sangeetashukla's searching, so please do not add.

The rest looks good to me!

logstar commented 3 years ago

Thank you for the reply. I will update accordingly.
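
The agreed rule, requiring only the columns that are actually joined on, could be sketched roughly as below. This is a minimal illustration and not the annotator's actual code: `annotation_join_columns`, `required_input_columns`, and `check_input_columns` are hypothetical names, and the mapping from annotation columns to join columns is assumed from the column names mentioned in this thread.

```r
# Hypothetical map from each annotation column to the input column it is
# joined on. The real annotator may use different names and more entries.
annotation_join_columns <- list(
  RMTL = "Gene_Ensembl_ID",
  EFO = "Disease",
  MONDO = "Disease"
)

# Return the input columns needed for the requested annotation columns.
required_input_columns <- function(columns_to_add) {
  unknown <- setdiff(columns_to_add, names(annotation_join_columns))
  if (length(unknown) > 0) {
    stop("Unknown annotation column(s): ", paste(unknown, collapse = ", "))
  }
  unique(unlist(annotation_join_columns[columns_to_add], use.names = FALSE))
}

# Stop with an informative error if a required join column is missing.
check_input_columns <- function(input_table, columns_to_add) {
  missing_columns <- setdiff(
    required_input_columns(columns_to_add), colnames(input_table)
  )
  if (length(missing_columns) > 0) {
    stop(
      "Input table is missing required column(s): ",
      paste(missing_columns, collapse = ", ")
    )
  }
  invisible(TRUE)
}

# A gene-only table passes when only a gene-level annotation is requested,
# because Disease is not needed for that join.
gene_only <- data.frame(
  Gene_Ensembl_ID = "ENSG00000141510", stringsAsFactors = FALSE
)
check_input_columns(gene_only, "RMTL")
```

With this shape, a module that has no Disease column can still request gene-level annotations, and only requesting EFO or MONDO would raise an error for it.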

logstar commented 3 years ago

@jharenza I think I would need to have two EFO annotation columns, Disease_EFO and GTEx_tissue_subgroup_EFO, in order to handle table rows that have both Disease and GTEx_tissue_subgroup like the DESeq module.

However, this would break the backward compatibility of annotation column names, so I would like to bring this up here before working on it. I will add a changelog section in the README.md to record this change.

jharenza commented 3 years ago

table rows that have both Disease and GTEx_tissue_subgroup like the DESeq module.

Should we ask about this in the #ot-portal-content channel in Slack, i.e., how do they want them in the JSONL files?

logstar commented 3 years ago

I agree. I wonder if the DESeq table header is available outside of Box, as I do not have Box access yet. Sorry if I missed the link somewhere earlier. I will check the header before asking in the #ot-portal-content channel in Slack.

jharenza commented 3 years ago

I am cc-ing @sangeetashukla and @afarrel here to see whether they can provide it. I also asked @sangeetashukla to update her module with a sample output file.

logstar commented 3 years ago

@jharenza Thank you for checking.

Assuming there is no Excel header file, I referred to the latest DESeq PR, and the EFO column is only for cancer_group, as shown in the code below. I also assume that the UBERON codes will be added using the annotator API, so they are not available in the current Final_Data_Table.

Final_Data_Table <- data.frame(
  # Arguments use `=` so that each one names a column; `<-` inside
  # data.frame() would assign global variables and mangle the column names.
  datasourceId = paste(strsplit(histology_filtered[I],split="_")[[1]][1],"vs_GTex",sep="_"),
  datatypeId = "rna_expression",
  cohort = paste(unique(hist$cohort[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]),collapse=";",sep=";"),
  gene_symbol = rownames(Result),
  gene_id = ENSG_Hugo$ensg_id[match(rownames(Result),ENSG_Hugo$gene_symbol)],
  RMTL = ENSG_Hugo$rmtl[match(rownames(Result),ENSG_Hugo$gene_symbol)],
  # EFO and MONDO are looked up by cancer_group only; there is no GTEx EFO column.
  EFO = ifelse(length(which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))) >= 1, EFO_MONDO$efo_code[which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))], "" ),
  MONDO = ifelse(length(which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))) >= 1,EFO_MONDO$mondo_code[which(EFO_MONDO$cancer_group == unique(hist$cancer_group[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)]))],""),
  # Replace spaces, slashes, semicolons, colons, and parentheses with "_".
  comparisonId = gsub(" |/|;|:|\\(|\\)","_",paste(histology_filtered[I],GTEX_filtered[J],sep="_v_")),
  cancer_group = paste(unlist(strsplit(histology_filtered[I],split="_"))[-1],collapse=" "),
  cancer_group_Count = Cancer.Hist_Hits,
  GTEx = GTEX_filtered[J],
  GTEx_Count = GTEX_Hits,
  cancer_group_MeanTpm = Histology_MEAN_TPMs,
  GTEx_MeanTpm = GTEX_MEAN_TPMs,
  Result, stringsAsFactors = FALSE
) # end of Final_Data_Table = data.frame(

(Link to the code)

Therefore, I think we could ask in the #ot-portal-content channel about whether EFO code is needed for both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup.

@afarrel and @sangeetashukla, I was wondering if you have any suggestions on how the annotator API should be implemented for you to use in your code. Currently, the issue is that both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup have corresponding EFO codes, and I was wondering if I need to create both Disease_EFO and GTEx_tissue_subgroup_EFO annotation columns. This issue is also related to the expected attribute names in the DESeq JSONL file, so we might need to discuss it in the #ot-portal-content channel as well.

logstar commented 3 years ago

@jharenza I will work on adding v7 GTEx annotations without breaking backward compatibility, as breaking it would cost all module developers too much effort to update, open PRs, and review. This ticket is also blocking all GTEx JSONL modules.

I will keep using EFO for the cancer_group EFO code and add GTEx_tissue_subgroup_EFO as a new annotation column. If the input table has both Disease/cancer_group and GTEx_tissue_subgroup/gtex_subgroup, requesting EFO will add only the cancer_group EFO code; GTEx_tissue_subgroup_EFO must be requested explicitly to add the GTEx EFO code.

If the discussion favors other solutions, I will update the upcoming PR accordingly.
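
The behavior proposed in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `add_efo_annotations`, `disease_efo`, and `gtex_efo` are hypothetical names, the disease and tissue names are toy values, and every code in the mapping tables is a placeholder rather than a real ontology ID.

```r
# Toy mapping tables. All codes are placeholders, not real ontology IDs.
disease_efo <- data.frame(
  Disease = c("Neuroblastoma", "Ependymoma"),
  EFO = c("EFO_PLACEHOLDER_1", "EFO_PLACEHOLDER_2"),
  stringsAsFactors = FALSE
)
gtex_efo <- data.frame(
  GTEx_tissue_subgroup = "Cells - Cultured fibroblasts",
  GTEx_tissue_subgroup_EFO = "EFO_PLACEHOLDER_3",
  stringsAsFactors = FALSE
)

# Add only the annotation columns that were requested. The default keeps the
# existing behavior: EFO always means the Disease/cancer_group EFO code.
add_efo_annotations <- function(input_table, columns_to_add = "EFO") {
  if ("EFO" %in% columns_to_add) {
    idx <- match(input_table$Disease, disease_efo$Disease)
    input_table$EFO <- disease_efo$EFO[idx]
  }
  if ("GTEx_tissue_subgroup_EFO" %in% columns_to_add) {
    idx <- match(
      input_table$GTEx_tissue_subgroup, gtex_efo$GTEx_tissue_subgroup
    )
    input_table$GTEx_tissue_subgroup_EFO <-
      gtex_efo$GTEx_tissue_subgroup_EFO[idx]
  }
  input_table
}

# A DESeq-like row that has both Disease and GTEx_tissue_subgroup.
deseq_rows <- data.frame(
  Disease = "Neuroblastoma",
  GTEx_tissue_subgroup = "Cells - Cultured fibroblasts",
  stringsAsFactors = FALSE
)
default_out <- add_efo_annotations(deseq_rows)
both_out <- add_efo_annotations(
  deseq_rows, c("EFO", "GTEx_tissue_subgroup_EFO")
)
```

Using `match()` instead of a merge keeps the input row order and row count unchanged, and unmatched values simply become `NA`. Existing callers that request only EFO would see no change in their output columns.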

jharenza commented 3 years ago

Ok sounds good. I didn't realize until we made those gtex lists that the cells are in EFO and not UBERON, which is unfortunately complicating things. For now, we can just use Uberon for GTEX and ask FNL on Wednesday how to handle the cells for gtex, so we don't do too much work for those two tissues.

logstar commented 3 years ago

Thank you for the note! Good to know the reason for having those two EFO codes.

I will add the following mappings in the part 3 PR:

Then, the GTEx EFO handling will be added in the part 4 PR, if necessary, based on the discussions.

afarrel commented 3 years ago

There is a table with the result of a test run on the full set on the HPC cluster (Respublica):

/mnt/isilon/opentargets/DESeq2/DESEQ2_TABLE_V6.txt approx 65 GB.

I'll subset and create smaller example table when I get back to my computer.

jharenza commented 3 years ago

@afarrel thanks- if you can just create a smaller table and gzip the file, and add to @sangeetashukla's PR, that would be great.
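
One possible way to make such a small gzipped example without loading the full 65 GB table into memory is sketched below. It assumes the table is a plain text file with a single header line, which is not confirmed in this thread; `subset_and_gzip` is a hypothetical helper, and the demo uses a made-up temporary file rather than the real table.

```r
# Copy the header line plus the first n_rows data lines into a gzipped file.
# Only n_rows + 1 lines are ever read, so the full table is never loaded.
subset_and_gzip <- function(input_path, output_path, n_rows = 1000) {
  input_con <- file(input_path, open = "r")
  on.exit(close(input_con), add = TRUE)
  lines <- readLines(input_con, n = n_rows + 1)

  output_con <- gzfile(output_path, open = "w")
  on.exit(close(output_con), add = TRUE)
  writeLines(lines, output_con)

  invisible(length(lines))
}

# Demo on a made-up tab-delimited file with a header and 50 data rows.
demo_input <- tempfile(fileext = ".txt")
writeLines(
  c("gene_symbol\tlog2FoldChange", paste0("GENE", 1:50, "\t", 1:50)),
  demo_input
)
demo_output <- tempfile(fileext = ".txt.gz")
n_written <- subset_and_gzip(demo_input, demo_output, n_rows = 10)
```

Taking the first rows is the cheapest option, but it may cover only one comparison if the table is sorted, so a real example file might need rows sampled from several comparisons instead.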

logstar commented 3 years ago

Hi @jharenza @afarrel @sangeetashukla. Just to note here that, per the 8am meeting this morning, GTEx EFO codes will not be included in the annotator API and will not be provided to the FNL team, because the cell lines may lack the biological context to be worth searching on the PedOT website.

logstar commented 3 years ago

Closed with https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/71 merged.