bigomics / playbase

Core back-end functionality and logic for OmicsPlayground

Replace BioMart by annotationhub for gene annotation #109

Closed mauromiguelm closed 4 months ago

mauromiguelm commented 6 months ago

Requires https://github.com/bigomics/omicsplayground/pull/948

This pull request adds new AnnotationHub functions and improves the gene annotation process. It also includes updates, fixes, and tests related to the annotation process.

Differences between the old and new annotation:

Problem: Some gene_title values are listed as "uncharacterized protein", while in BioMart they were NA.
Solution: Keep "uncharacterized protein" and update the test so it passes. Alternatively, I could set "uncharacterized protein" to NA in the gene annotation function.

> expect_equal(result$gene_title, data$gene_title)
Error: result$gene_title (`actual`) not equal to data$gene_title (`expected`).

result$gene_title (`actual`) not equal to data$gene_title (`expected`).

     actual                    | expected                             
 [1] "upstream of RpIII128"    | "upstream of RpIII128" [1]           
 [2] "Bomanin Tailed 3"        | "Bomanin Tailed 3"     [2]           
 [3] "uncharacterized protein" - NA                     [3]           
 [4] "uncharacterized protein" - NA                     [4]           
 [5] "uncharacterized protein" - NA                     [5]           
 [6] "uncharacterized protein" - NA                     [6]           
 [7] "uncharacterized protein" - NA                     [7]           
 [8] "uncharacterized protein" - NA                     [8]           
 [9] "uncharacterized protein" - NA                     [9]   
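The alternative mentioned above (mapping "uncharacterized protein" to NA) could look roughly like the sketch below. The helper name `normalize_gene_title` is hypothetical and not part of playbase; it only illustrates the idea on a few values taken from the test output.

```r
# Hypothetical helper (not a playbase function): treat the placeholder title
# "uncharacterized protein" as missing, matching the old BioMart behaviour.
normalize_gene_title <- function(titles) {
  is_placeholder <- !is.na(titles) & tolower(titles) == "uncharacterized protein"
  titles[is_placeholder] <- NA_character_
  titles
}

actual <- c("upstream of RpIII128", "Bomanin Tailed 3", "uncharacterized protein")
normalized <- normalize_gene_title(actual)
normalized
```

Whether to apply this inside the annotation function or only in the test is a design choice; this PR keeps the new titles as they are.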

Problem: gene_biotype values differ ("protein-coding" vs. "protein_coding").
Solution: Replaced "-" with "_" to match old versions of playbase.

Failure (test-pgx-ensembl.R:51:5): ngs.getGeneAnnotation returns correct annotation for Drosophila melanogaster
result$gene_biotype (`actual`) not equal to data$gene_biotype (`expected`).

`actual`:   "protein-coding" "protein-coding" "protein-coding" "protein-coding" "protein-coding" "protein-coding" "protein-coding" "protein-coding" "protein-coding" "protein-coding" and 10 more...
`expected`: "protein_coding" "protein_coding" "protein_coding" "protein_coding" "protein_coding" "protein_coding" "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...    
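A minimal sketch of the replacement described above, assuming the biotype column is a plain character vector. The function name `normalize_biotype` is hypothetical and only stands in for wherever the substitution is done in playbase.

```r
# Hypothetical helper (not a playbase function): replace hyphens with
# underscores so that "protein-coding" becomes "protein_coding".
normalize_biotype <- function(biotype) {
  gsub("-", "_", biotype, fixed = TRUE)
}

actual <- c("protein-coding", "protein-coding")
fixed_biotype <- normalize_biotype(actual)
fixed_biotype
```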

Problem: human_ortholog values differ between old and new playbase.
Solution: For every species, at least 80% of the orthologs match between the old and new playbase. Some features have no human ortholog in the new version, as seen below. Getting to a 100% match will be very difficult.

Failure (test-pgx-ensembl.R:101:5): ngs.getGeneAnnotation returns correct annotation for Drosophila melanogaster
result$human_ortholog (`actual`) not equal to data$human_ortholog (`expected`).

`actual[9:18]`:   "" "" "" ""      ""       ""     ""      "" "RFC3" "FOXL2"
`expected[9:18]`: "" "" "" "EIF4E" "HEXIM2" "LIPF" "NAA16" "" "RFC3" "FOXL2"
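One way to express an "at least 80% match" check is to compute the fraction of positions where the old and new values agree. The sketch below is an assumption about how such a check could be written, not the test used in this PR; `ortholog_match_rate` is a hypothetical name, and the data are only the ten values shown in the failure above.

```r
# Hypothetical helper (not a playbase function): fraction of positions where
# the new annotation agrees with the old one. NA is treated like "".
ortholog_match_rate <- function(actual, expected) {
  stopifnot(length(actual) == length(expected))
  actual[is.na(actual)] <- ""
  expected[is.na(expected)] <- ""
  mean(actual == expected)
}

# Values copied from actual[9:18] and expected[9:18] in the failure output.
actual   <- c("", "", "", "",      "",       "",     "",      "", "RFC3", "FOXL2")
expected <- c("", "", "", "EIF4E", "HEXIM2", "LIPF", "NAA16", "", "RFC3", "FOXL2")

rate <- ortholog_match_rate(actual, expected)
rate  # 0.6 for this ten-element slice only
```

The slice shown here covers the positions where the mismatches cluster, so its rate is lower than the 80% reported for the full set of features.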

Problem: gene_name does not match between old and new for rat (ANNOTHUB only).
Solution: Switched gene_name from symbol to feature; otherwise this would be a breaking change.

Failure (test-pgx-ensembl.R:88:5): ngs.getGeneAnnotation returns correct annotation for Rat
result$gene_name (`actual`) not equal to data$gene_name (`expected`).

     actual               | expected                  
 [1] "ENSRNOG00000005935" - "A3galt2"  [1]            
 [2] "ENSRNOG00000008709" - "Arhgap32" [2]            
 [3] "ENSRNOG00000006470" - "Camk1g"   [3]            
 [4] "ENSRNOG00000010018" - "Clec4a3"  [4]            
 [5] "ENSRNOG00000019810" - "Des"      [5]            
 [6] "ENSRNOG00000001348" - "Erp29"    [6]            
 [7] "ENSRNOG00000055236" - "Gemin4"   [7]            
 [8] "ENSRNOG00000006472" - "Hspa2"    [8]            
 [9] "ENSRNOG00000009779" - "Krt8"     [9]            
[10] "ENSRNOG00000012297" - "Marchf8"  [10]   

Problem: UNIPROT probes are detected as ACCNUM.
Solution: Keep UNIPROT as ACCNUM, since the annotation results are the same for both probe types.

uniprot_genes <- c("P31749", "P04637", "Q9Y6K9", "O15111", "Q9UM73", "Q13315", "P55317", "P16070", "P22301")

expect_true(playbase::detect_probetype.ANNOTHUB(organism = "Human", probes = uniprot_genes) %in% c("UNIPROT"))

Failure (test-pgx-ensembl.R:182:3): detects UNIPROT
... %in% c("UNIPROT") is not TRUE

`actual`:   FALSE
`expected`: TRUE
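If UNIPROT probes are kept as ACCNUM, the expectation could accept either label. The sketch below uses a hard-coded stand-in for the detected probe type so it runs without playbase; `is_uniprot_like` is a hypothetical helper, and "ACCNUM" is assumed from the problem statement above rather than taken from an actual function call.

```r
# Hypothetical helper (not a playbase function): accept either label for
# UniProt accessions, since both are said to give the same annotation.
is_uniprot_like <- function(probetype) {
  probetype %in% c("UNIPROT", "ACCNUM")
}

# Stand-in for the result of playbase::detect_probetype.ANNOTHUB();
# the value is an assumption based on the problem description.
detected <- "ACCNUM"
ok <- is_uniprot_like(detected)
ok
```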