This pull request adds new AnnotationHub functions and improves the gene annotation process. It also includes updates, fixes, and tests for that process.
Differences between old vs. new annotation:
Problem: Some gene_title values are listed as "uncharacterized protein", while with biomaRt we had NAs.
Solution: Keep "uncharacterized protein" and make the test pass. Alternatively, I could set "uncharacterized protein" to NA in the gene annotation function.
> expect_equal(result$gene_title, data$gene_title)
Error: result$gene_title (`actual`) not equal to data$gene_title (`expected`).
result$gene_title (`actual`) not equal to data$gene_title (`expected`).
actual | expected
[1] "upstream of RpIII128" | "upstream of RpIII128" [1]
[2] "Bomanin Tailed 3" | "Bomanin Tailed 3" [2]
[3] "uncharacterized protein" - NA [3]
[4] "uncharacterized protein" - NA [4]
[5] "uncharacterized protein" - NA [5]
[6] "uncharacterized protein" - NA [6]
[7] "uncharacterized protein" - NA [7]
[8] "uncharacterized protein" - NA [8]
[9] "uncharacterized protein" - NA [9]
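The alternative mentioned for gene_title (mapping the placeholder to NA) could look roughly like the sketch below. The helper name `normalize_gene_title` is hypothetical and not part of playbase; it only illustrates the idea.

```r
# Hypothetical helper, not part of playbase: turn the placeholder title
# "uncharacterized protein" into NA so it lines up with the old biomaRt output.
normalize_gene_title <- function(gene_title) {
  is_placeholder <- !is.na(gene_title) & gene_title == "uncharacterized protein"
  gene_title[is_placeholder] <- NA_character_
  gene_title
}

titles <- c("upstream of RpIII128", "Bomanin Tailed 3", "uncharacterized protein")
normalized <- normalize_gene_title(titles)
```

With this applied before the comparison, `expect_equal(result$gene_title, data$gene_title)` would compare NA against NA for those rows.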
Problem: Differences in gene biotype.
Solution: Replaced "-" with "_" to match old versions of playbase.
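As a minimal sketch of that replacement (the function name and the example biotype values are assumptions for illustration, not taken from playbase):

```r
# Hypothetical helper: replace hyphens with underscores in biotype labels.
# fixed = TRUE treats "-" as a literal character rather than a regex.
normalize_biotype <- function(biotype) {
  gsub("-", "_", biotype, fixed = TRUE)
}

biotypes <- c("protein-coding", "lncRNA", NA)
normalized_biotypes <- normalize_biotype(biotypes)
```

`gsub` leaves NA entries as NA, so missing biotypes are not affected.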
Problem: Differences in human_orthologs.
Solution: All species reach at least an 80% match between old and new playbase. Some features in the new version have no human ortholog, as seen below. Getting to 100% will be very difficult.
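The exact way the 80% figure is computed is not shown here. One plausible definition, sketched below, is the fraction of features whose ortholog is identical in both versions. The function name, the example symbols, and the choice to count NA on both sides as a match are all assumptions.

```r
# Hypothetical helper: fraction of features whose human ortholog agrees
# between the old and new annotation. NA in both versions counts as a match;
# NA in only one version counts as a mismatch.
ortholog_match_rate <- function(old, new) {
  stopifnot(length(old) == length(new))
  both_na <- is.na(old) & is.na(new)
  both_equal <- !is.na(old) & !is.na(new) & old == new
  mean(both_na | both_equal)
}

old_orthologs <- c("TP53", "BRCA1", "EGFR", "MYC", NA)
new_orthologs <- c("TP53", "BRCA1", NA, "MYC", NA)
rate <- ortholog_match_rate(old_orthologs, new_orthologs)
```

A test could then assert something like `expect_gte(rate, 0.8)` per species instead of requiring exact equality.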
Problem: gene_name does not match between old and new for rat (ANNOTHUB only).
Solution: Switched gene_name from symbol to feature; it would be a breaking change otherwise.
requires https://github.com/bigomics/omicsplayground/pull/948
Problem: UNIPROT is read as ACCNUM.
Solution: Keep UNIPROT as ACCNUM, as the annotation results from each probe type are the same.