PDCMFinder / pdxfinder

PDX Finder performs integration, standardization, analysis and visualization of the complex and diverse data associated with PDX mouse models for the cancer community.
https://www.pdxfinder.org/
Apache License 2.0

Harmonize CNA and Expression templates #727

Closed Afollet closed 3 years ago

Afollet commented 3 years ago

Neither the CNA nor the expression template headers are consistent throughout the database. These will need to be harmonized to one standard header for use in Postgres.

Furthermore, many of the columns in CNA and expression are sparse, redundant, or only used for obscure formats. These will be considered for omission from the harmonized headers.
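
To decide which columns belong in a single standard header, it can help to tabulate which column names every provider shares and which appear in only some templates. The following is a minimal sketch of that idea, not the code used in this repository; the provider names and column names in `example` are hypothetical placeholders.

```python
from collections import Counter


def header_report(headers_by_provider):
    """Split column names into those shared by all providers and the rest.

    headers_by_provider maps a provider name to its list of header columns.
    Returns (shared, sparse), both sorted alphabetically.
    """
    counts = Counter()
    for columns in headers_by_provider.values():
        # Use a set so a duplicated column in one sheet is counted once.
        counts.update(set(columns))
    total = len(headers_by_provider)
    shared = sorted(c for c, n in counts.items() if n == total)
    sparse = sorted(c for c, n in counts.items() if n < total)
    return shared, sparse


# Hypothetical headers, for illustration only.
example = {
    "provider_a": ["sample_id", "symbol", "log2r_cna", "gistic_value"],
    "provider_b": ["sample_id", "symbol", "log10r_cna"],
    "provider_c": ["sample_id", "symbol", "log2r_cna", "picnic_value"],
}

shared, sparse = header_report(example)
print("shared by all:", shared)
print("candidates for review:", sparse)
```

Columns in the second list are not necessarily redundant; they are only the ones that need a decision before the header is fixed.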

ACCEPTANCE CRITERION
Afollet commented 3 years ago

Get the table mapping cytobands to sequence coordinates here: http://genome.ucsc.edu/cgi-bin/hgTables

Discussion of staining types: https://github.com/monarch-initiative/dipper/issues/43

Afollet commented 3 years ago

Pushing CNA templates. Removed LIH CNA data until it can be further harmonized.

Throughout these transformations I investigated the use of some of the columns in the data. All columns are used, somewhere:

gistic is used in PPTC.

picnic is used by Charles River.

log10r and fold change I could only find in JAX. Fold change seems more relevant than log10r, since log10r is not as commonly used in CNA. Also, if we had log2r it would be easy to resolve log10r.

The P-value column was removed. It was not used anywhere and appeared on only 2 sheets. I'm not sure of its intent or use in CNA data. It seems like Z-score would be more applicable and easier to work with.
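
The relationship between these columns is a change of logarithm base. The sketch below assumes that log10r is the base-10 logarithm of the copy-number ratio r and that fold change is r itself; the thread does not confirm either definition, and the function names are illustrative.

```python
import math


def log10r_to_fold_change(log10r):
    """Recover the ratio r from log10(r)."""
    return 10 ** log10r


def log10r_to_log2r(log10r):
    """Change of base: log2(r) = log10(r) / log10(2)."""
    return log10r / math.log10(2)


def fold_change_to_log2r(fold_change):
    """log2 of a strictly positive ratio."""
    return math.log2(fold_change)


# A ratio of 2 (one doubling) should give log2r of 1.
value = math.log10(2)
print(log10r_to_fold_change(value))
print(log10r_to_log2r(value))
```

Under those assumptions any one of the three values determines the other two, so keeping a single column would lose no information.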

Afollet commented 3 years ago

Expression data is fine. I had to add the column "ncbi_gene_id" to the UOM-BC, UOC-BC and NKI expression sheets. For CRL, "transcript_id" had to be changed to "ensembl_transcript_id".
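
For reference, both kinds of change (renaming a column and adding a missing one) can be scripted. This is a minimal sketch that assumes the sheets are tab-separated text with a single header row; the function name and the sample row are hypothetical, and it is not the procedure actually used for these sheets.

```python
import csv
import io


def harmonize_expression_sheet(text, renames=None, required=()):
    """Rename header columns and append any required columns that are missing.

    text is the full content of a tab-separated sheet. Added columns are
    filled with empty strings. Returns the rewritten sheet as a string.
    """
    renames = renames or {}
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    fieldnames = [renames.get(name, name) for name in (reader.fieldnames or [])]
    for column in required:
        if column not in fieldnames:
            fieldnames.append(column)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        new_row = {renames.get(key, key): value for key, value in row.items()}
        for column in required:
            new_row.setdefault(column, "")
        writer.writerow(new_row)
    return out.getvalue()


# Hypothetical one-row sheet, for illustration only.
sheet = "sample_id\ttranscript_id\tsymbol\nS1\tENST_EXAMPLE\tTP53\n"
result = harmonize_expression_sheet(
    sheet,
    renames={"transcript_id": "ensembl_transcript_id"},
    required=["ncbi_gene_id"],
)
print(result)
```

The new "ncbi_gene_id" column is left empty here; filling it would need a separate gene-symbol lookup that this sketch does not attempt.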

Afollet commented 3 years ago

@CsabaHalmagyi done