MoTrPAC / motrpac-frontend

User interface and Frontend for MoTrPAC Bioinformatics Center
MIT License

Create mapping table with feature_id and vial_label (PASS1B-06, prot-pr) #126

Open jimmyzhen opened 3 years ago

jimmyzhen commented 3 years ago

To connect the PASS1B-06 DEA results data to the phenotypic data, we need to map each unique feature_id (present in the DEA results) to the list of associated vial_label values (present in the phenotypic data) for each of the tissues in a given assay.

The two-column mapping table could consist of just the feature_id and vial_labels fields. For prot-pr, the needed data can be extracted from the *-prot-pr_ratio-results.txt result file produced by the pipeline.
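As a rough illustration of the extraction described above, here is a minimal Python sketch. It rests on assumptions that are not confirmed anywhere in this issue: that the results file is tab-delimited with one row per feature, that the identifier column is named `protein_id` (hypothetical for this file), that vial-label columns are the ones whose header is all digits, and that empty, `NA`, or `NaN` cells mean the feature was not measured in that vial. The function name `build_mapping` and the sample data are made up for the example.

```python
import csv
import io

MISSING = {"", "NA", "NAN"}  # assumed missing-value markers, compared in upper case


def build_mapping(tsv_text, id_column="protein_id"):
    """Map each feature ID to the vial labels that have a value for it."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mapping = {}
    for row in reader:
        feature_id = row[id_column]
        labels = mapping.setdefault(feature_id, [])
        for column, value in row.items():
            # Assumption: only all-digit headers are vial labels;
            # everything else is treated as an annotation column.
            if column is None or not column.isdigit():
                continue
            if value is None or value.strip().upper() in MISSING:
                continue
            if column not in labels:
                labels.append(column)
    return mapping


# Made-up sample table: two features, two vial-label columns.
SAMPLE = (
    "protein_id\tgene_symbol\t90001\t90002\n"
    "NP_000001.1\tAbc\t0.5\tNA\n"
    "NP_000002.1\tDef\t\t1.2\n"
)

mapping = build_mapping(SAMPLE)
```

With this sample, each feature maps only to the vial whose cell holds a value. The column-selection rule is the part most likely to need adjusting against the real file layout.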

jimmyzhen commented 2 years ago

@cteng585, I had the opportunity to review the feature_id to vial_labels lookup/mapping tables you created for the prot-pr and prot-ph tissues.

Did you combine the 2 assays into one table for a given tissue? If so, can you elaborate on the reason for doing so?

Furthermore, please validate the feature_ids values in these tables. Can you elaborate on why some of the feature_ids values contain instances of NaN, 353256.0 (a float type), NP_001001504.1|XP_006249164.1|XP_006249165.1..., or even strings that don't appear to be feature IDs?

Lastly, my understanding of our objective is that the mapping would be one unique feature_id associated with a list of vial_labels. So can you elaborate on your decision of defining the feature_ids value as an array (or a tuple) type?

cteng585 commented 2 years ago

@jimmyzhen Proteomics tables are different from metabolomics and genomics tables in that there appear to be multiple identifying columns, and I wasn't sure which column is the most "identifying". As a placeholder, feature_id is currently an array type.

For example, a given feature might be identified by an NCBI Protein ID (NP_001001504.1), an NCBI Reference Sequence (XP_006249164.1), an Entrez ID (353256), or a UniProt ID (Q5U2Y1). This also doesn't consider the PTM ID that the data uses. Since I'm waiting for confirmation on what the "best" feature ID to use would be (i.e. what we want to allow users to search for), this array is a placeholder of potential feature_id values.

Metabolomics and genomics don't have this issue.

jimmyzhen commented 2 years ago

@cteng585, as we discussed today during the call, the timewise and training DEA results tables for the prot-pr or prot-ph assays are a good reference for determining the "most identifying" column. In a nutshell, anything that doesn't match the patterns of feature IDs in those DEA tables is not very useful in linking DEA results to phenotypic data.

Furthermore, the feature_id value would ideally be a unique feature ID in string data type.
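One way to enforce "a unique feature ID in string data type" is a small validation step. The Python sketch below is only an illustration: it assumes the DEA tables use RefSeq-style protein accessions such as NP_001001504.1, and both the regular expression and the function name `clean_feature_id` are hypothetical. The real pattern should be taken from the DEA results tables themselves.

```python
import re

# Assumed pattern: a RefSeq-style protein accession, e.g. NP_001001504.1.
REFSEQ_PROTEIN = re.compile(r"[ANXYW]P_\d+\.\d+")


def clean_feature_id(raw):
    """Return a single string feature ID, or None if the value is unusable."""
    if not isinstance(raw, str):
        # Rejects NaN, floats such as 353256.0, and list or tuple placeholders.
        return None
    candidate = raw.strip()
    # fullmatch rejects pipe-joined lists and free-text strings.
    if REFSEQ_PROTEIN.fullmatch(candidate):
        return candidate
    return None


examples = [
    "NP_001001504.1",
    float("nan"),
    353256.0,
    "NP_001001504.1|XP_006249164.1",
    "Q5U2Y1",
]
cleaned = [clean_feature_id(value) for value in examples]
```

Under this assumed pattern only the first example survives. A UniProt ID such as Q5U2Y1 is rejected here purely because of the pattern chosen for the sketch, not because it is an invalid identifier.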

jimmyzhen commented 2 years ago

@cteng585 I've taken a look at a selected sample of the revised prot-pr lookup tables. Great work! Thank you!

cteng585 commented 2 years ago

Update:

  1. Revised the file input to come from the analysis directory for genomics data. Per the pipelines team, the files in analysis should already exclude outlier data points and samples removed for other QC reasons.

  2. Due to Nicole's work on mapping feature IDs the consortium uses to more standardized features, the alternative_ids key has been removed from the JSON output. JSON output now only has two keys:

    • feature_id: a string identifying the specific feature ID
    • vial_labels: an array containing the vial_labels expressing a specific feature ID

  3. feature_id will be the protein_id column of the normalized-imputed-logratio table, as those are the IDs used as features for the DEA.

  4. Code for the mapping table script can be found in the bic-infrastructure-utilities repo.
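
For readers without access to that repo, the two-key output described in item 2 could look roughly like the Python sketch below. This is not the script from bic-infrastructure-utilities. The top-level layout (a JSON array of objects), the sorting, and the function names `to_records` and `dump_records` are assumptions made for illustration, and the IDs and vial labels are made up.

```python
import json


def to_records(mapping):
    """Turn {feature_id: [vial_label, ...]} into a list of two-key records."""
    return [
        {"feature_id": feature_id, "vial_labels": sorted(labels)}
        for feature_id, labels in sorted(mapping.items())
    ]


def dump_records(mapping):
    """Serialize the records as a JSON array (assumed top-level layout)."""
    return json.dumps(to_records(mapping), indent=2)


# Made-up input: one string feature ID per key, a list of vial labels per value.
example = {
    "NP_000002.1": ["90002"],
    "NP_000001.1": ["90003", "90001"],
}
output = dump_records(example)
```

Sorting the feature IDs and vial labels is a design choice in this sketch so that repeated runs produce byte-identical files, which makes diffs between lookup-table versions easier to review.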