Open jimmyzhen opened 3 years ago
@cteng585, I had the opportunity to review the `feature_id` to `vial_labels` lookup/mapping tables you created for the `prot-pr` and `prot-ph` tissues.
Did you combine the two assays into one table for a given tissue? If so, can you elaborate on the reason for doing so?
Furthermore, please validate the `feature_ids` values in these tables. Can you elaborate on why some of the `feature_ids` values contain instances of `NaN`, `353256.0` (a float type), `NP_001001504.1|XP_006249164.1|XP_006249165.1...`, or even strings that don't appear to be feature IDs?
Lastly, my understanding of our objective is that the mapping would be one unique `feature_id` associated with a list of `vial_labels`. So can you elaborate on your decision to define the `feature_ids` value as an array (or a tuple) type?
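A minimal sketch of the kind of check that could flag the problem values listed above. The function name, the category labels, and the accession pattern (a single `NP_`/`XP_` RefSeq-style ID) are assumptions for illustration, not part of the existing script:

```python
import math
import re

# Assumed pattern: one RefSeq-style protein accession such as NP_001001504.1.
# Adjust this to whatever ID format the tables are actually expected to hold.
SINGLE_ACCESSION = re.compile(r"^(NP|XP)_\d+\.\d+$")


def classify_feature_id(value):
    """Return a short label describing why a value is or is not a clean ID."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "missing"
    if not isinstance(value, str):
        return "not_a_string"
    if "|" in value:
        return "multiple_ids"
    if SINGLE_ACCESSION.match(value):
        return "ok"
    return "unrecognized"


# Values modeled on the cases mentioned in this comment.
examples = [
    float("nan"),
    353256.0,
    "NP_001001504.1|XP_006249164.1",
    "NP_001001504.1",
    "not a feature ID",
]
for value in examples:
    print(repr(value), "->", classify_feature_id(value))
```

Counting the labels per table would give a quick picture of how many entries need attention before the mapping is used.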
@jimmyzhen
Proteomics tables differ from metabolomics and genomics tables in that there appear to be multiple identifying columns, and I wasn't sure which column is the most "identifying". As a placeholder, the `feature_id` is currently an array type.
For example, a given feature might be identified by an NCBI Protein ID (NP_001001504.1), an NCBI Reference Sequence (XP_006249164.1), an Entrez ID (353256), or a UniProt ID (Q5U2Y1). This also doesn't consider the PTM ID that the data uses. Since I'm waiting for confirmation on what the "best" feature ID to use would be (i.e. what we want to allow users to search for), this array is a placeholder array of potential `feature_id` values.
Metabolomics and genomics don't have this issue.
@cteng585, as we discussed during today's call, the `timewise` and `training` DEA results tables for the `prot-pr` or `prot-ph` assays are a good reference for determining the "most identifying" column. In a nutshell, anything that doesn't match the patterns of feature IDs in those DEA tables is not very useful for linking DEA results to phenotypic data.
Furthermore, the `feature_id` value would ideally be a unique feature ID of `string` data type.
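As a rough illustration of the point above, candidate identifiers could be partitioned by whether they occur as feature IDs in the DEA tables. The function name is hypothetical, and the DEA ID set below is made up for the example rather than read from a real results file:

```python
def split_candidates(candidate_ids, dea_feature_ids):
    """Partition candidate IDs by whether they occur in the DEA results."""
    dea = set(dea_feature_ids)
    linkable, unlinkable = [], []
    for candidate in candidate_ids:
        # Only plain strings that appear in the DEA tables can be linked.
        if isinstance(candidate, str) and candidate in dea:
            linkable.append(candidate)
        else:
            unlinkable.append(candidate)
    return linkable, unlinkable


# Candidates modeled on the placeholder array discussed in this thread.
candidates = ["NP_001001504.1", "XP_006249164.1", 353256.0, "Q5U2Y1"]
# Illustrative only: pretend the DEA tables use this ID for the feature.
dea_ids = {"NP_001001504.1"}

linkable, unlinkable = split_candidates(candidates, dea_ids)
print("linkable:", linkable)
print("unlinkable:", unlinkable)
```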
@cteng585 I've taken a look at a selected sample of the revised `prot-pr` lookup tables. Great work! Thank you!
Update:
Revised file input to be from the `analysis` directory for genomics data. Per the pipelines team, files from `analysis` should already exclude outlier data points and samples removed for other QC reasons.
Due to Nicole's work on mapping the feature IDs the consortium uses to more standardized features, the `alternative_ids` key has been removed from the JSON output. JSON output now only has two keys:

- `feature_id`: a string identifying the specific feature ID
- `vial_labels`: an array containing the vial_labels expressing a specific feature ID

`feature_id` will be the `protein_id` column of the `normalized-imputed-logratio` table, as those are the IDs used as features for the DEA.
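A minimal sketch of how records with these two keys might be produced. The table layout is an assumption (tab-separated, one `protein_id` column, one column per vial label, `NA` marking a missing value), the vial labels are placeholders, and `build_mapping` is a hypothetical name, not the function in the actual script:

```python
import csv
import io
import json

# Toy stand-in for a normalized-imputed-logratio table (assumed layout).
TABLE = (
    "protein_id\tvial_A\tvial_B\tvial_C\n"
    "NP_001001504.1\t0.12\tNA\t-0.40\n"
    "XP_006249164.1\tNA\tNA\t0.75\n"
)


def build_mapping(handle, id_column="protein_id", missing=("", "NA")):
    """Return a list of {"feature_id", "vial_labels"} records."""
    reader = csv.DictReader(handle, delimiter="\t")
    records = []
    for row in reader:
        # A vial label counts as expressing the feature when its cell
        # holds a value other than one of the missing markers.
        labels = [
            column
            for column, value in row.items()
            if column != id_column and value not in missing
        ]
        records.append({"feature_id": row[id_column], "vial_labels": labels})
    return records


mapping = build_mapping(io.StringIO(TABLE))
print(json.dumps(mapping, indent=2))
```

In a fully imputed table there may be no missing cells at all, in which case every vial label column would be listed for every feature.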
Code for the mapping table script can be found in the `bic-infrastructure-utilities` repo.
To connect the PASS1B-06 DEA results data to the phenotypic data, we need to map each unique `feature_id` (present in the DEA results) to a list of associated `vial_label` values (present in the phenotypic data) for each of the tissues in a given assay.

The two-column mapping table would potentially consist of just the `feature_id` and `vial_labels` fields. For `prot-pr`, the needed data can be extracted from the `*-prot-pr_ratio-results.txt` result file produced by the pipeline.
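To illustrate how such a mapping table would connect the two data sets, here is a small sketch of a lookup from a DEA `feature_id` to phenotypic records. The function name, the phenotypic field names, and all vial labels are hypothetical placeholders:

```python
def phenotypes_for_feature(feature_id, mapping, phenotypes):
    """Return phenotypic records for the vials mapped to a feature_id."""
    by_feature = {rec["feature_id"]: rec["vial_labels"] for rec in mapping}
    return [
        phenotypes[label]
        for label in by_feature.get(feature_id, [])
        if label in phenotypes
    ]


# Toy mapping table with the two fields described above.
toy_mapping = [
    {"feature_id": "NP_001001504.1", "vial_labels": ["vial_A", "vial_C"]},
]
# Toy phenotypic data keyed by vial label (field names are assumptions).
toy_phenotypes = {
    "vial_A": {"vial_label": "vial_A", "group": "control"},
    "vial_B": {"vial_label": "vial_B", "group": "trained"},
    "vial_C": {"vial_label": "vial_C", "group": "trained"},
}

linked = phenotypes_for_feature("NP_001001504.1", toy_mapping, toy_phenotypes)
print(linked)
```

A `feature_id` that is absent from the mapping simply yields an empty list, which keeps the lookup safe to call for every row of a DEA results table.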