MoTrPAC / motrpac-frontend

User interface and Frontend for MoTrPAC Bioinformatics Center
MIT License

Create mapping table with feature_id and vial_label (PASS1B-06, rna-seq) #125

Open jimmyzhen opened 2 years ago

jimmyzhen commented 2 years ago

To connect the PASS1B-06 DEA results data to the phenotypic data, we need to map each unique feature_id (present in the DEA results) to the list of associated vial_label values (present in the phenotypic data) for each of the tissues in a given assay.

The two-column mapping table could consist of just the feature_id and vial_labels fields. For rna-seq, the needed data can be extracted from the *-rna-seq_normalized-log-cpm.txt result file produced by the pipeline.

cteng585 commented 2 years ago

For RNA-seq, the feature_id was taken as the ENSEMBL Gene ID.

The following process was initially taken to map unique feature_ids to vial labels:

  1. Each set of counts tables from the results directory was taken. All tables were used, not just the normalized-log-cpm table, since some feature_id to vial label mappings were missing from the normalized-log-cpm table.
  2. Any rows with all 0 or NA values were eliminated.
  3. The first row of the table was taken as the list of vial labels.
  4. The sparse table was converted to a list of JSON objects with two keys: feature_ids and viallabels. feature_ids corresponds to a list of any ID-like strings that refer to a particular feature. viallabels corresponds to a list of samples which show a particular feature.

The script used to do the above can be found in the infrastructure utilities repo.
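For illustration only, the four steps above can be sketched roughly as follows. This is not the script from the infrastructure utilities repo: the function names, the tab delimiter, and the rule that a sample "shows" a feature when its cell is neither 0 nor NA are all assumptions, and the IDs in the example are made up.

```python
import csv
import io
import json


def is_missing(value):
    """Assumption: empty, NA/NaN, and numeric zero all mean 'feature not shown'."""
    text = value.strip()
    if text == "" or text.upper() in ("NA", "NAN"):
        return True
    try:
        return float(text) == 0.0
    except ValueError:
        # Non-numeric, non-NA text is kept rather than silently dropped.
        return False


def table_to_records(table_text):
    """Convert one tab-delimited counts table into a list of JSON-style dicts."""
    rows = [row for row in csv.reader(io.StringIO(table_text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    vial_labels = header[1:]  # step 3: first row holds the vial labels
    records = []
    for row in body:
        feature_id, values = row[0], row[1:]
        shown = [
            label
            for label, value in zip(vial_labels, values)
            if not is_missing(value)
        ]
        if not shown:
            continue  # step 2: rows with all 0 or NA values are eliminated
        # step 4: one record per feature, using the two keys named above
        records.append({"feature_ids": [feature_id], "viallabels": shown})
    return records


# Hypothetical table: feature IDs and vial labels are invented for the example.
example = (
    "gene_id\tvial_a\tvial_b\tvial_c\n"
    "GENE_1\t1.5\t0\tNA\n"
    "GENE_2\t0\t0\tNA\n"
    "GENE_3\t2\t3\t4\n"
)
print(json.dumps(table_to_records(example), indent=2))
```

In this sketch GENE_2 is dropped entirely because every cell is 0 or NA, while GENE_1 keeps only the one vial label with a non-zero value.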

jimmyzhen commented 2 years ago

@cteng585, thank you for the feature_id to viallabels lookup/mapping tables for the rna-seq tissues!

Suggestions on the naming convention for the following keys in any subsequent tables you create for other assays:

Furthermore, my understanding is that, at least for rna-seq, the mapping would be one unique feature_id associated with a list of vial-labels. So can you elaborate on your decision of defining the feature_ids value as a list of any ID-like strings that refer to a particular feature? What is "any ID-like strings" referring to?

Lastly, since you used all tables in this regard, I wanna confirm that you accounted for potential duplicate feature IDs spanning across different tables.
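To make the duplicate concern concrete, here is a minimal sketch of one way to merge per-table mappings so that a feature ID appearing in several tables ends up with a single, de-duplicated list of vial labels. The function name and the dict-of-lists input shape are assumptions for illustration, not a description of the existing script.

```python
def merge_mappings(tables):
    """Merge several {feature_id: [vial_label, ...]} dicts into one.

    Vial labels are unioned per feature ID; first-seen order is preserved.
    """
    merged = {}
    for table in tables:
        for feature_id, labels in table.items():
            # A dict is used as an ordered set of labels for this feature.
            seen = merged.setdefault(feature_id, {})
            for label in labels:
                seen[label] = None
    return {feature_id: list(labels) for feature_id, labels in merged.items()}


# Hypothetical inputs: the same feature ID shows up in two different tables.
table_one = {"GENE_1": ["vial_a", "vial_b"]}
table_two = {"GENE_1": ["vial_b", "vial_c"], "GENE_2": ["vial_a"]}
print(merge_mappings([table_one, table_two]))
```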

jimmyzhen commented 2 years ago

@cteng585 I've taken a look at a selected sample of the revised rna-seq lookup tables. Great work! Thank you!

cteng585 commented 2 years ago

Update:

  1. Revised file input to be from the analysis directory for genomics data. Per the pipelines team, files from analysis should already exclude outlier data points and samples removed for other QC reasons.

  2. Due to Nicole's work on mapping the feature IDs the consortium uses to more standardized features, the alternative_ids key has been removed from the JSON output. The JSON output now has only two keys:

    • feature_id: a string identifying the specific feature ID
    • vial_labels: an array containing the vial_labels expressing a specific feature ID

  3. feature_id will be the first column of the normalized-log-cpm table, as there is not a well-labeled feature ID column.

  4. Code for the mapping table script can be found in the bic-infrastructure-utilities repo.
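As a rough sketch of the revised output shape (not the code in bic-infrastructure-utilities), the snippet below reads a tab-delimited normalized-log-cpm style table, takes the first column as feature_id regardless of its header, and emits records with the two keys listed above. The function name, the delimiter, and the rule that a vial label is kept when its cell is not NA or empty are assumptions; the IDs are invented.

```python
import csv
import io
import json


def build_lookup(table_text):
    """Return a list of {"feature_id": str, "vial_labels": [str, ...]} records."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    # The first column holds feature IDs even when its header is blank or unclear.
    vial_labels = header[1:]
    lookup = []
    for row in reader:
        if not row:
            continue
        present = [
            label
            for label, cell in zip(vial_labels, row[1:])
            if cell.strip() not in ("", "NA")  # assumption: NA/empty means absent
        ]
        if present:
            lookup.append({"feature_id": row[0], "vial_labels": present})
    return lookup


# Hypothetical table with an unlabeled first column.
example = (
    "\tvial_a\tvial_b\n"
    "GENE_1\t0.12\tNA\n"
    "GENE_2\tNA\tNA\n"
)
print(json.dumps(build_lookup(example), indent=2))
```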