Closed Aryllen closed 3 years ago
Question: Are these just tools used in the analysis?
Yes, some are, but some are not. The ID keys can probably be deprecated, unless they are referenced somewhere.
Question: Why are there metadata/covariate files that never made it out of staging?
Use these to create the metadata files.
Question: What's the purpose of the key mappings (CER, TCX)?
This was for their use. These can probably be deprecated.
In progress: Gathered information from covariate files for the individual metadata. Uploaded current version to staging. Need to add MayoBrainBank as an individualIdSource. Need to verify that I am using the correct value for individualId.
Question Proteomics: not using specimenIDs in the multispecimen file. Do we leave it as is, request them to use the correct IDs, or change them ourselves (with permission, of course)?
We should change the specimenIDs in the metadata and annotations to match those used in the data.
Proteomics assay file uploaded to staging.
Question GWAS covariates: This file has the first 10 eigenvectors from analysis. What should we do with this file? Trim it down to individualIDs and the vectors?
Just leave this one here, as is.
Question WGS metadata: while it makes sense to have specimenID since the information is from a specific specimen, it also makes sense/is intuitive to use the individualID. I have individualIDs for these, but no specimenID. Should we just use the individualID as the specimenID for these 'samples' or should we have only the individualID in the WGS metadata (no specimenID column)? I would probably vote for using the individualID as the specimenID and adding rows to the biospecimen metadata file to account for this data. I currently have this in place in the assay metadata. However, I would need to add the 'specimens' to the biospecimen metadata.
Question WGS libraryPreparationMethod: Kapa? Which Kapa? We don't have Kapa in the dictionary.
Need to ask them for a link to which one was used.
WGS metadata file uploaded to staging.
Question: Similar to WGS metadata, we don't have specimenID for snpArray. If I make these specimenIDs the same as the individualIDs, then it would look like duplicate rows in the biospecimen metadata, which is not helpful. Not sure what we should do here.
Use individualID as specimenID. Have a 'notes' column in the biospecimen metadata that shows which assay they were for.
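A minimal sketch of that suggestion, assuming the biospecimen metadata can be treated as a list of rows. The function name and the example IDs are hypothetical; only the column names individualID, specimenID, and notes come from the discussion above.

```python
def specimen_rows_from_individuals(individual_ids, assay):
    """Build biospecimen rows that reuse individualID as specimenID."""
    return [
        {"individualID": ind, "specimenID": ind, "notes": assay}
        for ind in individual_ids
    ]

# Hypothetical individual IDs, for illustration only.
wgs_rows = specimen_rows_from_individuals(["1001", "1002"], "wgs")
snp_rows = specimen_rows_from_individuals(["1001"], "snpArray")
biospecimen = wgs_rows + snp_rows

# Rows that share a specimenID stay distinguishable through (specimenID, notes).
keys = [(row["specimenID"], row["notes"]) for row in biospecimen]
has_duplicates = len(keys) != len(set(keys))
```

With the assay recorded in 'notes', an individual with both WGS and snpArray data gets two rows that no longer look like exact duplicates.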
snpArray metadata file uploaded to staging. Note that it has individualID as specimenID. The individuals included are from the covariate file, which excludes individuals that should not be removed from the data. I also put the platform as Illumina_Omni2pt5M, which seems correct based on the assay description and a google of that platform. However, they also mention '+ Exome' in the assay description that I am not sure about.
Question: We don't have Braak and CERAD. There are diagnoses in the data, but unsure if we should annotate with these. We can either stick with one method of determining/annotating diagnosis (via Braak or CERAD) or annotate with the information they give us.
Meagan did a first pass on annotating with diagnosis. Will need to double check. Use the information they gave since their diagnoses were based on Braak and CERAD.
These are the only two Synapse pages (related to this study) referenced in this paper, temporal cortex RNA seq and cerebellum RNA seq. These are fine to be split up like this and do not need to be moved.
Pulled path aging metadata out of MayoRNAseq metadata. Uploaded to new folder in staging.
Found 5 specimenIDs that are annotated on raw RNAseq data, but are not represented in the biospecimen metadata.
1203_CER, 132_CER, 132_TCX, 844_CER, 844_TCX
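A check like this can be sketched as a set difference between the specimenIDs annotated on the files and those in the biospecimen metadata. This is only an illustration, not the script that was actually used: the function name is hypothetical, and '1000_TCX' is a made-up ID standing in for a specimen that is present in both places.

```python
def missing_from_metadata(annotated_ids, metadata_ids):
    """Return annotated specimenIDs that have no row in the biospecimen metadata."""
    return sorted(set(annotated_ids) - set(metadata_ids))

# In practice these lists would come from the file annotations and the
# biospecimen metadata file; here they are hard-coded for illustration.
annotated = ["1203_CER", "132_CER", "132_TCX", "844_CER", "844_TCX", "1000_TCX"]
metadata = ["1000_TCX"]  # hypothetical ID present in both sources

missing = missing_from_metadata(annotated, metadata)
```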
Based on our previous conversation, I pulled the specimenIDs used in the multispecimen proteomics data and mapped these to the current specimenIDs. Uploaded mapping here. There's one specimen, b1_091, that has two mappings: b1_091_20a, b1_091_20b. I did check to see if this could be mapped to the proteomics metadata and it looks good, outside of the one specimen with two mappings. After getting approval for mapping, still need to:
Individual metadata
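For the specimen with two mappings, one way to flag such cases before applying a mapping is to group the pairs by the ID used in the data. A rough sketch, assuming the mapping is available as (data ID, metadata specimenID) pairs; the function name and the 'b1_000' pair are hypothetical.

```python
from collections import defaultdict

def ambiguous_mappings(pairs):
    """Return data IDs that map to more than one metadata specimenID."""
    targets = defaultdict(set)
    for data_id, metadata_id in pairs:
        targets[data_id].add(metadata_id)
    return {k: sorted(v) for k, v in targets.items() if len(v) > 1}

pairs = [
    ("b1_091", "b1_091_20a"),
    ("b1_091", "b1_091_20b"),
    ("b1_000", "b1_000_00"),  # hypothetical one-to-one pair, for illustration
]
ambiguous = ambiguous_mappings(pairs)
```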
Updated annotations on raw proteomics data (minus the new specimenID mapping as mentioned above). Still need to check over annotations on the other dataTypes and analysis proteomics data.
Files that would be deprecated: Proteomics
RNAseq
snpArray
WGS
Prep for call with Mayo group Draft slides made for call.
Annotations Updated subset of the annotations. Updating RNAseq bam file annotations crashed my RStudio instance. Will need to finish the rest.
Misc Need to make start of individual metadata for PathAging study.
Annotations Finished updating annotations on raw and processed data with some exceptions below.
Question Imputed files are password protected. People are not going to find the password via the Portal and it's pointless to have a password at all if the password is public.
Created and uploaded draft MayoRNAseq_PathAging individual metadata file.
Merged the MayoRNAseq and PathAging metadata files with permission from Mariet and Mette. Updated the slides with a more representative structure for MCADGS, which will be the study surfaced on the portal instead of the other three smaller sets of data. Waiting on approval from Nilufer to move forward on the changes discussed with Mariet.
Mariet gave an initial review of the metadata and had a lot of comments. Will go over.
Question: The individual metadata file has 'no cognitive impairment' for the controls to match the data dictionary terms. However, they would rather this be 'control' since they 'do not know the actual antemortem cognitive status'. I have been thinking about this myself and am wondering if we should have a diagnosis value specifically for control.
Question: Add BannerSunHealth as an individualIdSource?
Updated biospecimen, individual, and snpArray metadata files based on feedback from Mariet. Sent her info on the update and some questions.
Question: Proteomics multispecimen file not using specimenIDs. Specimen values are things like 'b1_081' and 'b1_02', but column names refer to them as 'mayo_b1_081_43' and 'mayo_b1_egis_02'. These can technically be parsed from the dataset by a clever data user. What do we want to do about this?
Leave as is.
Question: What is the extra data in the Staging folder (ADSP, FutureData)?
Mette to review.
Question: Good with breaking up the Analysis wiki here with 'Show References' as separate?
Looks good
Question: WGS wiki references covariate file. Remove reference or just reformat?
Remove reference.
Main bits left: annotations, updating wiki formatting, making everything live and deprecating old stuff after final approval.
@amapeters, when you get a chance, can you spot check the annotations and answer the questions in this comment?
- Renamed 'notes' column in biospecimen metadata to 'assay'.
Looks good
Annotation notes rnaSeq
- only added Dx for deceased patients (ones with ageDeath in individual metadata) for whom we had Dx.
The 5 individuals with missing age death should also be deceased since they are from the MayoBrainBank. Have you confirmed with Mayo that this information does not exist?
- added sex for subjects we had data for.
- updated other missing annotations that we were missing data for and fixed ones that didn't quite match dictionary terms.
- Question: What should we do with the QC data files? The exclusions are more than just sample swaps (should I add 'exclude' and 'excludeReason' -- would mean rechecking with Mariet/Nilufer?). They will currently show up in the metadata section. Probably should not deprecate these. TCX, CER
We should if possible combine these into the biospecimen metadata file as exclude and exclude reason. The files from samples coded as sex-mismatch were removed from the dataset early on in the study. Please ask if we should remove these specimenIDs from the biospecimen metadata file and references to the sex discrepancy in the wiki, or leave them in with the exclude reason notation
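Folding the QC exclusions into the biospecimen metadata could look roughly like the sketch below. It assumes the QC files can be reduced to a specimenID-to-reason lookup; the function name, the example specimenIDs, and the reason string are hypothetical, while the exclude and excludeReason column names are the ones proposed above.

```python
def add_exclusions(biospecimen_rows, exclusions):
    """Return copies of the rows with exclude / excludeReason filled in from QC."""
    merged = []
    for row in biospecimen_rows:
        reason = exclusions.get(row["specimenID"])
        merged.append(
            {**row, "exclude": reason is not None, "excludeReason": reason or ""}
        )
    return merged

# Hypothetical specimens and QC lookup, for illustration only.
rows = [{"specimenID": "S1_TCX"}, {"specimenID": "S2_CER"}]
qc = {"S2_CER": "sex mismatch"}
merged = add_exclusions(rows, qc)
```

Rows for excluded specimens stay in the file with the reason attached, which matches the 'leave them in with the exclude reason notation' option; dropping them instead would be a filter on the exclude column.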
snpArray
- Question: Imputed data would be considered experimentalData or analysis resourceType? They are currently analysis with analysisType of gene imputation.
resourceType = Analysis analysisType = gene imputation.
- Question: Plink files would be considered raw or processed dataSubtype? They are currently processed, but seem like they should be raw.
The raw data from Illumina SNP or gene expression microarrays are IDAT files. Plink files would therefore be considered processed.
- Added platform and filled out some missing annotations.
wgs
- Question: The files in the JointStudyAnalysis folder should be annotated with...? WGS_Harmonization? Both that and MayoRNAseq? Currently only has MayoRNAseq.
The variant calling from the WGS was done 3 ways.
- QC files should probably not be deprecated, either. Not completely sure what the annotations should be for these. Updated the annots as far as my knowledge went. Question: Annotations for QC files that won't be deprecated?
I recommend annotating these as: resourceType = Analysis analysisType = quality control assay = wgs
proteomics
- updated diagnosis and platform for analysis file
metadata
- added annotations
- added study annot rnaSeqReprocessing and WGS_Harmonization to relevant metadata
edit:
- updated annotations on files in Analysis folder, as well.
- added all wikis to portal
- added related studies (rnaSeqReprocessing, WGS_Harmonization)
- Question: I removed NCI from diagnosis on the study card since that's not true. However, should I add control? Additionally, are these supposed to be abbreviations?
Leave the control out, and the abbreviations as is.
Thanks, @amapeters! Email sent to Mariet and Nilufer regarding QC and missing ageDeath.
Regarding the missing ageDeath for 5 individuals:
Updates:
@amapeters, can you please:
- check that I didn't forget to deprecate something that should have been deprecated?
Do we need this file: https://www.synapse.org/#!Synapse:syn9782771
- look at this file and let me know if this should be considered dataType = 'metadata'? It seems reasonable, but it shows up in the metadata section on the portal and I am not sure that's desired.
We have been annotating dictionaries as metadata, which I think is appropriate. What I suggest adding is resourceType = analysis and analysisType = quality control. That way it will show up with queries for the analysis file it belongs to
- check out this file. It's metadata for mice in this analysis. I completely missed this file. Should this be transferred into the metadata, as well?
No, let's not do that. Let me know instead if I can help updating the wiki for the Human mouse comparative analysis so that it doesn't link to Synapse and makes it clear where the mouse data comes from (ie, this https://www.synapse.org/#!Synapse:syn8650958)
Made a few updates based on Mette's suggestions.
Closing since this is done.
Study folder: syn5550404
We can expand these checks to be more specific, or mark them off/remove them if they are not relevant.
Folder Structure
Metadata (within file) Checks for each metadata file:
Metadata (across files)
Annotations
Multispecimen Files Check that specimenIDs in files match IDs in metadata
- [ ] Proteomics -- not using specimenIDs in the file. Leave as is
Wikis
Clinical data
- [ ] Braak and CERAD are available on donors with postmortem tissue -- can ask, but may not have
- [ ] Permission to use Braak and CERAD to generate Dx (AD, NCI, Other) for data contributor -- using their diagnosis that was given, where 'control' = 'no cognitive impairment'
Access (Human)
Portal
- [ ] MODEL-AD data specific: There is a link on the experimental tool card(s) to the study -- Not a MODEL-AD study