MayoRNAseq - Githubissues

Aryllen commented 4 years ago

We can expand these checks to be more specific, or mark them off/remove them if they are not relevant.

Folder Structure

[x] Top 3 folders are Analysis, Data, Staging
[x] Top level folders within Data are based on DataType with the exception of Metadata
[x] Metadata folder is within Data
[x] Staging folder is clean (old data in a subfolder -- Archived)

Metadata (within file)

Checks for each metadata file:

file exists
file name follows schema
contents follow current template - deprecate old versions, if needed
no duplicate individualID/specimenID, as appropriate
follows data dictionary

[x] RNA seq
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] snpArray (GWAS)
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] WGS
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] Label free mass spec
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] individual
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate individualID
- [x] all individualIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] biospecimen
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all individualIDs in individual metadata
- [x] values follow data dictionary guidelines
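
The per-file checks above (duplicate IDs, IDs present in the biospecimen metadata, column names matching the template) could be automated along these lines. This is only a sketch, not the process actually used for this study: the function names, the list-of-dicts row format (what `csv.DictReader` yields), and the example rows are all assumptions for illustration.

```python
from collections import Counter


def duplicate_ids(rows, id_column):
    """Return the IDs that appear more than once in id_column, sorted."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def missing_ids(rows, id_column, reference_rows):
    """Return the IDs in rows that are absent from reference_rows, sorted."""
    known = {row[id_column] for row in reference_rows}
    return sorted({row[id_column] for row in rows} - known)


def unexpected_columns(rows, template_columns):
    """Return column names that are not part of the template, sorted."""
    seen = set()
    for row in rows:
        seen.update(row.keys())
    return sorted(seen - set(template_columns))


# Hypothetical rows, for illustration only.
biospecimen = [
    {"individualID": "132", "specimenID": "132_TCX"},
    {"individualID": "844", "specimenID": "844_TCX"},
]
assay_rows = [
    {"specimenID": "132_TCX", "assay": "rnaSeq"},
    {"specimenID": "132_TCX", "assay": "rnaSeq"},   # duplicate on purpose
    {"specimenID": "1203_CER", "assay": "rnaSeq"},  # not in biospecimen
]

dupes = duplicate_ids(assay_rows, "specimenID")
missing = missing_ids(assay_rows, "specimenID", biospecimen)
extra = unexpected_columns(assay_rows, ["specimenID", "assay"])
print(dupes, missing, extra)
```

The same `missing_ids` helper works in the other direction (individualIDs in the biospecimen file that are absent from the individual file) by swapping the arguments and the ID column.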

Metadata (across files)

[x] no duplicate individualID/specimenID for different individuals/specimens

Annotations

[x] WGS
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] snpArray (GWAS)
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] RNAseq
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] Label free proteomics
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary

Multispecimen Files

Check that specimenIDs in files match IDs in metadata

ADD multispecimen files here
~~- [ ] Proteomics -- not using specimenIDs in the file.~~ Leave as is

Wikis

[x] appear up to date
[x] are in correct location (on dataType folder)
[x] are referenced in portal Study table

Clinical data

~~- [ ] Braak and CERAD is available on donors with postmortem tissue -- can ask, but may not have~~
~~- [ ] Permission to use Braak and CERAD to generate Dx (AD, NCI, Other) for data contributor~~ -- using their diagnosis that was given, where 'control' = 'no cognitive impairment'

Access (Human)

Add any special access needs/fixes/checks here

Portal

[x] Review content on the study card for accuracy
[x] Review text formatting and 'Show More' section: ### for header, bold for sub-headers, Show More section broken up in a consistent manner on the card
[x] Related studies are linked
[x] Study has an acknowledgement statement (wikis here)
~~- [ ] MODEL-AD data specific: There is a link on the experimental tool card(s) to the study~~ Not a MODEL-AD study

Aryllen commented 3 years ago

Question: Are these just tools used in the analysis?

Yes, some are, but some are not. The ID keys can probably be deprecated, unless they are referenced somewhere.

Question: Why are there metadata/covariate files that never made it out of staging?

Use these to create the metadata files.

Question: What's the purpose of the key mappings (CER, TCX)?

This was for their use. These can probably be deprecated.

In progress: Gathered information from covariate files for the individual metadata. Uploaded current version to staging. Need to add MayoBrainBank as an individualIdSource. Need to verify that I am using the correct value for individualId.

Aryllen commented 3 years ago

Question Proteomics: not using specimenIDs in the multispecimen file. Do we leave it as is, request them to use the correct IDs, or change them ourselves (with permission, of course)?

We should change the specimenIDs in the metadata and annotations to match those used in the data.

Proteomics assay file uploaded to staging.

Question GWAS covariates: This file has the first 10 eigenvectors from analysis. What should we do with this file? Trim it down to individualIDs and the vectors?

Just leave this one here, as is.

Question WGS metadata: while it makes sense to have specimenID since the information is from a specific specimen, it also makes sense/is intuitive to use the individualID. I have individualIDs for these, but no specimenID. Should we just use the individualID as the specimenID for these 'samples' or should we have only the individualID in the WGS metadata (no specimenID column)? I would probably vote for using the individualID as the specimenID and adding rows to the biospecimen metadata file to account for this data. I currently have this in place in the assay metadata. However, I would need to add the 'specimens' to the biospecimen metadata.

Question WGS libraryPreparationMethod: Kapa? Which Kapa? We don't have Kapa in the dictionary.

Need to ask them for a link to which one was used.

WGS metadata file uploaded to staging.

Aryllen commented 3 years ago

Question: Similar to WGS metadata, we don't have specimenID for snpArray. If I make these specimenIDs the same as the individualIDs, then it would look like duplicate rows in the biospecimen metadata, which is not helpful. Not sure what we should do here.

Use individualID as specimenID. Have a 'notes' column in the biospecimen metadata that shows which assay they were for.
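
A minimal sketch of that suggestion, assuming one biospecimen row per individual and assay. The function name and the example individualIDs are hypothetical; only the column names (`individualID`, `specimenID`, `notes`) come from this thread.

```python
def biospecimen_rows(individual_ids, assay):
    """Build biospecimen rows that reuse each individualID as its specimenID.

    The 'notes' value records which assay the row was added for.
    """
    return [
        {"individualID": individual_id, "specimenID": individual_id, "notes": assay}
        for individual_id in individual_ids
    ]


# Hypothetical individualIDs, for illustration only.
rows = biospecimen_rows(["1001", "1002"], "wgs") + biospecimen_rows(["1001"], "snpArray")
for row in rows:
    print(row)
```

With this layout the 'notes' value is the only thing that distinguishes the WGS row from the snpArray row for the same individual.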

snpArray metadata file uploaded to staging. Note that it has individualID as specimenID. The individuals included are from the covariate file, which excludes individuals that should not be removed from the data. I also put the platform as Illumina_Omni2pt5M, which seems correct based on the assay description and a Google search for that platform. However, they also mention '+ Exome' in the assay description, which I am not sure about.

Question: We don't have Braak and CERAD. There are diagnoses in the data, but unsure if we should annotate with these. We can either stick with one method of determining/annotating diagnosis (via Braak or CERAD) or annotate with the information they give us.

Meagan did a first pass on annotating with diagnosis. Will need to double check. Use the information they gave since their diagnoses were based on Braak and CERAD.

Aryllen commented 3 years ago

These are the only two Synapse pages (related to this study) referenced in this paper, temporal cortex RNA seq and cerebellum RNA seq. These are fine to be split up like this and do not need to be moved.

Aryllen commented 3 years ago

Pulled path aging metadata out of MayoRNAseq metadata. Uploaded to new folder in staging.

Found 5 specimenIDs that are annotated on raw RNAseq data, but are not represented in the biospecimen metadata.

1203_CER, 132_CER, 132_TCX, 844_CER, 844_TCX

Based on our previous conversation, I pulled the specimenIDs used in the multispecimen proteomics data and mapped these to the current specimenIDs. Uploaded mapping here. There's one specimen, b1_091, that has two mappings: b1_091_20a, b1_091_20b. I did check to see if this could be mapped to the proteomics metadata and it looks good, outside of the one specimen with two mappings. After getting approval for mapping, still need to:

update specimenIDs in metadata
update specimenIDs in annotations
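
Before applying a mapping like this, it may help to flag any old ID that maps to more than one current specimenID, as with b1_091 above. The following is a sketch only; the function name and the (old ID, new ID) pair format are assumptions, and the third pair is a made-up example.

```python
from collections import defaultdict


def ambiguous_mappings(pairs):
    """Return {old_id: sorted new_ids} for old IDs that map to more than one new ID."""
    grouped = defaultdict(set)
    for old_id, new_id in pairs:
        grouped[old_id].add(new_id)
    return {old: sorted(new) for old, new in grouped.items() if len(new) > 1}


pairs = [
    ("b1_091", "b1_091_20a"),
    ("b1_091", "b1_091_20b"),
    ("b1_081", "b1_081_43"),  # hypothetical pair, for illustration only
]

conflicts = ambiguous_mappings(pairs)
print(conflicts)
```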

Individual metadata

changed F and M sex to female, male
changed diagnosis AD to Alzheimer Disease, control to no cognitive impairment, and PSP to progressive supranuclear palsy. Need to add term for pathologic aging.
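
The recoding described above amounts to two lookup tables. This is a sketch under the assumption that the columns are named `sex` and `diagnosis`; the `recode` helper and the example rows are hypothetical. Values without a mapping (such as pathologic aging, which has no term yet) are passed through unchanged.

```python
SEX_TERMS = {"F": "female", "M": "male"}
DIAGNOSIS_TERMS = {
    "AD": "Alzheimer Disease",
    "control": "no cognitive impairment",
    "PSP": "progressive supranuclear palsy",
}


def recode(row):
    """Return a copy of row with sex and diagnosis mapped to dictionary terms.

    Values that have no mapping are left as they are.
    """
    updated = dict(row)
    updated["sex"] = SEX_TERMS.get(row["sex"], row["sex"])
    updated["diagnosis"] = DIAGNOSIS_TERMS.get(row["diagnosis"], row["diagnosis"])
    return updated


# Hypothetical rows, for illustration only.
recoded = recode({"individualID": "132", "sex": "F", "diagnosis": "AD"})
unmapped = recode({"individualID": "844", "sex": "M", "diagnosis": "Pathologic Aging"})
print(recoded)
print(unmapped)
```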

Updated annotations on raw proteomics data (minus the new specimenID mapping as mentioned above). Still need to check over annotations on the other dataTypes and analysis proteomics data.

Aryllen commented 3 years ago

Files that would be deprecated:

Proteomics

RNAseq

snpArray

idkey

WGS

covariates

Prep for call with Mayo group

Draft slides made for call.

Annotations

Updated subset of the annotations. Updating RNAseq bam file annotations crashed my RStudio instance. Will need to finish the rest.

Misc

Need to start the individual metadata for the PathAging study.

Aryllen commented 3 years ago

Annotations

Finished updating annotations on raw and processed data, with some exceptions below.

There are some files I did not update annotations on, which are things like the QC and covariate files. The covariate files should be deprecated eventually. I think we should discuss what to do about the QC files (example).
- Answer: Need to figure out what we should annotate these with. Are they metadata? If so, what type?
snpArray: Not positive that the platform is correct. Will need confirmation from Mayo team first.
snpArray: variantIdkey files -- dataSubtype = processed? (example)
WGS: what would the QC files be considered? What should we do with them?
- Answer: Need to figure out what we should annotate these with. Are they metadata? If so, what type?
WGS: variant calling files fall hard on the line of being both analysis and experimentalData. We have an analysisType of 'variant calling', which makes me lean more towards analysis. Need to decide for sure what we want these to be.
- Yes, we should stick with analysis and analysisType for this type of data.
analysis folder: would prefer to do these annotations after we clean up the structure.

Question: Imputed files are password protected. People are not going to find the password via the Portal, and it's pointless to have a password at all if the password is public.

Answer: See how long it would take to download, unlock, and upload as new version. They are relatively small-ish files.
Solution: Don't do anything. Found that it says they are password protected, but they aren't.

Aryllen commented 3 years ago

Created and uploaded draft MayoRNAseq_PathAging individual metadata file.

Aryllen commented 3 years ago

Merged the MayoRNAseq and PathAging metadata files with permission from Mariet and Mette. Updated the slides with a more representative structure for MCADGS, which will be the study surfaced on the portal instead of the other three smaller sets of data. Waiting on approval from Nilufer to move forward on the changes discussed with Mariet.

Aryllen commented 3 years ago

Updated folder structure for Analysis data. Still need to add/format the top level wikis for the portal.
Went through Data wikis and replaced all synIDs with links to the file (per Mariet's request) that show the name of the file (better for the portal).

Mariet gave an initial review of the metadata and had a lot of comments. Will go over.

Aryllen commented 3 years ago

Question: The individual metadata file has 'no cognitive impairment' for the controls to match the data dictionary terms. However, they would rather this be 'control' since they 'do not know the actual antemortem cognitive status'. I have been thinking about this myself and am wondering if we should have a diagnosis value specifically for control.

Yes, add a term, but what is the definition?

Question: Add BannerSunHealth as an individualIdSource?

No. Mariet agrees that BannerSun is fine.

Updated biospecimen, individual, and snpArray metadata files based on feedback from Mariet. Sent her info on the update and some questions.

Aryllen commented 3 years ago

Question: Proteomics multispecimen file not using specimenIDs. Specimen values are things like 'b1_081' and 'b1_02', but column names refer to them as 'mayo_b1_081_43' and 'mayo_b1_egis_02'. These can technically be parsed from the dataset by a clever data user. What do we want to do about this?

Leave as is.
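
For a data user who does want to recover the specimen values from the column names, a regular expression along these lines might work. This is a guess based only on the two examples quoted above ('mayo_b1_081_43' and 'mayo_b1_egis_02'); the pattern and the function name are assumptions, and other column names in the real file may not follow either shape.

```python
import re

# Two shapes inferred from the examples in this thread:
#   mayo_<batch>_<number>_<suffix>  ->  <batch>_<number>
#   mayo_<batch>_egis_<number>      ->  <batch>_<number>
COLUMN_PATTERN = re.compile(r"^mayo_(b\d+)_(?:egis_(\d+)|(\d+)_\w+)$")


def specimen_value(column_name):
    """Return the specimen value embedded in a column name, or None if it does not match."""
    match = COLUMN_PATTERN.match(column_name)
    if match is None:
        return None
    batch, egis_number, plain_number = match.groups()
    return f"{batch}_{egis_number or plain_number}"


for name in ["mayo_b1_081_43", "mayo_b1_egis_02", "some_other_column"]:
    print(name, "->", specimen_value(name))
```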

Question: What is the extra data in the Staging folder (ADSP, FutureData)?

Mette to review.

Question: Good with breaking up the Analysis wiki here with 'Show References' as separate?

Looks good

Question: WGS wiki references covariate file. Remove reference or just reformat?

Remove reference.

Main bits left: annotations, updating wiki formatting, making everything live, and deprecating old stuff after final approval.

Aryllen commented 3 years ago

@amapeters, when you get a chance, can you spot check the annotations and answer the questions in this comment?

Renamed 'notes' column in biospecimen metadata to 'assay'.

Annotation notes

rnaSeq

only added Dx for deceased patients (ones with ageDeath in individual metadata) for whom we had Dx.
added sex for subjects we had data for.
updated other missing annotations that we were missing data for and fixed ones that didn't quite match dictionary terms.
Question: What should we do with the QC data files? The exclusions are more than just sample swaps (should I add 'exclude' and 'excludeReason' -- would mean rechecking with Mariet/Nilufer?). They will currently show up in the metadata section. Probably should not deprecate these. TCX, CER

snpArray

Question: Imputed data would be considered experimentalData or analysis resourceType? They are currently analysis with analysisType of gene imputation.
Question: Plink files would be considered raw or processed dataSubtype? They are currently processed, but seem like they should be raw.
Added platform and filled out some missing annotations.

wgs

Question: The files in the JointStudyAnalysis folder should be annotated with...? WGS_Harmonization? Both that and MayoRNAseq? Currently only has MayoRNAseq.
QC files should probably not be deprecated, either. Not completely sure what the annotations should be for these. Updated the annots as far as my knowledge went. Question: Annotations for QC files that won't be deprecated?

proteomics

updated diagnosis and platform for analysis file

metadata

added annotations
added study annot rnaSeqReprocessing and WGS_Harmonization to relevant metadata

edit:

updated annotations on files in Analysis folder, as well.
added all wikis to portal
added related studies (rnaSeqReprocessing, WGS_Harmonization)
Question: I removed NCI from diagnosis on the study card since that's not true. However, should I add control? Additionally, are these supposed to be abbreviations?

amapeters commented 3 years ago

> @amapeters, when you get a chance, can you spot check the annotations and answer the questions in this comment?

> Renamed 'notes' column in biospecimen metadata to 'assay'.

Looks good

> Annotation notes

> rnaSeq

> only added Dx for deceased patients (ones with ageDeath in individual metadata) for whom we had Dx.

The 5 individuals with missing ageDeath should also be deceased since they are from the MayoBrainBank. Have you confirmed with Mayo that this information does not exist?

> added sex for subjects we had data for.

> updated other missing annotations that we were missing data for and fixed ones that didn't quite match dictionary terms.

> Question: What should we do with the QC data files? The exclusions are more than just sample swaps (should I add 'exclude' and 'excludeReason' -- would mean rechecking with Mariet/Nilufer?). They will currently show up in the metadata section. Probably should not deprecate these. TCX, CER

We should, if possible, combine these into the biospecimen metadata file as exclude and excludeReason. The files from samples coded as sex-mismatch were removed from the dataset early on in the study. Please ask if we should remove these specimenIDs from the biospecimen metadata file and references to the sex discrepancy in the wiki, or leave them in with the exclude reason notation.

> snpArray

> Question: Imputed data would be considered experimentalData or analysis resourceType? They are currently analysis with analysisType of gene imputation.

resourceType = Analysis, analysisType = gene imputation.

> Question: Plink files would be considered raw or processed dataSubtype? They are currently processed, but seem like they should be raw.

The raw data from Illumina SNP or gene expression microarrays are IDAT files. Plink files would therefore be considered processed.

> Added platform and filled out some missing annotations.

> wgs

> Question: The files in the JointStudyAnalysis folder should be annotated with...? WGS_Harmonization? Both that and MayoRNAseq? Currently only has MayoRNAseq.

The variant calling from the WGS was done 3 ways.

Within each study: https://www.synapse.org/#!Synapse:syn11724002
Across all 3 studies where the files were divided up into one set per study: https://www.synapse.org/#!Synapse:syn11707308
Across all 3 studies where the files are combined: https://www.synapse.org/#!Synapse:syn11707420

> QC files should probably not be deprecated, either. Not completely sure what the annotations should be for these. Updated the annots as far as my knowledge went. Question: Annotations for QC files that won't be deprecated?

I recommend annotating these as: resourceType = Analysis, analysisType = quality control, assay = wgs

> proteomics

> updated diagnosis and platform for analysis file

> metadata

> added annotations

> added study annot rnaSeqReprocessing and WGS_Harmonization to relevant metadata

> edit:

> updated annotations on files in Analysis folder, as well.

> added all wikis to portal

> added related studies (rnaSeqReprocessing, WGS_Harmonization)

> Question: I removed NCI from diagnosis on the study card since that's not true. However, should I add control? Additionally, are these supposed to be abbreviations?

Leave the control out, and the abbreviations as is.

Aryllen commented 3 years ago

Thanks, @amapeters! Email sent to Mariet and Nilufer regarding QC and missing ageDeath.

Aryllen commented 3 years ago

Regarding the missing ageDeath for 5 individuals:

The missing ageDeath values for these MayoRNAseq individuals are intentional. They don't want people to be using data from these individuals because they did not fit their criteria, so they removed the information in the metadata to discourage this. Specimens from these individuals do appear in the QC files. Once I add the QC info to the biospecimen metadata, this should explain why the individuals would not be used.

Updates:

Used QC files to generate 'exclude' and 'excludeReason' in biospecimen metadata. The QC files have 'QC Type' and a reason. I combined these for the 'excludeReason' to be of the format: (QC Type) - reason. Sent an update email to Mariet, who replied that it 'sounds good'.
Reviewed annotations regarding questions that were answered above.
Released metadata files.
Moved covariate and QC files to deprecated folder here.
Changed wiki references to deprecated files to references to the metadata.
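
The 'exclude' / 'excludeReason' step from the first update could look roughly like this. It is a sketch, not the code that was run: the function name, the `reason` key, the boolean `exclude` value, and the example records are assumptions. It also reads the format "(QC Type) - reason" literally, with the QC type in parentheses.

```python
def add_exclusions(biospecimen_rows, qc_records):
    """Return copies of the biospecimen rows with exclude and excludeReason filled in.

    Specimens that have a QC record are marked as excluded; all others are not.
    """
    qc_by_specimen = {record["specimenID"]: record for record in qc_records}
    updated = []
    for row in biospecimen_rows:
        new_row = dict(row, exclude=False, excludeReason="")
        record = qc_by_specimen.get(row["specimenID"])
        if record is not None:
            new_row["exclude"] = True
            # Assumed literal format: "(QC Type) - reason"
            new_row["excludeReason"] = f"({record['QC Type']}) - {record['reason']}"
        updated.append(new_row)
    return updated


# Hypothetical records, for illustration only.
biospecimen = [{"specimenID": "132_TCX"}, {"specimenID": "844_TCX"}]
qc = [{"specimenID": "132_TCX", "QC Type": "RNAseq QC", "reason": "sex mismatch"}]

result = add_exclusions(biospecimen, qc)
for row in result:
    print(row)
```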

@amapeters, can you please:

check that I didn't forget to deprecate something that should have been deprecated?
look at this file and let me know if this should be considered dataType = 'metadata'? It seems reasonable, but it shows up in the metadata section on the portal and I am not sure that's desired.
check out this file. It's metadata for mice in this analysis. I completely missed this file. Should this be transferred into the metadata, as well?

amapeters commented 3 years ago

> @amapeters, can you please:

> check that I didn't forget to deprecate something that should have been deprecated?

Do we need this file: https://www.synapse.org/#!Synapse:syn9782771

> look at this file and let me know if this should be considered dataType = 'metadata'? It seems reasonable, but it shows up in the metadata section on the portal and I am not sure that's desired.

We have been annotating dictionaries as metadata, which I think is appropriate. What I suggest adding is resourceType = analysis and analysisType = quality control. That way it will show up with queries for the analysis file it belongs to.

> check out this file. It's metadata for mice in this analysis. I completely missed this file. Should this be transferred into the metadata, as well?

No, let's not do that. Let me know instead if I can help updating the wiki for the Human mouse comparative analysis so that it doesn't link to Synapse and makes it clear where the mouse data comes from (i.e., this: https://www.synapse.org/#!Synapse:syn8650958)

Aryllen commented 3 years ago

moved missed file to deprecated
updated QC data dictionary annotations as mentioned previously

Aryllen commented 3 years ago

Made a few updates based on Mette's suggestions.

Closing since this is done.

Sage-Bionetworks / cleanAD

MayoRNAseq #3