Sage-Bionetworks / cleanAD

Tools for cleaning and organizing study data for the AD Knowledge Portal.

MayoRNAseq #3

Closed Aryllen closed 3 years ago

Aryllen commented 4 years ago

Study folder: syn5550404

We can expand these checks to be more specific, or mark them off/remove them if they are not relevant.

Folder Structure

Metadata (within file)

Checks for each metadata file:

  • file exists
  • file name follows schema
  • contents follow current template - deprecate old versions, if needed
  • no duplicate individualID/specimenID as appropriate
  • follows data dictionary
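The duplicate-ID check in the list above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the cleanAD implementation; the function name `find_duplicate_ids` and the sample rows are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(rows, id_column):
    """Return the sorted IDs that appear more than once in id_column.

    rows is a list of dicts, one per metadata row (hypothetical layout).
    Empty or missing IDs are ignored rather than counted as duplicates.
    """
    counts = Counter(row[id_column] for row in rows if row.get(id_column))
    return sorted(id_ for id_, n in counts.items() if n > 1)


# Hypothetical biospecimen rows, for illustration only.
biospecimen_rows = [
    {"individualID": "1001", "specimenID": "1001_TCX"},
    {"individualID": "1001", "specimenID": "1001_CER"},
    {"individualID": "1002", "specimenID": "1002_TCX"},
    {"individualID": "1002", "specimenID": "1002_TCX"},
]

duplicate_specimens = find_duplicate_ids(biospecimen_rows, "specimenID")
duplicate_individuals = find_duplicate_ids(biospecimen_rows, "individualID")
```

Repeated individualIDs are expected in a biospecimen file (one individual, several specimens), which is why the check is applied "as appropriate" per file type.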

Metadata (across files)

Annotations

Multispecimen Files

Check that specimenIDs in files match IDs in metadata
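One way to picture this check is a two-way set comparison between the IDs found in a multispecimen file and the IDs in the metadata. The sketch below assumes both ID lists have already been extracted; `compare_specimen_ids` and the example IDs are hypothetical.

```python
def compare_specimen_ids(file_ids, metadata_ids):
    """Report specimenIDs that appear on only one side of the comparison."""
    file_set, meta_set = set(file_ids), set(metadata_ids)
    return {
        "missing_from_metadata": sorted(file_set - meta_set),
        "missing_from_file": sorted(meta_set - file_set),
    }


# Hypothetical IDs, for illustration only.
ids_in_file = ["1001_TCX", "1002_TCX", "1003_TCX"]
ids_in_metadata = ["1001_TCX", "1002_TCX", "1004_TCX"]

mismatches = compare_specimen_ids(ids_in_file, ids_in_metadata)
```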

Wikis

Clinical data

- [ ] Braak and CERAD are available on donors with postmortem tissue -- can ask, but may not have
- [ ] Permission to use Braak and CERAD to generate Dx (AD, NCI, Other) for data contributor -- using their diagnosis that was given, where 'control' = 'no cognitive impairment'

Access (Human)

Portal

Aryllen commented 3 years ago

Question: Are these just tools used in the analysis?

Yes, some are, but some are not. The ID keys can probably be deprecated, unless they are referenced somewhere.

Question: Why are there metadata/covariate files that never made it out of staging?

Use these to create the metadata files.

Question: What's the purpose of the key mappings (CER, TCX)?

This was for their use. These can probably be deprecated.

In progress: Gathered information from covariate files for the individual metadata. Uploaded current version to staging. Need to add MayoBrainBank as an individualIdSource. Need to verify that I am using the correct value for individualId.

Aryllen commented 3 years ago

Question Proteomics: not using specimenIDs in the multispecimen file. Do we leave it as is, request them to use the correct IDs, or change them ourselves (with permission, of course)?

We should change the specimenIDs in the metadata and annotations to match those used in the data.

Proteomics assay file uploaded to staging.

Question GWAS covariates: This file has the first 10 eigenvectors from analysis. What should we do with this file? Trim it down to individualIDs and the vectors?

Just leave this one here, as is.

Question WGS metadata: while it makes sense to have specimenID since the information is from a specific specimen, it also makes sense/is intuitive to use the individualID. I have individualIDs for these, but no specimenID. Should we just use the individualID as the specimenID for these 'samples' or should we have only the individualID in the WGS metadata (no specimenID column)? I would probably vote for using the individualID as the specimenID and adding rows to the biospecimen metadata file to account for this data. I currently have this in place in the assay metadata. However, I would need to add the 'specimens' to the biospecimen metadata.

Question WGS libraryPreparationMethod: Kapa? Which Kapa? We don't have Kapa in the dictionary.

Need to ask them for a link to which one was used.

WGS metadata file uploaded to staging.

Aryllen commented 3 years ago

Question: Similar to WGS metadata, we don't have specimenID for snpArray. If I make these specimenIDs the same as the individualIDs, then it would look like duplicate rows in the biospecimen metadata, which is not helpful. Not sure what we should do here.

Use individualID as specimenID. Have a 'notes' column in the biospecimen metadata that shows which assay they were for.
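A rough sketch of that approach, assuming the biospecimen metadata is a list of row dicts: each individual gets a row where specimenID equals individualID, and the 'notes' column records the assay. The helper name and the IDs are hypothetical.

```python
def specimens_from_individuals(individual_ids, assay):
    """Build biospecimen rows that reuse individualID as specimenID.

    The 'notes' value records which assay the row was added for, so rows
    with the same IDs can still be told apart.
    """
    return [
        {"individualID": ind, "specimenID": ind, "notes": assay}
        for ind in individual_ids
    ]


# Hypothetical individuals, for illustration only.
new_rows = (
    specimens_from_individuals(["1001", "1002"], "wgs")
    + specimens_from_individuals(["1001"], "snpArray")
)
```

Here individual 1001 ends up with two rows that share a specimenID, distinguished only by the 'notes' value.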

snpArray metadata file uploaded to staging. Note that it has individualID as specimenID. The individuals included are from the covariate file, which excludes individuals that should not be removed from the data. I also put the platform as Illumina_Omni2pt5M, which seems correct based on the assay description and a Google search for that platform. However, they also mention '+ Exome' in the assay description, which I am not sure about.

Question: We don't have Braak and CERAD. There are diagnoses in the data, but unsure if we should annotate with these. We can either stick with one method of determining/annotating diagnosis (via Braak or CERAD) or annotate with the information they give us.

Meagan did a first pass on annotating with diagnosis. Will need to double check. Use the information they gave since their diagnoses were based on Braak and CERAD.

Aryllen commented 3 years ago

These are the only two Synapse pages (related to this study) referenced in this paper, temporal cortex RNA seq and cerebellum RNA seq. These are fine to be split up like this and do not need to be moved.

Aryllen commented 3 years ago

Pulled path aging metadata out of MayoRNAseq metadata. Uploaded to new folder in staging.

Found 5 specimenIDs that are annotated on raw RNAseq data, but are not represented in the biospecimen metadata.

1203_CER, 132_CER, 132_TCX, 844_CER, 844_TCX

Based on our previous conversation, I pulled the specimenIDs used in the multispecimen proteomics data and mapped these to the current specimenIDs. Uploaded mapping here. There's one specimen, b1_091, that has two mappings: b1_091_20a, b1_091_20b. I did check to see if this could be mapped to the proteomics metadata and it looks good, outside of the one specimen with two mappings. After getting approval for mapping, still need to:

Individual metadata

Updated annotations on raw proteomics data (minus the new specimenID mapping as mentioned above). Still need to check over annotations on the other dataTypes and analysis proteomics data.
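The check for specimens with more than one mapping, as with b1_091 above, could be sketched like this. The function name is hypothetical, and the `b1_000` pair is a made-up example of an unambiguous mapping; only the b1_091 pairs come from this thread.

```python
from collections import defaultdict


def find_ambiguous_mappings(pairs):
    """Return source IDs that map to more than one target ID.

    pairs is an iterable of (proteomics_id, current_specimen_id) tuples.
    """
    targets = defaultdict(set)
    for source_id, target_id in pairs:
        targets[source_id].add(target_id)
    return {s: sorted(t) for s, t in targets.items() if len(t) > 1}


mapping_pairs = [
    ("b1_091", "b1_091_20a"),
    ("b1_091", "b1_091_20b"),
    ("b1_000", "b1_000_01"),  # hypothetical one-to-one mapping
]

ambiguous = find_ambiguous_mappings(mapping_pairs)
```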

Aryllen commented 3 years ago

Files that would be deprecated: Proteomics

RNAseq

snpArray

WGS

Prep for call with Mayo group: Draft slides made for call.

Annotations: Updated subset of the annotations. Updating RNAseq bam file annotations crashed my RStudio instance. Will need to finish the rest.

Misc: Need to make start of individual metadata for PathAging study.

Aryllen commented 3 years ago

Annotations: Finished updating annotations on raw and processed data with some exceptions below.

Question: Imputed files are password protected. People are not going to find the password via the Portal and it's pointless to have a password at all if the password is public.

Aryllen commented 3 years ago

Created and uploaded draft MayoRNAseq_PathAging individual metadata file.

Aryllen commented 3 years ago

Merged the MayoRNAseq and PathAging metadata files with permission from Mariet and Mette. Updated the slides with a more representative structure for MCADGS, which will be the study surfaced on the portal instead of the other three smaller sets of data. Waiting on approval from Nilufer to move forward on the changes discussed with Mariet.

Aryllen commented 3 years ago

Mariet gave an initial review of the metadata and had a lot of comments. Will go over.

Aryllen commented 3 years ago

Question: The individual metadata file has 'no cognitive impairment' for the controls to match the data dictionary terms. However, they would rather this be 'control' since they 'do not know the actual antemortem cognitive status'. I have been thinking about this myself and am wondering if we should have a diagnosis value specifically for control.

Question: Add BannerSunHealth as an individualIdSource?

Updated biospecimen, individual, and snpArray metadata files based on feedback from Mariet. Sent her info on the update and some questions.

Aryllen commented 3 years ago

Question: Proteomics multispecimen file not using specimenIDs. Specimen values are things like 'b1_081' and 'b1_02', but column names refer to them as 'mayo_b1_081_43' and 'mayo_b1_egis_02'. These can technically be parsed from the dataset by a clever data user. What do we want to do about this?

Leave as is.
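For data users who do want to recover the specimen values from the column names, the parsing might look something like the sketch below. The two patterns are inferred only from the two examples quoted in the question ('mayo_b1_081_43' and 'mayo_b1_egis_02'), so they are an assumption and may not cover every column in the file.

```python
import re

# Assumed column-name layouts, inferred from the two examples in this thread.
SAMPLE_PATTERN = re.compile(r"^mayo_(b\d+)_(\d+)_\d+$")   # e.g. mayo_b1_081_43
EGIS_PATTERN = re.compile(r"^mayo_(b\d+)_egis_(\d+)$")    # e.g. mayo_b1_egis_02


def column_to_specimen(column_name):
    """Return the specimen value embedded in a column name, or None."""
    for pattern in (SAMPLE_PATTERN, EGIS_PATTERN):
        match = pattern.match(column_name)
        if match:
            return f"{match.group(1)}_{match.group(2)}"
    return None


parsed = {
    name: column_to_specimen(name)
    for name in ["mayo_b1_081_43", "mayo_b1_egis_02", "unrelated_column"]
}
```

Returning None for unrecognized names keeps non-sample columns from being silently mapped to a specimen.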

Question: What is the extra data in the Staging folder (ADSP, FutureData)?

Mette to review.

Question: Good with breaking up the Analysis wiki here with 'Show References' as separate?

Looks good

Question: WGS wiki references covariate file. Remove reference or just reformat?

Remove reference.

Main bits left: annotations, updating wiki formatting, making everything live and deprecating old stuff after final approval.

Aryllen commented 3 years ago

@amapeters, when you get a chance, can you spot check the annotations and answer the questions in this comment?

Annotation notes rnaSeq

snpArray

wgs

proteomics

metadata

edit:

amapeters commented 3 years ago

@amapeters, when you get a chance, can you spot check the annotations and answer the questions in this comment?

  • Renamed 'notes' column in biospecimen metadata to 'assay'.

Looks good

Annotation notes rnaSeq

  • only added Dx for deceased patients (ones with ageDeath in individual metadata) for whom we had Dx.

The 5 individuals with missing ageDeath should also be deceased since they are from the MayoBrainBank. Have you confirmed with Mayo that this information does not exist?

  • added sex for subjects we had data for.
  • updated other missing annotations that we were missing data for and fixed ones that didn't quite match dictionary terms.
  • Question: What should we do with the QC data files? The exclusions are more than just sample swaps (should I add 'exclude' and 'excludeReason' -- would mean rechecking with Mariet/Nilufer?). They will currently show up in the metadata section. Probably should not deprecate these. TCX, CER

We should, if possible, combine these into the biospecimen metadata file as exclude and exclude reason. The files from samples coded as sex-mismatch were removed from the dataset early on in the study. Please ask if we should remove these specimenIDs from the biospecimen metadata file and the references to the sex discrepancy in the wiki, or leave them in with the exclude reason notation.
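Folding the QC exclusions into the biospecimen metadata could look roughly like this. The sketch assumes the QC files have already been reduced to a specimenID-to-reason lookup; the function name, IDs, and reasons are hypothetical, and the column names follow the 'exclude' and 'excludeReason' suggested in the question.

```python
def add_exclusions(biospecimen_rows, exclusions):
    """Return new rows with 'exclude' and 'excludeReason' columns added.

    exclusions maps specimenID -> reason text. Specimens not listed are
    marked exclude=False with an empty reason.
    """
    merged = []
    for row in biospecimen_rows:
        reason = exclusions.get(row["specimenID"])
        merged.append(
            {**row, "exclude": reason is not None, "excludeReason": reason or ""}
        )
    return merged


# Hypothetical rows and QC exclusions, for illustration only.
rows = [{"specimenID": "1001_TCX"}, {"specimenID": "1002_TCX"}]
qc_exclusions = {"1002_TCX": "sex mismatch"}

rows_with_exclusions = add_exclusions(rows, qc_exclusions)
```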

snpArray

  • Question: Imputed data would be considered experimentalData or analysis resourceType? They are currently analysis with analysisType of gene imputation.

resourceType = Analysis, analysisType = gene imputation.

  • Question: Plink files would be considered raw or processed dataSubtype? They are currently processed, but seem like they should be raw.

The raw data from Illumina SNP or gene expression microarrays are IDAT files. Plink files would therefore be considered processed.

  • Added platform and filled out some missing annotations.

wgs

  • Question: The files in the JointStudyAnalysis folder should be annotated with...? WGS_Harmonization? Both that and MayoRNAseq? Currently only has MayoRNAseq.

The variant calling from the WGS was done 3 ways.

  • QC files should probably not be deprecated, either. Not completely sure what the annotations should be for these. Updated the annots as far as my knowledge went. Question: Annotations for QC files that won't be deprecated?

I recommend annotating these as: resourceType = Analysis, analysisType = quality control, assay = wgs

proteomics

  • updated diagnosis and platform for analysis file

metadata

  • added annotations
  • added study annot rnaSeqReprocessing and WGS_Harmonization to relevant metadata

edit:

  • updated annotations on files in Analysis folder, as well.
  • added all wikis to portal
  • added related studies (rnaSeqReprocessing, WGS_Harmonization)
  • Question: I removed NCI from diagnosis on the study card since that's not true. However, should I add control? Additionally, are these supposed to be abbreviations?

Leave the control out, and the abbreviations as is.

Aryllen commented 3 years ago

Thanks, @amapeters! Email sent to Mariet and Nilufer regarding QC and missing ageDeath.

Aryllen commented 3 years ago

Regarding the missing ageDeath for 5 individuals:

Updates:

@amapeters, can you please:

amapeters commented 3 years ago

@amapeters, can you please:

  • check that I didn't forget to deprecate something that should have been deprecated?

Do we need this file: https://www.synapse.org/#!Synapse:syn9782771

  • look at this file and let me know if this should be considered dataType = 'metadata'? It seems reasonable, but it shows up in the metadata section on the portal and I am not sure that's desired.

We have been annotating dictionaries as metadata, which I think is appropriate. What I suggest adding is resourceType = analysis and analysisType = quality control. That way it will show up with queries for the analysis file it belongs to.

  • check out this file. It's metadata for mice in this analysis. I completely missed this file. Should this be transferred into the metadata, as well?

No, let's not do that. Let me know instead if I can help updating the wiki for the Human mouse comparative analysis so that it doesn't link to Synapse and makes it clear where the mouse data comes from (ie, this https://www.synapse.org/#!Synapse:syn8650958)

Aryllen commented 3 years ago

Made a few updates based on Mette's suggestions.

Closing since this is done.