Closed Aryllen closed 3 years ago
Let's add the following to the list of things to check
Clinical data
- [x] Braak and CERAD is available on donors with postmortem tissue
- [ ] Permission to use Braak and CERAD to generate Dx (AD, NCI, Other) for data contributor
Access (Human)
- [ ] Genomic summary results are in 'Analysis' folder which does not have access control
Portal
- [ ] Review content on the study card for accuracy
- [ ] Review text formatting and 'Show More' section: ### for header, bold for sub-headers, Show More section broken up in a consistent manner on the card
- [ ] Related studies are linked
- [ ] Study has an acknowledgement statement (wikis here)
- [ ] MODEL-AD data specific: There is a link on the experimental tool card(s) to the study
Edited by Nicole: moved the quoted items to the main issue so that it counted it in the progress bar.
To do specifics
Metadata that needs to be updated to current schema:
- [ ] Biospecimen - this file may need to be updated with specimens from WGS and Proteomics. Review across all assays
- [x] RNAseq - syn6100548 needs to be converted to an RNAseq metadata file. Note the remap information. This is based on a sample identity QC in which they remapped some samples and excluded others. Meagan reviewed this, but we should double check that the biospecimen file maps the samples to the correct remapped individual
- [x] WES - syn6101472. Same issue as for RNAseq
- [x] WGS - syn11384608. Same issue as for RNAseq
- [x] label free Proteomics - syn6100412
Edited by Nicole: moved the quoted items to the main issue so that it counted it in the progress bar.
New methylationArray metadata file uploaded to staging. Added missing column arrayBatch and rearranged columns to match template order. This also has Sample_Name column, which I believe is the ID used in the methylation data, but is mapped to specimenIDs. Will need to verify and perhaps make a note of this in the methods.
Update: The Sample_Name column is what is used in the methylation data. Users will need to get the specimenID by cross-reference with Sample_Name. If I remember correctly, I think the Sample_Names were projids, which is why we had them create a mapping to new specimenIDs in the first place.
@amapeters, @karawoo, what route will we take on these cases? We could change the multi-specimen file to use the specimenID or we could make a note in the methods section (or some other visible location) that the IDs in the data refer to Sample_Name.
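For reference, the cross-reference a user would need under the "make a note" option is just a lookup from Sample_Name to specimenID. A minimal sketch, assuming the methylationArray metadata is a CSV with columns named Sample_Name and specimenID; the values and the `probe_id` label below are made up for illustration and are not from the real files:

```python
import csv
import io


def build_sample_name_map(metadata_csv_text):
    """Return {Sample_Name: specimenID} from methylation metadata CSV text."""
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    mapping = {}
    for row in reader:
        name = row["Sample_Name"]
        # Refuse to continue if one Sample_Name points at two specimenIDs.
        if name in mapping and mapping[name] != row["specimenID"]:
            raise ValueError(f"Sample_Name {name!r} maps to more than one specimenID")
        mapping[name] = row["specimenID"]
    return mapping


def relabel_columns(data_header, mapping):
    """Swap Sample_Name labels in a data file header for specimenIDs.

    Labels with no entry in the mapping are left untouched.
    """
    return [mapping.get(col, col) for col in data_header]


# Hypothetical example values.
example_metadata = "specimenID,Sample_Name,arrayBatch\nspec_A,1001,1\nspec_B,1002,1\n"
mapping = build_sample_name_map(example_metadata)
new_header = relabel_columns(["probe_id", "1001", "1002"], mapping)
```

The same mapping would serve either route: applied once by us to rewrite the multi-specimen file, or left to users if we only document it in the methods.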
@amapeters, I'm looking at the rnaSeq covariates file. I'm trying to understand the different variables mentioned on the wiki and figure out where this data should "live." A lot of it looks like it should be in the individual metadata file (have not checked, yet) and the biospecimen metadata file. However, I'm confused as to what should be done with the individualIdentifier and individualIdentifier.inferred. One appears to be the individualID that the sample was supposed to be from and the other appears to be the individualID that analysis of the assay data says it is actually from.
Which of these IDs is the one that should be associated with the specimenID in the biospecimen metadata?
Ignore this. Redid the label free mass spec proteomics and comments can be found in new reply to this.
Label free mass spec proteomics (new metadata file here):
- CaseID was in the biospecimen metadata (with the exception of CaseID 0) so I made CaseID = specimenID
- 18 specimens were in batches 7 and 8. I currently have 'batch' as comma separated for these since there is no other batch information.
- This file has some rows that were in the proteomics covariates file, but have no information other than a RunName (?), batch number, and a CaseID of 0. Not sure what we should do with this data. CaseID 0 is not a specimenID in the biospecimen metadata.
- platform and assay were taken from the assay description.
- Data that should have been in the individual metadata file was matched by individualIdentifier and checked for consistency. There is data in the covariates file that is not in the individual file, however (bbscore, PlaqueMean, NP1).
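The comma-separated batch value mentioned above can be produced by collapsing repeated specimen rows. A rough sketch, assuming the input is a sequence of (specimenID, batch) pairs; the IDs in the example are hypothetical:

```python
def collapse_batches(rows):
    """Collapse (specimenID, batch) pairs into {specimenID: "batch1,batch2"}.

    Batches keep the order in which they first appear; repeats are dropped.
    """
    batches = {}
    for specimen_id, batch in rows:
        seen = batches.setdefault(specimen_id, [])
        if batch not in seen:
            seen.append(batch)
    return {sid: ",".join(values) for sid, values in batches.items()}


# Hypothetical specimen IDs; only the "batches 7 and 8" case comes from the notes.
collapsed = collapse_batches([("spec_1", "7"), ("spec_1", "8"), ("spec_2", "3")])
```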
Regarding individualID vs individualID.inferred -- I believe the inferred ones should be more reliable. If I recall correctly, these came from resequencing data and discovering that some samples had been mislabeled or otherwise mixed up (@amapeters might remember better).
Per discussion with Nicole, I will take on the WGS data first
Kara and I discussed some of the issues with the proteomics data. I will need to take the RunName, separate off the last section and compare that to the biospecimen specimenId and annotations. The CaseId 0 data does have files, but they are annotated with a portion of RunName as the specimenId.
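A sketch of that RunName comparison, for whoever picks this up. It assumes the specimen portion is the last three underscore-separated fields of the RunName, which matches the one raw file name quoted later in this thread (MSBB_Proteomics_PFC_RAW_b1_1497_21.raw) but should be verified against the real covariates file. The second RunName in the example is hypothetical:

```python
def specimen_id_from_run_name(run_name, n_fields=3):
    """Take the trailing underscore-separated fields of a RunName as a candidate specimenID.

    n_fields=3 is an assumption for illustration, not a confirmed rule.
    """
    stem = run_name[:-4] if run_name.lower().endswith(".raw") else run_name
    parts = stem.split("_")
    return "_".join(parts[-n_fields:])


def compare_to_biospecimen(run_names, biospecimen_ids):
    """Return (found, missing): candidate IDs present in / absent from the biospecimen metadata."""
    candidates = {specimen_id_from_run_name(name) for name in run_names}
    known = set(biospecimen_ids)
    return sorted(candidates & known), sorted(candidates - known)


found, missing = compare_to_biospecimen(
    ["MSBB_Proteomics_PFC_RAW_b1_1497_21", "MSBB_Proteomics_PFC_RAW_b1_bmgis_01"],
    ["b1_1497_21"],
)
```

Anything that lands in `missing` is a candidate for adding to the biospecimen metadata or for a follow-up question to the contributor.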
Added WGS file here: https://www.synapse.org/#!Synapse:syn22360825
The original WGS covariates file (https://www.synapse.org/#!Synapse:syn11384608) has the sampleIdentifier column which I remapped to specimenID. Other than that, none of the original columns seemed relevant to the assay metadata (they were all individual- or biospecimen-level information). I added platform and assay based on the assay description.
Mette and I discussed this today.
She requested that we focus on WES, WGS, and RNAseq. This data was generated by Bin Zhang at Sinai (Ming Wui is data liaison). Make sure that the biospecimen metadata maps to the individualIdentifier.inferred. Double check that the covariates data matches the data in the individual/biospecimen metadata.
RNA Seq metadata
New rnaSeq metadata file uploaded here.
Notes:
- batch as sequencingBatch since there were duplicates. Not sure if this was the right choice between rnaBatch, libraryBatch, and sequencingBatch.
- libraryPreparationMethod is TruSeq based on assay description
- platform is HiSeq2500 based on assay description
- runType is singleEnd based on assay description
- libraryPrep is rRNAdepletion based on assay description
- readLength is 100 based on assay description
- Extra columns moved to metadata (outside the scope of the template requirements):
The specimenIDs in the biospecimen metadata appear to be matched with the individualIdentifier.inferred value with the exception of the following specimenIDs. The corresponding individualIdentifier.inferred values are "." in the covariates file, but the individualID listed is the individualIdentifier. Generally, this means the Action is "exclude" for this specimen's rnaSeq data.
BM_22_245_H154B394, BM_22_93_S113B355, hB_RNA_10432_K77C014, hB_RNA_11012, hB_RNA_12302, hB_RNA_12392_E007C014, hB_RNA_12744_L43C014, hB_RNA_13039_B82C014, hB_RNA_13320_P60C014, hB_RNA_13373, hB_RNA_13609_P60C014, hB_RNA_4782_L43C014, hB_RNA_4991, hB_RNA_5001, hB_RNA_7995_E007C014, hB_RNA_8255, hB_RNA_8475, hB_RNA_8515_K85C014, hB_RNA_8525_K85C014, hB_RNA_8855, hB_RNA_9140_K75rC014, hB_RNA_9190_E007C014, hB_RNA_9208_resequenced, hB_RNA_9226_K82C014
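The check behind the list above can be sketched as follows. It treats ".", "NA", and empty strings as a missing inferred value and falls back to individualIdentifier in those cases. The column name sampleIdentifier for the covariates file is an assumption borrowed from the WGS covariates, and all IDs in the example are hypothetical:

```python
MISSING = {"", ".", "NA"}


def expected_individual(row):
    """Prefer individualIdentifier.inferred; fall back to individualIdentifier if it is missing."""
    inferred = row.get("individualIdentifier.inferred", "").strip()
    if inferred in MISSING:
        return row["individualIdentifier"].strip()
    return inferred


def find_mismatches(covariate_rows, biospecimen_rows):
    """Return (specimenID, expected, actual) for specimens whose biospecimen individualID differs.

    `actual` is None when the specimen is absent from the biospecimen metadata.
    """
    bio = {r["specimenID"]: r["individualID"] for r in biospecimen_rows}
    mismatches = []
    for row in covariate_rows:
        specimen_id = row["sampleIdentifier"]
        want = expected_individual(row)
        have = bio.get(specimen_id)
        if have != want:
            mismatches.append((specimen_id, want, have))
    return mismatches


# Hypothetical rows for illustration only.
covariates = [
    {"sampleIdentifier": "s1", "individualIdentifier": "A", "individualIdentifier.inferred": "B"},
    {"sampleIdentifier": "s2", "individualIdentifier": "C", "individualIdentifier.inferred": "."},
    {"sampleIdentifier": "s3", "individualIdentifier": "D", "individualIdentifier.inferred": "D"},
]
biospecimen = [
    {"specimenID": "s1", "individualID": "B"},
    {"specimenID": "s2", "individualID": "C"},
    {"specimenID": "s3", "individualID": "X"},
]
problems = find_mismatches(covariates, biospecimen)
```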
Gene counts
Raw and normalized count file concerns:
@amapeters, what do you suggest for these count file issues?
We also have both bam and fastq files for the raw data. I renamed the folder for now since it just said BAM.
WES metadata
New wes metadata file uploaded here.
Notes:
- sampleIdentifier used as specimenID
- assay is exomeSeq
- platform is HiSeq2500 according to description
- runType is pairedEnd according to description
- readLength is 125 according to description
- Extra columns moved to metadata (outside the scope of the template requirements):
The specimenIDs appear to be matched with the individualIdentifier.inferred as the individualID in biospecimen metadata. There are 4 exceptions where the inferred value is NA, in which case the individualID in biospecimen metadata is the individualIdentifier. These are:
BM_22_837, BM_22_912, BM_22_941, BM_22_956
There are two specimenIDs that are missing from the biospecimen metadata: hB_DNA_12775, BM_22_985. These have the Action of Exclude in the WES covariates file. I added these to the biospecimen metadata file. New biospecimen metadata file is here. Please use and create a new version of this new file for all future cleaning updates.
WES multispecimen files use the specimenIDs listed in the assay.
WGS multispecimen vcf files (checked chromosome 21) appear to use the specimenIDs in the assay metadata.
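For repeating the VCF check on other chromosomes, a small sketch. It relies on the standard VCF layout, where the `#CHROM` header line has nine fixed columns followed by one column per sample. The function names and the metadata ID list in the example are illustrative only:

```python
VCF_FIXED_COLUMNS = 9  # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def vcf_sample_names(lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\n").split("\t")[VCF_FIXED_COLUMNS:]
        if not line.startswith("#"):
            break  # reached data rows without finding the header
    raise ValueError("no #CHROM header line found")


def compare_ids(vcf_samples, assay_specimen_ids):
    """Return (only_in_vcf, only_in_metadata) as sorted lists."""
    vcf, assay = set(vcf_samples), set(assay_specimen_ids)
    return sorted(vcf - assay), sorted(assay - vcf)


# Tiny made-up VCF header; a real file would be opened (e.g. with gzip.open in text mode).
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t71729\t71823\n",
]
samples = vcf_sample_names(example_lines)
only_in_vcf, only_in_metadata = compare_ids(samples, ["71729", "71843"])
```

Only the header needs to be read, so this stays cheap even for large multispecimen files.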
The following WGS specimens have "unknown" as the individualID in the biospecimen metadata.
71729, 71823, 71843, 71962, 76354, 76655
All of these except for 76354 have individualIdentifier values in the WGS covariates file. Updated the individualIDs in the biospecimen metadata file and uploaded new version. 76354 will have to stay as "unknown" for now.
Label free proteomics metadata
Metadata file here.
Notes:
- specimenID taken from portion of the RunName
- assay and platform were taken from assay description
- CaseID and batch since I am not sure if these are relevant or not

Most specimens were not in the biospecimen metadata. The missing samples were added and a new biospecimen metadata file uploaded. For these, the organ and tissue were gathered from the annotations on the proteomics data. The individualIDs were all in the individual metadata.
The following specimens in the biospecimen metadata file have "unknown" as the individualID and there is no individualIdentifier in the covariates file.
b1_bmgis_01, b1_bmgis_22, b1_bmgis_43, b2_bmgis_01, b2_bmgis_22, b2_bmgis_43, b3_bmgis_01, b3_bmgis_22, b3_bmgis_43, b4_bmgis_01, b4_bmgis_22, b4_bmgis_43, b5_bmgis_01, b5_bmgis_22, b5_bmgis_43, b6_bmgis_01, b6_bmgis_22, b6_bmgis_43, b7_bmgis_01, b7_bmgis_22, b7_bmgis_43, r2b7_bmgis_01
The proteomics covariates file has three columns of individual metadata that are not included in the actual individual metadata file: bbscore, NP1, and PlaqueMean. I have asked @amapeters if we should deprecate this information when we deprecate the covariates file or add extra columns to the individual metadata, even though we do not have this data for all patients.
The multispecimen file uses the form "Peptides specimenID". Do we want to remove the excess "Peptides " bit from these, @amapeters?
@karawoo, do you know how to open the TMT proteomics files (syn21347564) to see what they are using for identifying the specimens? The pepxml files are not opening for me (tried pepXMLtab from bioconductor so far), and I am not sure how to read the raw files at all. I tried googling for a second, but figured I would ask you before I go too deep in the weeds.
I do not :(
individual metadata
Removed row with individualID = unknown. There was no other information in that row. Accidentally pushed new version to the original file. 🤦♀️ Also uploaded to the staging cleaning folder in case it needs future updating. Will pull/push to the staged version, instead.
biospecimen metadata
Reminder to me: All individuals have ageDeath, but the biospecimen metadata is blank in the isPostMortem column. Should add 'True'.
TMT Proteomics annotations
- resourceType = analysis. I changed to experimentalData.
- species updated to Human
- fileformat for pepxml files set as xml
- isConsortiumAnalysis = False
Label free mass spec
- analysisType
- isConsortiumAnalysis = False
RNA seq annotations
- isConsortiumAnalysis = False
- tissue, diagnosis, BrodmannArea
- libraryPrep, runType, readLength, dataSubtype
WES annotations
raw data
- NA for now (not sure how to easily remove annotation keys from all files)
- platform, dataSubtype, isMultiSpecimen, organ, grant, tissue, diagnosis updated
- isConsortiumAnalysis, runType, readLength

processed data
- analysisType
- dataSubType, organ, platform, resourceType, runType, readLength, tissue, isConsortiumAnalysis
WGS annotations
- isMultiSpecimen, tissue
- dataType 'experimentalData', and analysisType 'NA'. Changed to this although I feel like this needs more discussion.
- isConsortiumAnalysis should be since this shows joint and individual "analysis" on the folder names. Leaving off since we are questionable about this annotation anyway.
- assay for label free mass spec proteomics to label free mass spectrometry.
- resourceType to tool for LFMS proteomics search helper files.
- controlType in TMT quantitation assay file to have GIS for "Unknown" SampleID and NA for all others. New version here: syn22912257

Sent email to Bin and Minghui, with changes requested from Mette and slight reformatting.
Hi Bin and Minghui,
As you know, we have been working on getting older studies up to date on the new standards that we have for data in the AD Knowledge Portal. In general, this means ensuring the metadata matches our metadata templates, and checking that the specimen identifiers (specimenID) in files link back to the metadata.
Since MSBB is one of our most valuable datasets, we want to make sure it is usable for as many people as possible and can easily be harmonized with the other portal data. To complete this, we have a few questions.
We noticed that the RNA sequencing counts files have a couple issues. The header appears to have been shifted left, which makes the first specimen name the header for the ensemble IDs. The specimenIDs also appear to be prepended with batch information, which is already in the metadata files. Would it be acceptable to you if we: shifted the header to the right by 1 (this would align the specimen names with the count columns), added 'ensemble id' as the first header value to signify what is in that column, and removed the prepended batch information from the specimen names (the specimenID would be leftover and could be matched to the batch information in the metadata files)?
We have taken the covariate files and created new metadata files from these. Information related to individuals has been moved to the individual metadata file, while information related to biospecimens and their respective assays has been moved to the biospecimen and assay metadata files. These files will replace the covariate files currently in the metadata folder. The covariate files will still be accessible, but we will not surface them in the AD Knowledge Portal. Can you please check the metadata files for data that your group generated and verify that they appear correct? The metadata files are in a staging folder, but Minghui should have permissions to view them.
Along with checking the metadata files, please note that we have kept the information about the QC remapping in the biospecimen file. Since the remapped specimens are now linked to the correct individual, can we remove the remapping information? We are a bit concerned that it will be confusing. We suggest leaving the 'Exclude' information in, but would like to add text explaining why you recommend they should be excluded. Can we state the following? "A specimen QC identified samples that could not be mapped to the expected individual. These have been indicated as 'Exclude' in the biospecimen metadata file".
Best,
Minghui's response:
Nicole, Great your team can reformat the data files.
- Regarding the RNA sequencing counts files, it is totally fine if you shift the header and add in a new column id. Batch information can be discarded from the sample id as long as such information is kept in the associated metadata file.
- I will look at the metadata file when I have a chance.
- Fine with me.
Minghui
Biospecimen file has been updated to remove all "Action"/remapping values except for "Exclude" and individualIDs for the excluded samples have been removed. Moved the metadata wiki info to the staging folder since it might be confusing if people stumbled on that before we updated things. Changed note about the "Action" column to be the information about some being excluded as Mette suggested.
RNA seq count files have been updated and uploaded to a staging folder (syn22988362):
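For the record, the header fix agreed on in the email thread can be sketched roughly as below. This is not necessarily how the staged files were produced. It assumes tab-delimited counts files, and because the format of the batch prefix is not recorded here, it strips the prefix by matching each column label against the known specimenIDs from the metadata rather than guessing at a separator. The `B1.` and `B2.` prefixes and the count values in the example are made up:

```python
def strip_batch_prefix(label, specimen_ids):
    """Return the longest known specimenID that `label` ends with.

    Labels that match no known specimenID are returned unchanged so they can be
    reviewed by hand.
    """
    matches = [sid for sid in specimen_ids if label.endswith(sid)]
    return max(matches, key=len) if matches else label


def fix_counts_header(lines, specimen_ids, id_column="ensemble id", sep="\t"):
    """Return a copy of `lines` with a realigned, cleaned header row."""
    if len(lines) < 2:
        raise ValueError("need a header row and at least one data row")
    header = lines[0].rstrip("\n").split(sep)
    first_row = lines[1].rstrip("\n").split(sep)
    # A header one field shorter than the data rows is the "shifted left" case:
    # the gene ID column has no name, so insert one at the front.
    if len(header) == len(first_row) - 1:
        header = [id_column] + header
    cleaned = [header[0]] + [strip_batch_prefix(h, specimen_ids) for h in header[1:]]
    return [sep.join(cleaned) + "\n"] + list(lines[1:])


# Made-up example: hypothetical batch prefixes on two specimenIDs from the metadata.
raw = ["B1.hB_RNA_4991\tB2.hB_RNA_5001\n", "ENSG00000000003\t10\t20\n"]
fixed = fix_counts_header(raw, {"hB_RNA_4991", "hB_RNA_5001"})
```

Matching against the metadata IDs also doubles as a check: any column label that comes back unchanged has no corresponding specimenID and needs a closer look.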
Proteomics metadata
TMT quantitation metadata
Updated study in portal studies table.
Wiki layouts
WGS:
Annotations
I have already checked the annotations, but we have since updated the requirements to prepare for creating a schema. We should use the schema (when it's ready) to audit these annotations again.
Update on MulticonsensusStudyFileID: This definition is tied to the use of a specific program. They did not use it and do not have this information. This column should be empty in the TMT metadata.
Email sent to Duc to check over the proteomics metadata.
Have not heard a 'go-ahead' from Minghui regarding the rest of the metadata, yet.
INPP5D (in the staging folder) is a different study that is from a grant we support. Need to move it into its own study folder. @amapeters to follow up with Minghui on when this should be published. Will need to verify that it fits our model (metadata, descriptions, folder structure, etc). Created issue in AD-DCC to track (https://github.com/Sage-Bionetworks/AD-DCC/issues/692). Update: moved out of MSBB staging.
Minghui sent corrected biospecimen and RNAseq metadata files. The corrections in the RNAseq metadata are things that I wouldn't know how to verify (sequencingBatch, RIN, etc). It seems like Minghui ended up shuffling the rows. The biospecimen metadata is a problem in that Minghui removed 290 rows, all of which were for proteomics data except for 1. Again, it looks like there was also a shuffling of rows. I am not sure what exactly was changed in each row, yet, but did respond to Minghui asking for a reason why he removed the data.
Update: Minghui claims that these were duplicates, but they are specimens used by the proteomics data. I said I would take a closer look, but will most likely need to tell Minghui, 'thanks for the update, but we are keeping them'.
I reviewed the MSBB biospecimen metadata file: https://www.synapse.org/#!Synapse:syn22453847
Hi Duc,
We are doing some cleanup of our high value studies, including MSBB. We noticed that the sample identifier in the protein output file is flipped compared to the identifier on the .raw files (which we use as the specimen ID). For example, MSBB_Proteomics_PFC_RAW_b1_1497_21.raw ('b1_1497_21') is labeled as 'b1_21_1497' in the protein output file. Do you mind if we update the protein output file with the raw file ids? We will provide a new version (ie, not delete the current file) and provide an explanation in our release notes.
Best, Mette
Hi Mette,
Thanks for doing these cleanup steps. It really helps as we are sometimes blinded to these discrepancies. Please go ahead with the name change.
Best, Duc
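The relabeling described in the email amounts to swapping the last two underscore-separated fields of each label. A cautious sketch that only renames a label when the flipped form matches an ID derived from a .raw file name, and reports everything else for manual review. The function names are illustrative, and the second label in the example is hypothetical:

```python
def flip_last_two_fields(label, sep="_"):
    """Swap the last two fields of a label, e.g. 'b1_21_1497' becomes 'b1_1497_21'."""
    parts = label.split(sep)
    if len(parts) < 3:
        return label
    parts[-2], parts[-1] = parts[-1], parts[-2]
    return sep.join(parts)


def relabel_to_raw_ids(output_labels, raw_ids):
    """Map protein-output labels onto raw-file IDs.

    Returns (renamed, unresolved): a dict of old label to raw-file ID, and a list
    of labels that match no raw-file ID either as-is or flipped.
    """
    raw = set(raw_ids)
    renamed, unresolved = {}, []
    for label in output_labels:
        flipped = flip_last_two_fields(label)
        if label in raw:
            renamed[label] = label  # already uses the raw-file ID
        elif flipped in raw:
            renamed[label] = flipped
        else:
            unresolved.append(label)
    return renamed, unresolved


renamed, unresolved = relabel_to_raw_ids(["b1_21_1497", "b1_99_0000"], ["b1_1497_21"])
```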
Proteomics
Biospecimen
Ready to deprecate old covariate files and move metadata to public.
@amapeters, I checked and the specimenID in question IS in the assay metadata. However, it exactly matches the information for hB_RNA_10892_K77C014. I will ask Minghui if the data should be annotated with the longer specimenID (the one in the biospecimen metadata).
There was a discussion thread where Minghui clarified the issue. I updated the names of the files and the annotations to match the batch information, as specified by Minghui.
This is 'technically' finished. Final touches would be to:
Made a few updates based on Mette's feedback.
Closing because this is now done.
Study folder: syn3159438
We can expand these checks to be more specific, or mark them off/remove them if they are not relevant.
Folder Structure
Metadata (within file) Checks for each metadata file:
Metadata (across files)
Annotations
Multispecimen Files Check that specimenIDs in files match IDs in metadata
- [ ] Proteomics - TMT (all data is multi specimen): syn21347564
  Leave as is
Wikis
Clinical data
Access (Human)
Portal
- [ ] MODEL-AD data specific: There is a link on the experimental tool card(s) to the study
  Not a MODEL-AD study
Metadata specifics