MSBB - Githubissues

Aryllen commented 4 years ago

We can expand these checks to be more specific, or mark them off/remove them if they are not relevant.

Folder Structure

[x] Top 3 folders are Analysis, Data, Staging
[x] Top level folders within Data are based on DataType with the exception of Metadata
[x] Metadata folder is within Data
[x] Staging folder is clean (old data in a subfolder -- Archived)
[x] Old covariate "metadata" files are deprecated (after new metadata files are added to replace them)
[x] Move INPP5D data out of MSBB

Metadata (within file)

Checks for each metadata file:

- file exists
- file name follows schema
- contents follow current template - deprecate old versions, if needed
- no duplicate individualID/specimenID as appropriate
- follows data dictionary

[x] RNA seq
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] ATAC seq
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] Methylation
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] WES
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] WGS
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] Label free mass spec -- NOTE: see comments below regarding this data since there are potentially issues
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] TMT proteomics
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all specimenIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] individual
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate individualID
- [x] all individualIDs in biospecimen metadata
- [x] values follow data dictionary guidelines
[x] biospecimen
- [x] metadata file exists and follows naming schema
- [x] column names follow current template
- [x] no duplicate specimenID
- [x] all individualIDs in individual metadata
- [x] values follow data dictionary guidelines
- [x] add Action from assay files
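
As a rough illustration of the two ID checks in the lists above (no duplicate IDs, and every assay specimenID present in the biospecimen metadata), here is a minimal sketch using only the Python standard library. The helper names and the tiny tables are made up for the example; only the column names `specimenID` and `individualID` come from this issue. It is not the tooling actually used for the cleanup.

```python
import csv
import io
from collections import Counter


def duplicate_ids(rows, id_column):
    """Return the IDs that appear more than once in id_column, sorted."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def missing_ids(rows, id_column, reference_rows, reference_column=None):
    """Return the IDs in rows that are absent from the reference table, sorted."""
    reference_column = reference_column or id_column
    known = {row[reference_column] for row in reference_rows}
    return sorted({row[id_column] for row in rows} - known)


# Made-up tables standing in for an assay metadata file and the biospecimen file.
assay_csv = "specimenID,platform\nS1,HiSeq2500\nS2,HiSeq2500\nS2,HiSeq2500\nS9,HiSeq2500\n"
biospecimen_csv = "individualID,specimenID\nI1,S1\nI2,S2\n"

assay_rows = list(csv.DictReader(io.StringIO(assay_csv)))
biospecimen_rows = list(csv.DictReader(io.StringIO(biospecimen_csv)))

print(duplicate_ids(assay_rows, "specimenID"))                  # duplicated specimenIDs
print(missing_ids(assay_rows, "specimenID", biospecimen_rows))  # not in biospecimen
```

The same `missing_ids` helper could be pointed at `individualID` to compare the biospecimen file against the individual file.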

Metadata (across files)

[x] no duplicate individualID/specimenID for different individuals/specimens

Annotations

[x] WGS
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] WES
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] RNAseq
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] Label free proteomics
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] TMT proteomics
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] Methylation
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] ATACseq
- [x] match metadata information
- [x] are complete - remove unnecessary annotations, if needed
- [x] follow data dictionary
[x] archived staging data has the study name removed

Multispecimen Files

Check that specimenIDs in files match IDs in metadata

[x] Methylation array (all data is multi specimen): syn21447661
- Uses Sample_Name in data, which is then mapped to a specimenID in the methylation metadata
[x] Gene expression raw counts: syn7391749
[x] Gene expression normalized: syn7391749
[x] WES processed: syn7538026
[x] WES analyzed (vcf): syn7538027
[x] WGS (vcf): syn10901600
[x] Proteomics-label free: syn6100414
~~- [ ] Proteomics - TMT (all data is multi specimen): syn21347564~~ Leave as is

Wikis

[x] appear up to date
[x] are in correct location (on dataType folder)
[x] are referenced in portal Study table

Clinical data

[x] Braak and CERAD are available on donors with postmortem tissue
[x] Permission to use Braak and CERAD to generate Dx (AD, NCI, Other) for data contributor

Access (Human)

[x] Genomic summary results are in 'Analysis' folder which does not have access control

Portal

[x] Review content on the study card for accuracy
[x] Review text formatting and 'Show More' section: ### for header, bold for sub-headers, Show More section broken up in a consistent manner on the card
[x] Related studies are linked
[x] Study has an acknowledgement statement (wikis here)
~~- [ ] MODEL-AD data specific: There is a link on the experimental tool card(s) to the study~~ Not a MODEL-AD study
[x] Metadata wiki added to study table

Metadata specifics

[x] Biospecimen - this file may need to be updated with specimens from WGS and Proteomics. Review across all assays
[x] RNAseq - syn6100548 needs to be converted to a RNAseq metadata file. Note the remap information. This is based on a sample identity QC they did remapping and excluding some samples. Meagan reviewed this, but we should double check that the biospecimen file maps the samples to the correct remapped individual
[x] WES - syn6101472. Same issue as for RNAseq
[x] WGS - syn11384608. Same issue as for RNAseq
[x] label free Proteomics - syn6100412

amapeters commented 4 years ago

Let's add the following to the list of things to check

Clinical data

[x] Braak and CERAD are available on donors with postmortem tissue

[ ] Permission to use Braak and CERAD to generate Dx (AD, NCI, Other) for data contributor

Access (Human)

[ ] Genomic summary results are in 'Analysis' folder which does not have access control

Portal

[ ] Review content on the study card for accuracy

[ ] Review text formatting and 'Show More' section: ### for header, bold for sub-headers, Show More section broken up in a consistent manner on the card

[ ] Related studies are linked

[ ] Study has an acknowledgement statement (wikis here)

[ ] MODEL-AD data specific: There is a link on the experimental tool card(s) to the study

Edited by Nicole: moved the quoted items to the main issue so that they count toward the progress bar.

amapeters commented 4 years ago

To do specifics

Metadata that needs to be updated to current schema

[ ] Biospecimen - this file may need to be updated with specimens from WGS and Proteomics. Review across all assays

[x] RNAseq - syn6100548 needs to be converted to a RNAseq metadata file. Note the remap information. This is based on a sample identity QC they did remapping and excluding some samples. Meagan reviewed this, but we should double check that the biospecimen file maps the samples to the correct remapped individual

[x] WES - syn6101472. Same issue as for RNAseq

[x] WGS - syn11384608. Same issue as for RNAseq

[x] label free Proteomics - syn6100412

Edited by Nicole: moved the quoted items to the main issue so that they count toward the progress bar.

Aryllen commented 4 years ago

New methylationArray metadata file uploaded to staging. Added missing column arrayBatch and rearranged columns to match template order. This also has Sample_Name column, which I believe is the ID used in the methylation data, but is mapped to specimenIDs. Will need to verify and perhaps make a note of this in the methods.

Update: The Sample_Name column is what is used in the methylation data. Users will need to get the specimenID by cross-reference with Sample_Name. If I remember correctly, I think the Sample_Names were projids, which is why we had them create a mapping to new specimenIDs in the first place.
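
For anyone who needs to do that cross-reference, a minimal sketch might look like the following. It assumes the methylation metadata has been read into rows with `Sample_Name` and `specimenID` columns (both named in this thread); the function names and the example values are hypothetical.

```python
import csv
import io


def sample_name_to_specimen_id(metadata_rows):
    """Build a Sample_Name -> specimenID lookup from methylation metadata rows."""
    lookup = {}
    for row in metadata_rows:
        name = row["Sample_Name"]
        if name in lookup and lookup[name] != row["specimenID"]:
            raise ValueError(f"Sample_Name {name!r} maps to more than one specimenID")
        lookup[name] = row["specimenID"]
    return lookup


def relabel_columns(header, lookup):
    """Replace Sample_Name column labels with specimenIDs; leave other labels alone."""
    return [lookup.get(label, label) for label in header]


# Made-up metadata; the real file has more columns (e.g. arrayBatch).
metadata_csv = "specimenID,Sample_Name,arrayBatch\nSPEC_A,1001,1\nSPEC_B,1002,1\n"
rows = list(csv.DictReader(io.StringIO(metadata_csv)))
lookup = sample_name_to_specimen_id(rows)

print(relabel_columns(["probe", "1001", "1002"], lookup))
```

The lookup raises on conflicting mappings so that an ambiguous Sample_Name is noticed rather than silently overwritten.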

@amapeters, @karawoo, what route will we take on these cases? We could change the multi-specimen file to use the specimenID or we could make a note in the methods section (or some other visible location) that the IDs in the data refer to Sample_Name.

Aryllen commented 4 years ago

@amapeters, I'm looking at the rnaSeq covariates file. I'm trying to understand the different variables mentioned on the wiki and figure out where this data should "live." A lot of it looks like it should be in the individual metadata file (have not checked yet) and the biospecimen metadata file. However, I'm confused as to what should be done with individualIdentifier and individualIdentifier.inferred. One appears to be the individualID that the sample should have been from, and the other appears to be the individualID that analysis of the assay data says it is from.

Which of these IDs is the one that should be associated with the specimenID in the biospecimen metadata?

Aryllen commented 4 years ago

Ignore this. Redid the label free mass spec proteomics and comments can be found in new reply to this.

~~Label free mass spec proteomics (new metadata file here):~~

~~- CaseID was in the biospecimen metadata (with the exception of CaseID 0) so I made CaseID = specimenID~~

~~- 18 specimens were in batches 7 and 8. I currently have 'batch' as comma separated for these since there is no other batch information.~~

- This file has some rows that were in the proteomics covariates file, but have no information other than a RunName (?), batch number, and a CaseID of 0. Not sure what we should do with this data. CaseID 0 is not a specimenID in the biospecimen metadata.

~~- platform and assay were taken from the assay description.~~

- Data that should have been in the individual metadata file was matched by individualIdentifier and checked for consistency. There is data in the covariates file that is not in the individual file, however (bbscore, PlaqueMean, NP1).

karawoo commented 4 years ago

Regarding individualID vs individualID.inferred -- I believe the inferred ones should be more reliable. If I recall correctly, these came from resequencing data and discovering that some samples had been mislabeled or otherwise mixed up (@amapeters might remember better).

karawoo commented 4 years ago

Per discussion with Nicole, I will take on the WGS data first

Aryllen commented 4 years ago

Kara and I discussed some of the issues with the proteomics data. I will need to take the RunName, separate off the last section and compare that to the biospecimen specimenId and annotations. The CaseId 0 data does have files, but they are annotated with a portion of RunName as the specimenId.

karawoo commented 4 years ago

Added WGS file here: https://www.synapse.org/#!Synapse:syn22360825

The original WGS covariates file (https://www.synapse.org/#!Synapse:syn11384608) has the sampleIdentifier column which I remapped to specimenID. Other than that, none of the original columns seemed relevant to the assay metadata (they were all individual- or biospecimen-level information). I added platform and assay based on the assay description.

Aryllen commented 4 years ago

Mette and I discussed this today.

She requested that we focus on WES, WGS, and RNAseq. This data was generated by Bin Zhang at Sinai (Ming Wui is data liaison). Make sure that the biospecimen metadata maps to the individualIdentifier.inferred. Double check that the covariates data matches the data in the individual/biospecimen metadata.

Aryllen commented 4 years ago

RNA Seq metadata

New rnaSeq metadata file uploaded here.

Notes:

used batch as sequencingBatch since there were duplicates. Not sure if this was the right choice between rnaBatch, libraryBatch, and sequencingBatch.
libraryPreparationMethod is TruSeq based on assay description
platform is HiSeq2500 based on assay description
runType is singleEnd based on assay description
libraryPrep is rRNAdepletion based on assay description
readLength is 100 based on assay description

Extra columns moved to metadata (outside the scope of the template requirements):

barcode
totalReads
mapped
rRNA.rate
Action

The specimenIDs in the biospecimen metadata appear to be matched with the individualIdentifier.inferred value with the exception of the following specimenIDs. The corresponding individualIdentifier.inferred values are "." in the covariates file, but the individualID listed is the individualIdentifier. Generally, this means the Action is "exclude" for this specimen's rnaSeq data.

BM_22_245_H154B394, BM_22_93_S113B355, hB_RNA_10432_K77C014, hB_RNA_11012, hB_RNA_12302, hB_RNA_12392_E007C014, hB_RNA_12744_L43C014, hB_RNA_13039_B82C014, hB_RNA_13320_P60C014, hB_RNA_13373, hB_RNA_13609_P60C014, hB_RNA_4782_L43C014, hB_RNA_4991, hB_RNA_5001, hB_RNA_7995_E007C014, hB_RNA_8255, hB_RNA_8475, hB_RNA_8515_K85C014, hB_RNA_8525_K85C014, hB_RNA_8855, hB_RNA_9140_K75rC014, hB_RNA_9190_E007C014, hB_RNA_9208_resequenced, hB_RNA_9226_K82C014

Gene counts

Raw and normalized count file concerns:

specimenIDs are used, but have what appears to be the batch appended to the front. An example is "S109B355.BM_10_791". BM_10_791 is the specimenID and it does appear in the biospecimen and new rnaSeq metadata files.
The first column is the gene identifier and the rest of the columns are the counts per specimen. However, the first column header is a specimenID (in a format similar to the above). This throws off the columns, making them labeled wrong.

@amapeters, what do you suggest for these count file issues?

We also have both bam and fastq files for the raw data. I renamed the folder for now since it just said BAM.

Aryllen commented 4 years ago

WES metadata

New wes metadata file uploaded here.

Notes:

sampleIdentifier used as specimenID
assay is exomeSeq
platform is HiSeq2500 according to description
runType is pairedEnd according to description
readLength is 125 according to description

Extra columns moved to metadata (outside the scope of the template requirements):

barcode
Action

The specimenIDs appear to be matched with the individualIdentifier.inferred as the individualID in biospecimen metadata. There are 4 exceptions where the inferred value is NA, in which case the individualID in biospecimen metadata is the individualIdentifier. These are:

BM_22_837, BM_22_912, BM_22_941, BM_22_956
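
The mapping rule described above (use individualIdentifier.inferred, falling back to individualIdentifier when the inferred value is missing) can be sketched as below. This is only an illustration: it assumes the covariates rows carry `sampleIdentifier`, `individualIdentifier`, and `individualIdentifier.inferred`, treats `NA`, `.`, and an empty string as missing, and uses made-up IDs. The function names are hypothetical.

```python
MISSING = {"", ".", "NA"}


def expected_individual_id(row):
    """Prefer individualIdentifier.inferred; fall back to individualIdentifier."""
    inferred = (row.get("individualIdentifier.inferred") or "").strip()
    if inferred in MISSING:
        return row["individualIdentifier"]
    return inferred


def mismatched_specimens(covariate_rows, biospecimen_individual):
    """List sampleIdentifier values whose biospecimen individualID is unexpected."""
    return [
        row["sampleIdentifier"]
        for row in covariate_rows
        if biospecimen_individual.get(row["sampleIdentifier"]) != expected_individual_id(row)
    ]


# Made-up rows; only the column names come from the thread.
covariates = [
    {"sampleIdentifier": "SPEC_1", "individualIdentifier": "IND_A",
     "individualIdentifier.inferred": "IND_B"},
    {"sampleIdentifier": "SPEC_2", "individualIdentifier": "IND_C",
     "individualIdentifier.inferred": "NA"},
    {"sampleIdentifier": "SPEC_3", "individualIdentifier": "IND_D",
     "individualIdentifier.inferred": "."},
]
# specimenID -> individualID as recorded in the biospecimen metadata (made up).
biospecimen_individual = {"SPEC_1": "IND_B", "SPEC_2": "IND_C", "SPEC_3": "IND_X"}

print(mismatched_specimens(covariates, biospecimen_individual))
```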

There are two specimenIDs that are missing from the biospecimen metadata: hB_DNA_12775, BM_22_985. These have the Action of Exclude in the WES covariates file. I added these to the biospecimen metadata file. New biospecimen metadata file is here. Please use and create a new version of this new file for all future cleaning updates.

WES multispecimen files use the specimenIDs listed in the assay.

Aryllen commented 4 years ago

WGS multispecimen vcf files (checked chromosome 21) appear to use the specimenIDs in the assay metadata.

The following WGS specimens have "unknown" as the individualID in the biospecimen metadata.

71729, 71823, 71843, 71962, 76354, 76655

All of these except for 76354 have individualIdentifier values in the WGS covariates file. Updated the individualIDs in the biospecimen metadata file and uploaded new version. 76354 will have to stay as "unknown" for now.

Aryllen commented 4 years ago

Label free proteomics metadata

Metadata file here.

Notes:

specimenID taken from portion of the RunName
assay and platform were taken from assay description
left in CaseID and batch since I am not sure if these are relevant or not

Most specimens were not in the biospecimen metadata. The missing samples were added and a new biospecimen metadata file uploaded. For these, the organ and tissue were gathered from the annotations on the proteomics data. The individualIDs were all in the individual metadata.

The following specimens in the biospecimen metadata file have "unknown" as the individualID and there is no individualIdentifier in the covariates file.

b1_bmgis_01, b1_bmgis_22, b1_bmgis_43, b2_bmgis_01, b2_bmgis_22, b2_bmgis_43, b3_bmgis_01, b3_bmgis_22, b3_bmgis_43, b4_bmgis_01, b4_bmgis_22, b4_bmgis_43, b5_bmgis_01, b5_bmgis_22, b5_bmgis_43, b6_bmgis_01, b6_bmgis_22, b6_bmgis_43, b7_bmgis_01, b7_bmgis_22, b7_bmgis_43, r2b7_bmgis_01

The proteomics covariates file has three columns of individual metadata that are not included in the actual individual metadata file: bbscore, NP1, and PlaqueMean. I have asked @amapeters if we should deprecate this information when we deprecate the covariates file or add extra columns to the individual metadata, even though we do not have this data for all patients.

The multispecimen file uses the form "Peptides specimenID". Do we want to remove the excess "Peptides " bit from these, @amapeters?

Aryllen commented 4 years ago

@karawoo, do you know how to open the TMT proteomics files (syn21347564) to see what they are using for identifying the specimens? The pepxml files are not opening for me (tried pepXMLtab from bioconductor so far), and I am not sure how to read the raw files at all. I tried googling for a second, but figured I would ask you before I go too deep in the weeds.

karawoo commented 4 years ago

I do not :(

Aryllen commented 4 years ago

individual metadata

Removed row with individualID = unknown. There was no other information in that row. Accidentally pushed new version to the original file. 🤦‍♀️ Also uploaded to the staging cleaning folder in case it needs future updating. Will pull/push to the staged version, instead.

Aryllen commented 4 years ago

biospecimen metadata

Reminder to me: All individuals have ageDeath, but the biospecimen metadata is blank in the isPostMortem column. Should add 'True'.

Aryllen commented 4 years ago

TMT Proteomics annotations

The search files were annotated resourceType = analysis. I changed to experimentalData.
species updated to Human
fileformat for pepxml files set as xml
added isConsortiumAnalysis = False

Label free mass spec

Removed analysisType
Updated 'analysis' folder name to 'processed'
There are 23 specimenIDs in the annotations for which the individualID is missing in annotations and is 'Unknown' in the biospecimen file.
Updated missing diagnosis
added isConsortiumAnalysis = False

Aryllen commented 4 years ago

RNA seq annotations

added isConsortiumAnalysis = False
updated missing tissue, diagnosis, BrodmannArea
added libraryPrep, runType, readLength, dataSubtype

Aryllen commented 4 years ago

WES annotations

raw data

extra annotations (center, disease, fileType, modelSystem, organism) set to NA for now (not sure how to easily remove annotation keys from all files)
platform, dataSubtype, isMultiSpecimen, organ, grant, tissue, diagnosis updated
added isConsortiumAnalysis, runType, readLength

processed data

removed analysisType
updated/added dataSubType, organ, platform, resourceType, runType, readLength, tissue, isConsortiumAnalysis

WGS annotations

updated isMultiSpecimen, tissue
using our paradigm, variant calling files would be dataType 'experimentalData', and analysisType 'NA'. Changed to this although I feel like this needs more discussion.
not sure what isConsortiumAnalysis should be since this shows joint and individual "analysis" on the folder names. Leaving it off since we are unsure about this annotation anyway.

Aryllen commented 4 years ago

Updated assay for label free mass spec proteomics to label free mass spectrometry.
Updated resourceType to tool for LFMS proteomics search helper files.
Updated controlType in TMT quantitation assay file to have GIS for "Unknown" SampleID and NA for all others. New version here: syn22912257
Added rnaSeqSampleSwap study to the related studies.

Aryllen commented 4 years ago

Changed BrodmannArea to integers. Note that this is defined in the main synapseAnnotations repo. Updating to only allow integers would need an update to synapseAnnotations.
Added missing BrodmannArea and tissue based on whether we had one of the values. Note: can we truly assume the tissue is the same BrodmannArea? There are many BrodmannAreas in a given tissue, but it does seem like, at least for the ones we had data for, the area was always the same for the tissue.
Moved Action column from WES and rnaSeq metadata to biospecimen metadata. Uploaded new versions of all three.
Added notes in metadata wiki about Action column plus metadata in general.
Added checkmark to main issue section for adding metadata wiki to methods. This should only be done after the metadata has been confirmed/approved and moved out of staging.
Updated WGS and WES metadata columns to match fixed templates.

Aryllen commented 4 years ago

Sent email to Bin and Minghui, with changes requested from Mette and slight reformatting.

Hi Bin and Minghui,

As you know, we have been working on getting older studies up to date on the new standards that we have for data in the AD Knowledge Portal. In general, this means ensuring the metadata matches our metadata templates, and checking that the specimen identifiers (specimenID) in files link back to the metadata.

Since MSBB is one of our most valuable datasets, we want to make sure it is usable for as many people as possible and can easily be harmonized with the other portal data. To complete this, we have a few questions.

We noticed that the RNA sequencing counts files have a couple of issues. The header appears to have been shifted left, which makes the first specimen name the header for the Ensembl IDs. The specimenIDs also appear to be prepended with batch information, which is already in the metadata files. Would it be acceptable to you if we: shifted the header to the right by 1 (this would align the specimen names with the count columns), added 'Ensembl ID' as the first header value to signify what is in that column, and removed the prepended batch information from the specimen names (the specimenID would be left over and could be matched to the batch information in the metadata files)?

We have taken the covariate files and created new metadata files from these. Information related to individuals has been moved to the individual metadata file, while information related to biospecimens and their respective assays has been moved to the biospecimen and assay metadata files. These files will replace the covariate files currently in the metadata folder. The covariate files will still be accessible, but we will not surface them in the AD Knowledge Portal. Can you please check the metadata files for data that your group generated and verify that they appear correct? The metadata files are in a staging folder, but Minghui should have permissions to view them.

Along with checking the metadata files, please note that we have kept the information about the QC remapping in the biospecimen file. Since the remapped specimens are now linked to the correct individual, can we remove the remapping information? We are a bit concerned that it will be confusing. We suggest leaving the 'Exclude' information in, but would like to add text explaining why you recommend they should be excluded. Can we state the following? "A specimen QC identified samples that could not be mapped to the expected individual. These have been indicated as 'Exclude' in the biospecimen metadata file".

Best,

Minghui's response:

Nicole, Great your team can reformat the data files.

Regarding the RNA sequencing counts files, it is totally fine if you shift the header and add in a new column id. Batch information can be discarded from the sample id as long as such information is kept in the associated metadata file.

I will look at the metadata file when I have a chance.

Fine with me.

Minghui

Biospecimen file has been updated to remove all "Action"/remapping values except for "Exclude" and individualIDs for the excluded samples have been removed. Moved the metadata wiki info to the staging folder since it might be confusing if people stumbled on that before we updated things. Changed note about the "Action" column to be the information about some being excluded as Mette suggested.

Aryllen commented 4 years ago

RNA seq count files have been updated and uploaded to a staging folder (syn22988362):

shifted header to the right by 1
added 'Ensembl ID' as first column name
removed batch information that was prepended to the specimenID
checked that batch information was in the rnaSeq assay metadata and all specimenIDs were in the biospecimen metadata
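
The header fix listed above can be sketched roughly as follows. This assumes a tab-separated file whose header has exactly one fewer field than the data rows (one way a "shifted left" header can arise) and that the batch prefix is everything before the first period in a label such as "S109B355.BM_10_791". Both are assumptions about the file layout, not a description of the script actually used. `BM_10_792` and the counts are made up.

```python
def strip_batch_prefix(label):
    """'S109B355.BM_10_791' -> 'BM_10_791'; labels without a period are unchanged."""
    return label.split(".", 1)[1] if "." in label else label


def fix_counts_table(lines, sep="\t", id_label="Ensembl ID"):
    """Shift the header right by one, name the first column, drop batch prefixes."""
    header = lines[0].split(sep)
    first_row = lines[1].split(sep)
    if len(header) != len(first_row) - 1:
        raise ValueError("header is not short by exactly one column")
    new_header = [id_label] + [strip_batch_prefix(label) for label in header]
    return [sep.join(new_header)] + list(lines[1:])


# Made-up two-specimen table with the shifted header.
raw_lines = [
    "S109B355.BM_10_791\tS109B355.BM_10_792",
    "ENSG00000000003\t10\t12",
]

for line in fix_counts_table(raw_lines):
    print(line)
```

Checking the field counts first means a file that has already been fixed raises an error instead of gaining a second ID column.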

Aryllen commented 4 years ago

Proteomics metadata

changed gis header to controlType
put value of GIS for all specimens with 'gis' in the specimenID
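
A minimal sketch of that rule, for illustration only: the match is a plain case-sensitive substring test, and leaving non-GIS specimens as `None` is an assumption (the thread does not say what value they get in this file). `control_type` is a hypothetical helper name.

```python
def control_type(specimen_id):
    """Return 'GIS' when the specimenID contains 'gis', otherwise None."""
    return "GIS" if "gis" in specimen_id else None


for specimen_id in ["b1_bmgis_01", "r2b7_bmgis_01", "b1_1497_21"]:
    print(specimen_id, control_type(specimen_id))
```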

TMT quantitation metadata

removed SampleID column
removed values from MulticonsensusStudyFileID since they do not follow the definition of the key (will ask Minghui if they have that information)

Aryllen commented 4 years ago

Updated study in portal studies table.

Question: Grant R01AG050986 is not in the projects table. This is the grant for the ATACseq data. Should we add it to the table? Answer: leave it; don't put it in the table.
Question: Data contributor on study card. Should this only be MSSM? Answer: add Emory.

Wiki layouts

Analysis:

The network section for coexpression looked like a wall of text and the notices looked out of place. I moved the notices to the end of the paragraphs and changed 'Notice' to 'Note'. This is still a wall of text, but it looks a bit better. This format matches other wiki formats, otherwise I would have added some extra space.
Added Bayesian network to main methods page. Made it look better with a link to the paper and removed an old, broken link.
Question: Is there a reason why we have all the analysis methods on one wiki? The data is in separate folders so we could put the wikis on the respective folders and add them to the study table. I put them on the wikis already. Just need to figure out folder name structure and if that's what we want to do.

WGS:

Question: I tried to fix this a little to make more sense, but it still seems odd. Should we leave it as it is now?

Proteomics:

Question: Not wiki-related, but why is the folder in here called 'MSBB (link to release)'? What would be a better name for this?

TMT Quantitation:

Updated references to a bullet list so it looked nicer.

Annotations

I have already checked the annotations, but we have since updated the requirements to prepare for creating a schema. We should use the schema (when it's ready) to audit these annotations again.

Aryllen commented 4 years ago

Update on MulticonsensusStudyFileID: This definition is tied to the use of a specific program. They did not use it and do not have this information. This column should be empty in the TMT metadata.

Aryllen commented 4 years ago

Email sent to Duc to check over the proteomics metadata.

Have not heard a 'go-ahead' from Minghui regarding the rest of the metadata, yet.

Aryllen commented 4 years ago

INPP5D (in the staging folder) is a different study that is from a grant we support. Need to move it into its own study folder. @amapeters to follow up with Minghui on when this should be published. Will need to verify that it fits our model (metadata, descriptions, folder structure, etc). Created issue in AD-DCC to track (https://github.com/Sage-Bionetworks/AD-DCC/issues/692). Update: moved out of MSBB staging.

Aryllen commented 4 years ago

Minghui sent corrected biospecimen and RNAseq metadata files. The corrections in the RNAseq metadata are things that I wouldn't know how to verify (sequencingBatch, RIN, etc). It seems like Minghui ended up shuffling the rows. The biospecimen metadata is a problem in that Minghui removed 290 rows, all of which were for proteomics data except for 1. Again, it looks like there was also a shuffling of rows. I am not sure what exactly was changed in each row, yet, but did respond to Minghui asking for a reason why he removed the data.

Update: Minghui claims that these were duplicates, but it's specimens used by the proteomics data. I said I would take a closer look, but will most likely need to tell Minghui, 'thanks for the update, but we are keeping them'.

amapeters commented 4 years ago

I reviewed the MSBB biospecimen metadata file: https://www.synapse.org/#!Synapse:syn22453847

Add methylationArray under notes for these specimens "KWEK894", "PCZI872" etc
I see what Minghui is referring to regarding the label free mass spec. It is duplicated. This is because the annotations are not the full biospecimen IDs. For example, this file https://www.synapse.org/#!Synapse:syn6038904 has biospecimenID = 1497. The assay metadata has 'b1_1497_21' (which is in the filename), but the processed data (https://www.synapse.org/#!Synapse:syn6100414) reverses it and adds 'Peptides' to the name: 'Peptides b1_21_1497'.
Here is my suggestion: Change the annotations to this format 'b1_1497_21' and update the processed data to the same.
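
To make the suggested relabeling concrete, here is a small sketch that converts a processed-data label such as 'Peptides b1_21_1497' into the assay-metadata form 'b1_1497_21'. The pattern is inferred from that single example, so the sketch only accepts labels of the shape batch_number_number and raises on anything else (GIS controls or reruns may be formatted differently). The function name is hypothetical.

```python
import re

# Optional 'Peptides ' prefix, then batch, then two numeric fields.
_PLAIN_LABEL = re.compile(r"^(?:Peptides )?(b\d+)_(\d+)_(\d+)$")


def to_raw_file_id(label):
    """'Peptides b1_21_1497' -> 'b1_1497_21' (swap the last two fields)."""
    match = _PLAIN_LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized label format: {label!r}")
    batch, second, third = match.groups()
    return f"{batch}_{third}_{second}"


print(to_raw_file_id("Peptides b1_21_1497"))
```

Raising on unrecognized labels keeps odd cases visible for manual review instead of guessing at a swap.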

amapeters commented 4 years ago

Hi Duc,

We are doing some cleanup of our high value studies, including MSBB. We noticed that the sample identifier in the protein output file is flipped compared to the identifier on the .raw files (which we use as the specimen ID). For example, MSBB_Proteomics_PFC_RAW_b1_1497_21.raw ('b1_1497_21') is labeled as 'b1_21_1497' in the protein output file. Do you mind if we update the protein output file with the raw file ids? We will provide a new version (ie, not delete the current file) and provide an explanation in our release notes.

Best, Mette

amapeters commented 4 years ago

Hi Mette,

Thanks for doing these cleanup steps. It really helps as we are sometimes blinded to these discrepancies. Please go ahead with the name change.

Best, Duc

Aryllen commented 4 years ago

Proteomics

Updated specimenID annotations to be full specimenID instead of truncated
Updated specimenID in multispecimen file to have correct, full ID. New file in staging here. When approved, needs to be uploaded as new version of the actual file.
- Note: Some specimens appear to have multiple runs. This is reflected in the filename, where some have r2b####### and some have b#bmgis##_b. The column names in the multispecimen file have these parts mixed up, where r2b# becomes b#r2 and bmgis##b becomes ##bmgis_b. I have switched these around to more closely match the specimenIDs and filenames.

Biospecimen

Updated notes column in Minghui's corrected version to have 'methylationArray' for related specimenIDs
Uploaded Minghui's corrected version as new version of main file (i.e. not the one labeled "_corrected" in the name)
Checked that all specimenIDs in the annotations are in the biospecimen metadata file. Missing:
- Proteomics GIS controls. These are in the proteomics assay file, but not in the biospecimen. This could make sense to leave as is since the GIS's are indicated in the assay metadata.
- hB_RNA_10892, an rnaSeq specimen.
- Question: What should the annotations be on the files that have this specimenID? All rnaSeq files with specimenIDs associated with this one can be seen here. The alternative would be adding the specimenID to the biospecimen metadata.

Aryllen commented 4 years ago

Ready to deprecate old covariate files and move metadata to public.

Aryllen commented 4 years ago

@amapeters, I checked and the specimenID in question IS in the assay metadata. However, it exactly matches the information for hB_RNA_10892_K77C014. I will ask Minghui if the data should be annotated with the longer specimenID (the one in the biospecimen metadata).

Aryllen commented 4 years ago

There was a discussion thread where Minghui clarified the issue. I updated the names of the files and the annotations to match the batch information, as specified by Minghui.

Aryllen commented 4 years ago

Updated portal study table to have the analysis wikis separately under methods.
Updated metadata information to match the portal info, but with 'special considerations' for the MSBB metadata. Changed the wiki reference in the portal study table.
Covariate files in Metadata moved to deprecated Covariates folder.
Metadata files updated/created were either moved to the Metadata folder or uploaded as a new version of the current files.
Public release notes written.

This is 'technically' finished. Final touches would be to:

[x] sanity check annotations since we were in the process of working on the minimum annotation set while working on this
[x] write consortium release notes

Aryllen commented 3 years ago

Made a few updates based on Mette's feedback.

Closing because this is now done.

Sage-Bionetworks / cleanAD

MSBB #2