IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.
https://umgear.org
GNU Affero General Public License v3.0

Identify existing columns in gEAR metadata to which we can apply ontologies #216

Open mgiglio99 opened 2 years ago

mgiglio99 commented 2 years ago

1 - instrument_model can be captured with OBI device terms
Tasks:
--find existing terms for devices
--identify devices for which there are no OBI terms
--for datasets without platform_ids, search for device information

2 - library_strategy captures things like: RNAseq, single-cell RNAseq, scRNAseq, microarray
However, similar info is also captured in dtype

3 - dtype appears to correspond to dataset_type in the metadata_template
has values: svg-expression, single-cell-rnaseq, bulk-rnaseq, microarray, epiviz
Deprecated values are: bargraph-standard, image-static, image-static-standard, linegraph-standard, violin-standard
As mentioned above, these overlap with info in library_strategy
Questions:
--what exactly is meant to be captured in each of these fields?
--Should we just leave the deprecated values for dtype as is? What will happen to those datasets/rows?

Tasks:
--identify EDAM data type, file format terms relevant for above
--identify OBI assay terms relevant for above once the above question is answered
--apply EDAM data type, file format and OBI assay terms as appropriate for each above field

4 - library_selection and 5 - library_source
Question: What are these meant to capture and how do they relate to library_strategy and dtype/dataset_type?

adkinsrs commented 2 years ago

Going to reopen this for now, so that @jorvis and I can assess. Unless @jaluvathingal has already completed this.

mgiglio99 commented 2 years ago

I think Jain likely closed this by mistake. Work on this is ongoing.

jaluvathingal commented 2 years ago

I’m sorry I closed it by mistake. Thanks Michelle.


mgiglio99 commented 2 years ago

Discussion on 3/11

2/3 - library strategy is a GEO field. Based on what GEO has, it appears to be an assay-level description - they have a list of terms that is incomplete (no scRNAseq, no bulk RNAseq). We can explore with them getting more things added.
AI - contact them.
Pages on submitting data to GEO:
https://www.ncbi.nlm.nih.gov/geo/info/seq.html
https://www.ncbi.nlm.nih.gov/geo/info/soft-seq.html
AI - map existing GEO terms to OBI assay terms
AI - see if there are MIxS standards for any of these fields in SRA or GEO

dataset_type tells the website which data is appropriate for each part of the site (e.g. scRNAseq data to the single cell portal). It's a gEAR-specific thing. There are four values in the drop-down - anything in the current dump that is not from this list should be a deprecated dataset.
AI - Jain check on this.
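To make the check on the current dump concrete, here is a minimal sketch that sorts dataset rows by whether their dtype is current, deprecated, or neither. The two value sets are copied from the lists earlier in this issue and are assumptions until checked against the actual drop-down; the row layout (dicts with hypothetical `id` and `dtype` keys) is also assumed, not taken from the gEAR schema.

```python
from collections import defaultdict

# Values copied from the lists earlier in this issue; treat them as
# assumptions until verified against the drop-down in the uploader.
CURRENT_DTYPES = {
    "svg-expression", "single-cell-rnaseq", "bulk-rnaseq", "microarray", "epiviz",
}
DEPRECATED_DTYPES = {
    "bargraph-standard", "image-static", "image-static-standard",
    "linegraph-standard", "violin-standard",
}


def classify_dtypes(rows):
    """Group dataset ids by whether their dtype is current, deprecated, or unknown.

    `rows` is an iterable of dicts with hypothetical "id" and "dtype" keys.
    """
    groups = defaultdict(list)
    for row in rows:
        # Missing or empty dtype values fall through to "unknown".
        dtype = (row.get("dtype") or "").strip().lower()
        if dtype in CURRENT_DTYPES:
            groups["current"].append(row["id"])
        elif dtype in DEPRECATED_DTYPES:
            groups["deprecated"].append(row["id"])
        else:
            groups["unknown"].append(row["id"])
    return dict(groups)
```

Anything landing in "unknown" would be a value that appears in neither list and probably needs a manual look.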

4/5 - these fields are not used by gEAR but are autofilled by GEO - they can stay as is, but are not used by gEAR.

Useful to keep instrument_model
AI - Add a field for the OBI device term.

There is a library construction field in GEO but not in the gEAR template. That info is wanted. Can add a field for gEAR metadata that is perhaps library_construction - but with a controlled set of terms rather than free text as GEO has.
AI - talk with Luke about what kits/methods are used for library construction to build this list. Look for any vocabularies that already capture this.

mgiglio99 commented 2 years ago

List of technologies in gEAR as of 3/11

Affymetrix Clariom S Assay
Affymetrix GeneChipH MOE 430 2.0 arrays
Affymetrix Mouse Gene 2.1 ST Array
Affymetrix Mouse Gene ST 1.0 arrays
Agilent-026655 Whole Mouse 4x44K v2
Agilent-030493 SurePrint G3 Mouse Exon 4x180K
Custom Affymetrix Mouse Exon Junction Array
Fluidigm C1
GeneChip Mouse Genome 430A2.0 Array
HiSeq 2000
HiSeq 2500
HiSeq 4000
HiSeq4000
HuGene-2_0-st
illumina
Illumina Hi-Seq 2000
Illumina HiSeq 1000
Illumina HiSeq 1000,Illumina HiSeq 1500
Illumina HiSeq 1500
Illumina HiSeq 1500, Illumina NextSeq 500
Illumina HiSeq 2000
Illumina HiSeq 2000
Illumina HiSeq 2500
Illumina HiSeq 2500
Illumina HiSeq 3000
Illumina Hiseq 4000
Illumina HiSeq2500
Illumina HiSeq4000
Illumina MouseRef-8 v2.0 beadchip
Illumina NextSeq
Illumina NextSeq 500
Illumina NextSeq 500
Illumina NextSeq500
Illumina novaseq
Illumina NovaSeq 6000
Illumina NovaSeq 6000
Illumina-Hiseq2500
MoGene-2_1-st
Mouse Genome 430 2.0 Affymetrix gene chips
NextSeq 500/550 High Output flow cell
NextSeq 550
NovaSeq 6000
NovaSeq6000
RNAscope HiPlex
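Many of the entries above look like spelling variants of the same instrument (for example "HiSeq 4000", "HiSeq4000" and "Illumina Hiseq 4000"). A rough sketch of how they could be grouped before requesting or assigning OBI device terms follows. The normalization rules, including treating a leading "Illumina" as optional, are assumptions for illustration and not an agreed gEAR convention.

```python
import re
from collections import defaultdict


def normalize_instrument(name):
    """Collapse case, punctuation, and spacing so spelling variants share one key."""
    key = re.sub(r"[^a-z0-9]+", "", name.strip().lower())
    # Assumption: the vendor prefix is optional for Illumina sequencers,
    # so "Illumina HiSeq 2500" and "HiSeq 2500" should group together.
    if key.startswith("illumina") and key != "illumina":
        key = key[len("illumina"):]
    return key


def group_variants(names):
    """Map each normalized key to the sorted set of raw spellings seen for it."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize_instrument(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items()}
```

This does not handle entries that name two instruments at once (e.g. "Illumina HiSeq 1000,Illumina HiSeq 1500"), and a bare "illumina" stays as its own group; those would still need manual curation.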

mgiglio99 commented 2 years ago

Discussion on 4/1

We have to walk the line between encouraging data submission by making it easy, but also keeping the data consistent and structured. All of this needs to be able to scale across gEAR implementations.

For Jain/Michelle/Joe
now:
-Jain will incorporate her work so far into a draft revision to the metadata template with regard to new fields and CVs for existing fields and pass on to Michelle. Idea is to get new data submissions started on a new path of FAIRness asap. Once ready to share with whole group, create issue and post.
-Jain will contact GEO about additions to their CVs and how often the CVs change
-Jain will finalize OBI device term requests and pass on to Suvvi/Michelle
-Ask Ronna about recording of Jayram's review of their recent work with regard to cell types, etc.
soon:
-Jain will work on the update tables to give to Joshua for use in updating the existing gEAR metadata
-Joe will walk Jain and Michelle through all of the gEAR functions and how they relate to dataset metadata and data columns within the observation files
later:
-we will tackle the multiple cans of worms associated with the columns of data in the observations files
-should add a field to capture normalization_method

For Joshua:
-eventually, make it so that submitters input the NCBI taxon id and the upload software pulls the taxon name from NCBI's taxonomy, AND/OR when a GEO id is given, the NCBI taxon id is pulled from GEO
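As a rough illustration of the taxon lookup, the sketch below asks NCBI E-utilities (`esummary`, `db=taxonomy`) for a taxon id and reads the scientific name from the JSON reply. The function names are hypothetical, and the response layout assumed here (`result` -> taxon id -> `scientificname`) should be checked against the live service before relying on it. The `fetch` parameter is injectable so the parsing can be tested without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def fetch_json(url):
    """Download a URL and decode the body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def taxon_name(taxon_id, fetch=fetch_json):
    """Return the scientific name for an NCBI taxon id.

    Assumes the esummary JSON reply nests the record under
    payload["result"][<taxon id as string>]["scientificname"].
    """
    query = urlencode({"db": "taxonomy", "id": str(taxon_id), "retmode": "json"})
    payload = fetch(f"{EUTILS_ESUMMARY}?{query}")
    record = payload.get("result", {}).get(str(taxon_id), {})
    name = record.get("scientificname")
    if not name:
        raise LookupError(f"No scientific name found for NCBI taxon id {taxon_id}")
    return name
```

With this in place the template would only need the taxon id from the submitter, and the name column could be filled in at upload time. Pulling the taxon id from a GEO accession is a separate step not covered here.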

jaluvathingal commented 2 years ago

While collecting more information on the public datasets already in gEAR, I found some mismatches between the gEAR website/dataset dump and GEO/PubMed, mainly in the instruments used, organisms, data types, etc.

Attaching the spreadsheet with some mismatches noted. Comments added in the last column (AI).

GEAR_public_dataset_corrections.xlsx
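For future rounds of this comparison, a small sketch of how the mismatches could be flagged automatically is below. It assumes each dataset is available as two dicts, one from the gEAR dump and one from GEO; `instrument_model` and `dtype` are field names from this issue, while `organism` and the function name are hypothetical.

```python
def find_mismatches(gear_record, geo_record,
                    fields=("instrument_model", "organism", "dtype")):
    """Return (field, gear_value, geo_value) for every field that disagrees.

    Comparison ignores surrounding whitespace and letter case; missing
    values are treated as empty strings.
    """
    mismatches = []
    for field in fields:
        gear_value = (gear_record.get(field) or "").strip()
        geo_value = (geo_record.get(field) or "").strip()
        if gear_value.casefold() != geo_value.casefold():
            mismatches.append((field, gear_value, geo_value))
    return mismatches
```

This only catches exact disagreements after trimming and case folding; spelling variants of the same instrument would still show up as mismatches unless the values are normalized first.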

jaluvathingal commented 2 years ago

Suggested new terms/fields to add to the user template, to discuss during today's meeting:

GEAR_user_template_new_terms.xlsx